How does GTFS realtime data behave?

I'm trying to construct some sort of realtime API, and I've got everything I need, but now that it's actually come time to query the data, it seems wrong. I'm seriously hoping it's not, and that I'm just doing something stupid, but from what I'm looking at it just doesn't seem to work.
I have my static GTFS data which I can query for a certain stop to get the departure times for the routes at that stop. Then I can take the trip IDs associated with each one and query my realtime data.
Every 30 seconds I fetch realtime data from the GTFS-R feed, but most of the time the relevant trip IDs aren't in it. Am I right in saying that each fetch from the GTFS-R feed returns the full set of current realtime changes? Or should I be storing and updating the responses I get? I just can't figure out exactly what comes back in a GTFS-R response. That is, is it all realtime changes each time, so that if a bus is 5 minutes late that will come back every time I query the endpoint, or only once?
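For reference, this is roughly how I'm matching the static trip IDs against the feed. It's a minimal sketch assuming the Python gtfs-realtime-bindings package; the feed URL and trip IDs are placeholders, not my real values.

    # Fetch the GTFS-R trip updates feed and pull out delays for the trips
    # returned by the static GTFS query.
    import requests
    from google.transit import gtfs_realtime_pb2

    FEED_URL = "https://example.com/gtfs-realtime/trip-updates"  # placeholder endpoint

    def realtime_delays(trip_ids):
        """Return {trip_id: delay_seconds} for trips present in the feed."""
        feed = gtfs_realtime_pb2.FeedMessage()
        feed.ParseFromString(requests.get(FEED_URL, timeout=10).content)

        delays = {}
        for entity in feed.entity:
            if not entity.HasField("trip_update"):
                continue
            update = entity.trip_update
            if update.trip.trip_id in trip_ids and update.stop_time_update:
                # Take the first stop_time_update's departure delay as a rough estimate.
                delays[update.trip.trip_id] = update.stop_time_update[0].departure.delay
        return delays

    # Trips missing from the returned dict have no realtime prediction in this
    # fetch, so presumably the scheduled (static GTFS) time applies.
    scheduled_trips = {"trip_1234", "trip_5678"}  # placeholder IDs from the static query
    print(realtime_delays(scheduled_trips))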
Any help appreciated.

Related

Active Collab 5 Webhooks / Maintaining "metric" data

I have an application I am working on that basically takes the data from Active Collab and creates reports / graphs out of it. The API itself is insufficient to get the proper data on a per-request basis, so I resorted to pulling the data down into a separate data set that can be queried more efficiently.
So, in order to avoid querying the entire API constantly, I decided to make use of webhooks to apply the transformations to the relevant data and lower the need to resync it.
However, I notice not all events are sent, notably the following:
TaskListUpdated
MemberUpdated
TimeRecordUpdated
ProjectUpdated
There are probably more, but these are the main ones I have noticed so far.
Time records are probably the most important; the fact that they are missing from the webhooks means that almost any application that needs time record data has a good chance of ending up with incorrect data. It's fairly common to make a typo in a time record and then adjust it later.
So am I missing anything here? Is there some way to see these events reliably?
EDIT:
In order to avoid a long comment to Ilija I am putting the bulk here.
"Webhooks apart, what information do you need to pull? API that powers time tracking reports can do all sorts of cross project filtering, so your approach to keep a separate database may be an overkill."
Basically we are doing a multi-variable tiered time report. It can be sorted / grouped by any conceivable method you may want to look at.
http://www.appsmagnet.com/product/time-reports-plus/
This is the closest to what we are trying to do. Back when we used Active Collab 4 this did the job, but even with it we had to consolidate the data in our own spreadsheets.
So the idea of this is to better integrate our Active Collab data into our own workflow.
So the main data we are looking for in this case is
Job Types
Projects
Task Lists
Tasks
Time Records
Categories
Members / Clients
Companies
These items can feed not only our reports, but many other aspects of our company as well. For us Active Collab is the point of truth, so we want the data quickly accessible and fully query-able.
So I have set up a sync system that initially grabs all the data it can from Active Collab and then uses a mix of cron jobs and webhooks to keep it up to date.
Cron jobs work well for all the items that do not have "sub items". For the ones that do (projects / tasks / task lists / time records) I need to rely on the webhooks, since syncing them takes too much time to keep up to date in real time.
For the webhooks I noticed the events above do not come through. For time records I figured out a workaround, listed in my answer, and members can be handled through the cron. That leaves task list and project updates as the only two of real concern. Project is fairly important, as the budget can change and that is used in reports; task lists have the start / end dates that could be used as well. Since going through every project / task list constantly to see if anything has changed is really not a great idea, I am looking for a way to reliably see updates for them.
I have based this system on https://developers.activecollab.com/api-documentation/ but I know there are at least a few end points that are not listed.
Cross-project time-record filtering using Active Collab 5 API
This question is actually from another developer on the same system (and also shows a TrackingFilter report not listed in the docs). Due to issues with maintaining an accurate set of data we had to adapt it. I notice that you (Ilija) are the person who replied, and you did recommend we move over to this style of system.
This is not a total answer but a way to solve the issue with TimeRecordUpdated not going through the webhook.
There is another API endpoint, /whats-new. This endpoint describes changes for roughly the last day, and it has a category called TrackingObjectUpdatedActivityLog that refers to an updated time record.
So I set up a cron job to check this fairly frequently and manually push the TimeRecordUpdated event through my system to keep it consistent.
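Roughly what the cron job does (a sketch only; the base URL, auth header, and response field names are assumptions on my part, not documented Active Collab API facts):

    # Poll /whats-new, pick out TrackingObjectUpdatedActivityLog entries and
    # replay them as TimeRecordUpdated events in our own sync pipeline.
    import requests

    BASE_URL = "https://app.example.com/api/v1"      # hypothetical self-hosted instance
    HEADERS = {"X-Angie-AuthApiToken": "MY_TOKEN"}   # auth header name may differ

    def replay_time_record_updates(handle_event):
        response = requests.get(f"{BASE_URL}/whats-new", headers=HEADERS, timeout=30)
        response.raise_for_status()
        for item in response.json().get("activity_logs", []):   # key name assumed
            if item.get("type") == "TrackingObjectUpdatedActivityLog":
                # Manually push the event our webhook listener never receives.
                handle_event("TimeRecordUpdated", item.get("object_id"))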
For MemberUpdated, since an updated member is unlikely to affect much, a daily cron that checks the users seems good enough.
ProjectUpdated could technically be handled the same way, but combined with the absence of TaskListUpdated that leads to far too many API calls to keep the data in sync. I have not found a solution for this yet, unfortunately.

HTTP POST Transmission suggestion

I'm building a system which requires an Arduino board to send data to the server.
The requirements/constraints of the app are:
The server must receive data and store them in a MySQL database.
A web application is used to graph and plot historical data.
Data consumption is critical
Web application must also be able to plot data in real time.
So far the system is working fine; however, optimization is required.
The current adopted steps are:
Accumulate data in Arduino board for 10 seconds.
Send the data to the server using POST with data containing an XML string representing the 10 records.
The server parses the received XML and stores the values in the database.
This approach is good for historical data, but not for realtime monitoring.
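For reference, the server-side step looks roughly like this. It is only a sketch: the XML layout, table and column names here are illustrative, not my actual schema.

    # Parse a batch payload like <readings><r t="1466000000" v="23.5"/>...</readings>
    # and store the rows with a single multi-row INSERT per POST.
    import xml.etree.ElementTree as ET
    import mysql.connector

    def store_batch(xml_string):
        root = ET.fromstring(xml_string)
        rows = [(int(r.get("t")), float(r.get("v"))) for r in root.findall("r")]

        conn = mysql.connector.connect(host="localhost", user="logger",
                                       password="secret", database="telemetry")
        try:
            cur = conn.cursor()
            cur.executemany("INSERT INTO readings (ts, value) VALUES (%s, %s)", rows)
            conn.commit()
        finally:
            conn.close()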
My question is: is there a difference between:
Accumulating the data and sending them as one XML payload every 10 seconds, and
Sending the data every second?
In terms of data consumption, is sending a POST request each second too much?
Thanks
EDIT: Can anybody provide a mathematical formula comparing the two approaches in terms of data consumption?
For your data consumption question you need to figure out how much each POST costs you given your cell phone plan. I don't know if there is a mathematical formula, but you could easily test and work it out.
However, using 3G (or even WiFi, for that matter), power consumption will be an issue, especially if your circuit runs on a battery: each POST draws a burst of around 1.5 amps, and that's too much for sending data every second.
But again, why would you send data every second?
Real time doesn't mean sending data every second; it means being at least as fast as the system you are monitoring.
For example, if you are sending temperatures, temperature doesn't change from 0° to 100° in one second. So all those POSTs will be a waste of power and data.
You need to know how fast the parameters change in your system and adapt your POST accordingly.
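To give a rough feel for the numbers, here is a back-of-envelope comparison. The record size and per-POST overhead are assumed ballpark figures, not measurements; you would have to capture your own traffic to get real ones.

    # Compare daily data volume for one-record-per-POST vs ten-records-per-POST,
    # assuming one record is produced every second.
    RECORD_BYTES = 60        # assumed size of one record serialized as XML
    OVERHEAD_BYTES = 400     # assumed per-POST overhead (headers, framing, response)
    SECONDS_PER_DAY = 86_400

    def daily_bytes(records_per_post):
        posts_per_day = SECONDS_PER_DAY / records_per_post
        return posts_per_day * (OVERHEAD_BYTES + records_per_post * RECORD_BYTES)

    print(f"1 record/POST  : {daily_bytes(1) / 1e6:.1f} MB/day")
    print(f"10 records/POST: {daily_bytes(10) / 1e6:.1f} MB/day")
    # With these assumptions the batched scheme sends roughly
    # (400 + 60) / (40 + 60) ≈ 4-5x less data per record.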

Amazon API submitting requests too quickly

I am creating a games comparison website and would like to get Amazon prices included within it. The problem I am facing is using their API to get the prices for the 25,000 products I already have.
I am currently using ItemLookup from Amazon's API and have it working to retrieve the price; however, after about 10 results I get an error saying 'You are submitting requests too quickly. Please retry your requests at a slower rate'.
What is the best way to slow down the request rate?
Thanks,
If your application is trying to submit requests that exceed the maximum request limit for your account, you may receive error messages from Product Advertising API. The request limit for each account is calculated based on revenue performance. Each account used to access the Product Advertising API is allowed an initial usage limit of 1 request per second. Each account will receive an additional 1 request per second (up to a maximum of 10) for every $4,600 of shipped item revenue driven in a trailing 30-day period (about $0.11 per minute).
From Amazon API Docs
If you're just planning on running this once, then simply sleep for a second in between requests.
If this is something you're planning on running more frequently it'd probably be worth optimising it more by making sure that the length of time it takes the query to return is taken off that sleep (so, if my API query takes 200ms to come back, we only sleep for 800ms)
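Something like this (a sketch only; lookup_price stands in for your actual ItemLookup call):

    # Keep requests to at most one per second, subtracting however long the
    # API call itself took from the sleep.
    import time

    MIN_INTERVAL = 1.0  # seconds between requests; raise this if you still get throttled

    def throttled_lookups(asins, lookup_price):
        prices = {}
        for asin in asins:
            started = time.monotonic()
            prices[asin] = lookup_price(asin)       # your ItemLookup request
            elapsed = time.monotonic() - started
            if elapsed < MIN_INTERVAL:
                time.sleep(MIN_INTERVAL - elapsed)  # only sleep the remainder
        return prices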
Since it only happens after 10 results, you should check how many results you can actually get before being throttled. If it always appears after 10 fast requests you could use
wait(500)
or a few hundred milliseconds more. If it is only after every 10 requests, you could build a loop and do this on every 9th request.
If your requests involve a lot of repetition, you can create a cache and clear it every day.
Alternatively, contact AWS about purchase authorization.
I ran into the same problem even when I added a delay of 1 second or more.
I believe that when you start making too many requests with only a one-second delay, Amazon doesn't like it and flags you as a spammer.
You'll have to generate another key pair (and use it when making further requests) and put in a delay of 1.1 seconds to be able to make fast requests again.
This worked for me.

Storing elasticsearch query result in Django session

I am currently in a development team that has implemented a search app using Flask-WhooshAlchemy. Admittedly, we did not think this completely through.
The greatest problem we face is being unable to store query results in a Flask session without serializing the data set first. The '__QueryObject' returned by Whoosh can be JSON-serialized using Marshmallow. We have gone down this route and, yes, we are able to store and manipulate the retrieved data, but at a cost: initial searches take a very long time (at least 30 seconds for larger result sets, due to serialization). For now, we are stuck with re-querying any time there is a change to the result set, including changes that shouldn't require a fresh search, such as switching between result views and changing the number of results per page. Adding insult to injury, Whoosh is probably not scalable for our purposes; Elasticsearch seems a better contender.
In short:
How can we store elasticsearch query results in a Django session so that we may be able to manipulate these results?
Any other guidance will be greatly appreciated.
In case anyone cares: we finally got everything up and running, and yes, it is possible to store Elasticsearch query results in a Django session.
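Roughly what we ended up with, as a minimal sketch: it assumes a 7.x-style elasticsearch-py client and Django's default JSON-serializable session backend, and the index and field names are placeholders.

    # Run the search once, stash only the plain-dict hit sources in the session,
    # and let later views (paging, re-sorting, switching layouts) reuse them.
    from django.http import JsonResponse
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def search_view(request):
        query = request.GET.get("q", "")
        result = es.search(index="documents",
                           body={"query": {"match": {"text": query}}})

        hits = [hit["_source"] for hit in result["hits"]["hits"]]
        request.session["search_results"] = hits   # JSON-serializable, so this works

        return JsonResponse({"results": hits})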

Getting OVER_QUERY_LIMIT when seeding production database. (Too many requests?)

I am using gmaps4rails gem to mark locations on a map.
When I seed my heroku app's database, I get
#<Gmaps4rails::GeocodeStatus: The address you passed seems invalid, status was: OVER_QUERY_LIMIT.
for a number of the records that should be included in the database.
Thus, a good number of records end up being excluded. I'm pretty sure the reason is that there are too many requests being sent at one time.
Once in a while I get really lucky and there aren't any of these errors, but most of the time there are quite a few, and I end up with a map showing only around half of the locations I want marked.
Any idea how I can perhaps put in some type of delay after seeding a certain number of records to prevent too many requests from being sent at one time?
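Something along these lines is what I have in mind. My stack is Ruby/Rails with gmaps4rails, but the pacing pattern is the same; the sketch below is in Python purely for illustration, and geocode_and_save stands in for whatever creates each record.

    # Pause after every small batch so the geocoder's per-second quota isn't exceeded.
    import time

    BATCH_SIZE = 10      # records per burst (assumed; tune against the quota)
    PAUSE_SECONDS = 2    # pause between bursts (assumed)

    def seed_locations(records, geocode_and_save):
        for index, record in enumerate(records, start=1):
            geocode_and_save(record)           # triggers one geocoding request
            if index % BATCH_SIZE == 0:
                time.sleep(PAUSE_SECONDS)      # let the OVER_QUERY_LIMIT window reset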