SQL queries for getting information from GTFS files in Java - MySQL

I'm working on a project for school which uses a GTFS database (MySQL).
I wrote some code that parses the GTFS files and inserts them into a MySQL DB (each file becomes a table in my DB).
I'm trying to write two SQL queries:
Given a stationId, time, and line number - I want to get all trips that pass through this station in the next 10 minutes.
Given a tripId, directionId and stopId - I want to get all the remaining stations of this trip (in order to draw the upcoming stations on a map).
Does anyone know how I can state these SQL queries in Java?
I tried this:
SELECT *
FROM stop_times
JOIN stops ON stops.stop_id = stop_times.stop_id
JOIN trips ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
JOIN calendar ON calendar.service_id = trips.service_id
WHERE departure_time > '08:24:00'
  AND departure_time < '16:40:00'
  AND route_short_name = '10'
  AND stops.stop_id = 29335
  AND calendar.sunday = 1

I solved exactly this problem for GTFS data in Belgium. The code is available on GitHub:
https://github.com/iRail/MIVBSTIBResource/blob/master/MIVBSTIBStopTimesDao.php
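
For the second query (the remaining stations of a trip), a minimal sketch against the standard GTFS schema orders on stop_times.stop_sequence; the trip id '1234' and the current stop's sequence number 5 below are placeholders:

-- All stops of the trip that come after the current stop, in travel order.
SELECT stops.stop_id, stops.stop_name, stop_times.stop_sequence
FROM stop_times
JOIN stops ON stops.stop_id = stop_times.stop_id
WHERE stop_times.trip_id = '1234'
  AND stop_times.stop_sequence > 5
ORDER BY stop_times.stop_sequence;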

How can I pull data from my database using the Django ORM that annotates values for each day?

I have a Django app that is attached to a MySQL database. The database is full of records - several million of them.
My models look like this:
class LAN(models.Model):
    ...

class Record(models.Model):
    start_time = models.DateTimeField(...)
    end_time = models.DateTimeField(...)
    ip_address = models.CharField(...)
    LAN = models.ForeignKey(LAN, related_name="records", ...)
    bytes_downloaded = models.BigIntegerField(...)
    bytes_uploaded = models.BigIntegerField(...)
Each record covers a window of time, and shows whether a particular IP address on a particular LAN did any downloading or uploading during that window.
What I need to know is this:
Given a beginning date and an end date, give me a table of which DAYS a particular LAN had ANY activity (i.e. has any records).
Ex:
Between Jan 1 and Jan 31, tell me which DAYS LAN A had ANY records on them.
Assume that once in a while, a LAN will shut down for days at a time and have no records or any activity on those days.
My Solution:
I can do this the slow way by attaching some methods to my LAN model:
from django.db.models import Q
from django.utils import timezone

class LAN(models.Model):
    ...

    # Returns True if there are records for the current LAN between 2 given dates
    # Returns False otherwise
    def online(self, start, end):
        criterion1 = Q(start_time__lt=end)
        criterion2 = Q(end_time__gt=start)
        return self.records.filter(criterion1 & criterion2).exists()

    # Returns a list of days that a LAN was online for between 2 given dates
    def list_online_days(self, start, end):
        start_date = timezone.make_aware(timezone.datetime.strptime(start, "%b %d, %Y"))
        end_date = timezone.make_aware(timezone.datetime.strptime(end, "%b %d, %Y"))
        end_date = end_date.replace(hour=23, minute=59, second=59, microsecond=999999)

        days_online = []
        current_date = start_date
        while current_date <= end_date:
            start_of_day = current_date.replace(hour=0, minute=0, second=0, microsecond=0)
            end_of_day = current_date.replace(hour=23, minute=59, second=59, microsecond=999999)
            if self.online(start=start_of_day, end=end_of_day):
                days_online.append(current_date.date())
            current_date += timezone.timedelta(days=1)
        return days_online
At which point, I can run:
lan = LAN.objects.get(id=1) # Or whatever LAN I'm interested in
days_online = lan.list_online_days(start="Jan 1, 2020", end="Jan 31, 2020")
This works, but results in one query being run per day between my start date and end date. In this case, 31 queries (Jan 1, Jan 2, etc.).
This makes it really, really slow for large time periods, as it needs to go through all the records in the database 31 times. Database indexing helps, but it's still slow with enough data in the database.
Is there a way to do a single database query to give me what I need?
I feel like it would look something like this, but I can't quite get it right:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date').annotate(exists=Exists(SOMETHING))
The first part:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date')
Seems to give me what I want - one value per day, but I'm not sure how to annotate the result with an exists field that shows if any records exist on that day.
Note: This is a simplified version of my app - not the exact models and fields, so if certain things could be improved, like not using CharField for the ip_address field, don't focus too much on that
The answer ended up being simpler than I thought, mostly because I already had it.
This:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date').annotate(exists=Exists(Record.objects.filter(pk=OuterRef('pk'))))
was what I was expecting, but all it does is return exists=True for all days returned - which is accurate, but not very helpful. This is because any days that had no records on them are already omitted from the results.
That means I can skip the entire annotate section, and just do this:
lan.records.filter(criterion1 & criterion2).annotate(date=TruncDay('start_time')).order_by('date').distinct().values('date')
which already gives me one datetime value per day on which records were present, and skips the days where there weren't any.
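
For reference, on MySQL that final queryset compiles to roughly the following SQL (a sketch: the table and column names app_record and lan_id depend on your app, and the date bounds shown are the example's Jan 1 - Jan 31 window):

SELECT DISTINCT CAST(DATE_FORMAT(start_time, '%Y-%m-%d 00:00:00') AS DATETIME) AS date
FROM app_record
WHERE lan_id = 1
  AND start_time < '2020-02-01 00:00:00'
  AND end_time > '2020-01-01 00:00:00'
ORDER BY date;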

Django: Filter on Annotated Value

I have a situation where I have a model called trip. Each trip has a departure_airport and an arrival_airport, which are related fields and both part of the airport model. Each object in the airport model has a location represented by latitude and longitude fields.
I need to be able to take as input two (potentially) separate departure and arrival airport locations and use something like the Haversine formula. That formula would calculate the distance from each departure/arrival airport in the database to the location of the airports taken as input.
The difficult part of this query is that I annotate the trip queryset with the locations of the departure and arrival airports. However, because there are two sets of latitude/longitude fields with the same names (one for each airport), and because you can't reference annotated fields in a SQL WHERE clause, I'm not able to use both sets of airports in the query.
I believe the solution is to use a subquery on the annotated fields so that the query executes before the WHERE clause, but I've been unable to determine whether that is possible for this query. The other option is to write raw SQL.
Here's what I have so far:
from django.db.models import F, Q
from django.db.models.expressions import RawSQL

GCD_FORMULA_TO = """3961 * acos(
    cos(radians(%s)) * cos(radians(arrival_lat))
    * cos(radians(arrival_lon) - radians(%s)) +
    sin(radians(%s)) * sin(radians(arrival_lat)))"""

GCD_FORMULA_FROM = """3961 * acos(
    cos(radians(%s)) * cos(radians(departure_lat))
    * cos(radians(departure_lon) - radians(%s)) +
    sin(radians(%s)) * sin(radians(departure_lat)))"""

location_to = Q(location_to__lt=self.arrival_airport_rad)
location_from = Q(location_from__lt=self.departure_airport_rad)
qs = self.queryset\
    .annotate(arrival_lat=F('arrival_airport__latitude_deg'))\
    .annotate(arrival_lon=F('arrival_airport__longitude_deg'))\
    .annotate(departure_lat=F('departure_airport__latitude_deg'))\
    .annotate(departure_lon=F('departure_airport__longitude_deg'))\
    .annotate(location_to=RawSQL(GCD_FORMULA_TO,
        (self.arrival_airport.latitude_deg, self.arrival_airport.longitude_deg,
         self.arrival_airport.latitude_deg)))\
    .annotate(location_from=RawSQL(GCD_FORMULA_FROM,
        (self.departure_airport.latitude_deg, self.departure_airport.longitude_deg,
         self.departure_airport.latitude_deg)))\
    .filter(location_to & location_from)
return qs
Any ideas? Also open to other ways to go about this.
You're doing this the hard way.
If your python code has a pair of locations, use this:
from geopy.distance import distance
loc1 = (lat1, lng1)
loc2 = (lat2, lng2)
d = distance(loc1, loc2).km
If you're querying a database, perhaps you would prefer that it run PostGIS / Postgres rather than MySQL, so you can compute distance and shape membership.
The syntax is sometimes on the clunky side, but the indexing works great.
Here is an example for departing from London Heathrow:
SELECT a.airport_name,
       ST_Distance('SRID=4326;POINT(-0.461389 51.4775)'::geography,
                   ST_SetSRID(ST_Point(a.longitude, a.latitude), 4326)::geography) AS distance
FROM arrival_airports a
ORDER BY distance;
As a separate matter, you might consider defining an arrival and/or departure VIEW on your table, and then JOIN, with a distinct model for each view.
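
A minimal sketch of that idea, assuming the underlying table is named airport with hypothetical id, name, latitude_deg and longitude_deg columns:

CREATE VIEW departure_airports AS
SELECT id, name, latitude_deg, longitude_deg FROM airport;

CREATE VIEW arrival_airports AS
SELECT id, name, latitude_deg, longitude_deg FROM airport;

Each view can then back its own unmanaged Django model (managed = False, with db_table set to the view name), so the two sets of coordinate columns no longer collide within a single query.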

How can this query be optimized for speed?

This query creates an export for UPS from the deliveries history:
select 'key'
, ACC.Name
, CON.FullName
, CON.Phone
, ADR.AddressLine1
, ADR.AddressLine2
, ADR.AddressLine3
, ACC.Postcode
, ADR.City
, ADR.Country
, ACC.Code
, DEL.DeliveryNumber
, CON.Email
, case
    when CON.Email is not null
    then 'Y'
    else 'N'
  end Ship_Not_Option
, 'Y' Ship_Not
, 'ABCDEFG' Description_Goods
, '1' numberofpkgs
, 'PP' billing
, 'CP' pkgstype
, 'ST' service
, '1' weight
, null Shippernr
from ExactOnlineREST..GoodsDeliveries del
join ExactOnlineREST..Accounts acc
on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact
where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?
Probably this query is run at most once every day, since the delivery date does not contain a time. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized by a solution such as "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of a join_set(soe, orderid, 100) is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember that the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy#inmemorystorage as select ... from ...
Then create a temporary table per 100 or similar rows such as:
create or replace table gdysubpartition1#inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or similar.
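
Fleshed out against the export query above, that inline-view variant might look roughly like this (a sketch: the limit of 100 rows is illustrative and the select list is trimmed):

select del.DeliveryNumber
,      acc.Name
from   ( select *
         from   ExactOnlineREST..GoodsDeliveries
         where  DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
         limit  100
       ) del
join   ExactOnlineREST..Accounts acc
on     acc.ID = del.DeliveryAccount
order
by     del.DeliveryNumber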
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use 'local memorize results clipboard' and 'local save results clipboard to ...' to save the last results to a file manually, and later restore them using 'local load results clipboard from ...' and 'local insert results clipboard in table ...'. And maybe then 'insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate)'.
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the Exact Online APIs in another way. Or register a request with Exact to also allow POST requests to the API with the filter in the body.

MySQL DB for time-temperature values

I need your help to build my db the right way.
I need to store time-temperature values for different rooms of my house, and I want to use DyGraph to graph the data sets.
I want to implement different time windows: 1 hour, 24 hours, 48 hours, 1 week, ...
I will be detecting the temperature at 15-minute intervals, so I will have 4 time-temperature values per hour.
Each room has an ID, so the time-temperature values will be associated with the proper room.
The table I built is very simple:

| ID | DATE                | TEMP |
|----|---------------------|------|
| 1  | 2014-04-30 00:00:00 | 18.6 |
| 2  | 2014-04-30 00:00:00 | 18.3 |
| 3  | 2014-04-30 00:00:00 | 18.3 |
| 1  | 2014-04-30 00:15:00 | 18.5 |
For some strange reason, when the number of rows gets to 500 or so, the server becomes very slow.
Also, I have a web page where I can read the different temperatures of the rooms: this page polls the server through AJAX every 5 seconds (because it needs to be frequently updated!), but when the number of rows of the table gets to around 500, it hangs.
I tried to split the table: I created a table for each room, then a table for each time window, and now everything seems to be working fine.
Since I do not think this is the best or most efficient way to organize this, I need your help to give it a better structure.
I use a PHP script to retrieve the temperature data for all the rooms of my house:
$query = "SELECT * FROM temperature t1
WHERE (id, date) IN
(SELECT id,MAX(date) FROM
temperature t2 GROUP BY id)";
this query allows me to collect the temperature values in an array called $options:
$result_set = mysql_query($query, $connection);
while($rows = mysql_fetch_array($result_set)){
    $options[] = $rows;
}
then, I json-encode the array:
$j = json_encode($options);
and send it to the ajax script, which shows the data on the web page:
echo $j;
In the ajax script, I save the data in a variable and then parse it:
var return_data = xhr.responseText;
var temperature = JSON.parse(return_data);
next I loop through the array to extract the temperature values and put it in the right place on the web page:
for(var j = 0; j < temperature.length; j++){
    document.getElementById("TEMPArea" + j).innerHTML = temperature[j].temp + "°C";
}
This works fine as long as the rows in the 'temperature' table are less than 600 or so: polling every 5 seconds is not a problem.
Above 600, the page refresh gets slow and eventually it hangs and stops refreshing.
EDIT: Right now, I am working on a virtual machine with Windows 7 64bit, Apache, PHP and MySQL, 4GB RAM. Do you think this could be an issue?
I am not an expert, and the code is pretty simple and straightforward, so I am having trouble detecting the cause.
Thanks again.
I think the query is the main source of problems:
it's a slow way of getting the answer you want (you can always run it in Workbench and study the output of EXPLAIN - see the manual for more details);
it implicitly supposes that all sensors transmit at the same time, and as soon as that's not the case your output dataset won't be complete. Normally you'll want the latest data from each individual sensor.
so I propose a somewhat different approach:
add an index on date and one on id to speed up queries (a minimal sketch of these follows this list). The lack of a PK is an issue, but let's first focus on solving the current issues...
obtain the list of available sensors - minimal solution
select distinct id from temperature;
but it would be better to store a list of available sensors in some other table - this query will also get slower as the number of records in temperature grows.
iterate over the results of that list to fetch the latest value for each of the sensors
select * from temperature
where id = (value obtained in previous step)
order by date desc
limit 1;
with this query you'll only get the most recent record associated with each sensor. Thanks to the indexes, the speed impact of a growing table should be minimal.
reassemble these results in a data structure to send to your client web page.
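The indexes from the first step, as a minimal sketch against the temperature table from the question (a composite index on (id, date) could serve the per-sensor query even better, but the two single-column indexes match the advice above):

CREATE INDEX idx_temperature_date ON temperature (date);
CREATE INDEX idx_temperature_id ON temperature (id);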
Also, as stated in the documentation, the mysql_* extension is deprecated and should not be used in new programs. Use mysqli_* or preferably PDO. Both of these extensions also allow you to use parameterized queries, the only real protection against SQL injection issues. See here for a quick introduction on how to use them.
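
The parameter binding itself happens on the PHP side with PDO or mysqli, but the underlying idea can be sketched at the MySQL level with a server-side prepared statement (the sensor id 1 is a placeholder):

PREPARE latest_for_sensor FROM
    'SELECT * FROM temperature WHERE id = ? ORDER BY date DESC LIMIT 1';
SET @sensor_id = 1;
EXECUTE latest_for_sensor USING @sensor_id;
DEALLOCATE PREPARE latest_for_sensor;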

Google map display markers by hours and location

I am trying to implement a recovery community meeting finder. I have a database and map set up. I am trying to add a variable to display the current day's meetings, based also on a "nearest" location priority. How do I get today's date into my database query so that it selectively displays only that day's meetings? I'm using the Google Maps API.
Thanks,
Terry
I am guessing that in your database you have a datetime field in which you store the date and time of the meetings.
When you pull the information from your database to put on the map, simply use the appropriate SQL to select only those records that match the date you need.
This is actually more of a SQL problem.
The closest location may not be catered for yet, but what you need to do there is also a SQL question. You need to add lat and long fields to your database and store the lat/longs of each of the meeting locations.
Then you can do a distance-based SQL search once you have those and the lat/long of the user.
Maps doesn't come into the selection process at all.
...
Edit.
Selecting by time is fairly simple, but I thought I would share a distance-based SQL SELECT I used a while back. Note: it was used with MySQL, but I think it should work with almost any SQL DB.
"SELECT name, tel, (((acos(sin((".$latitude."*pi()/180)) * sin((`latitude`*pi()/180))+cos((".$latitude."*pi()/180)) * cos((`latitude`*pi()/180)) * cos(((".$longitude."- `longitude`)*pi()/180))))*180/pi())*60*1.1515) AS distance FROM takeaway WHERE active AND (((acos(sin((".$latitude."*pi()/180)) * sin((`latitude`*pi()/180))+cos((".$latitude."*pi()/180)) * cos((`latitude`*pi()/180)) * cos(((".$longitude."- `longitude`)*pi()/180))))*180/pi())*60*1.1515) < 10 AND tel AND type = 'transport/taxi' ORDER BY distance LIMIT 5"
That gives you the basics for editing and reusing. Just remember to add the time/date check into your final string.
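
For the time/date check itself, a minimal sketch (the meetings table and its meeting_time column are hypothetical names):

-- Today's meetings only; table and column names are placeholders.
SELECT name, meeting_time, latitude, longitude
FROM meetings
WHERE DATE(meeting_time) = CURDATE()
ORDER BY meeting_time;

If meetings recur weekly rather than carrying a full date, the equivalent check would compare a stored weekday column against DAYOFWEEK(NOW()).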