Suppose I have one bus, which departs at 08:00 and 10:00.
Since the bus uses the same stops and stop_sequence values, should I split it into one trip per departure, or can I use the same trip_id for multiple sets of stop_times?
Example:
TripA - Stop_timesA (Departures at 08:00, TripA), Stop_timesB (Departures at 10:00, TripA)
Or
TripA - Stop_timesA (Departures at 08:00, TripA)
TripB - Stop_timesB (Departures at 10:00, TripB)
Thanks.
Generally, you'd use separate trips.
For example, if the two stop-times both belonged to the same trip AND had the same stop_sequence values, it'd typically be considered a validation error, since stop_sequence should be unique for each stop-time in the same trip.
Just to be clear, there is nothing wrong with having the same stop appear in the same trip more than once, especially for loop routes. However, if you aren't modeling a loop route, you should use separate trips. Otherwise, you are saying that a rider can get on at the first stop-time, ride through all the intermediate stop-times, and arrive at the same stop again two hours later. Maybe that's your case, but I'm guessing not.
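To make the validation point concrete, here is a minimal sketch of such a check. The row layout (dicts with trip_id, stop_sequence and departure_time) and the 15-minute second stop are assumptions for illustration, not taken from a real feed or validator.

```python
from collections import Counter

def duplicate_stop_sequences(stop_times):
    """Return the (trip_id, stop_sequence) pairs that occur more than once."""
    counts = Counter((row["trip_id"], row["stop_sequence"]) for row in stop_times)
    return sorted(key for key, n in counts.items() if n > 1)

# Both departures stored under TripA: stop_sequence 1 and 2 each appear twice.
one_trip = [
    {"trip_id": "TripA", "stop_sequence": 1, "departure_time": "08:00:00"},
    {"trip_id": "TripA", "stop_sequence": 2, "departure_time": "08:15:00"},
    {"trip_id": "TripA", "stop_sequence": 1, "departure_time": "10:00:00"},
    {"trip_id": "TripA", "stop_sequence": 2, "departure_time": "10:15:00"},
]

# Same rows, but the 10:00 departure gets its own trip_id.
two_trips = [
    dict(row, trip_id="TripB") if row["departure_time"] >= "10:00:00" else row
    for row in one_trip
]

print(duplicate_stop_sequences(one_trip))   # duplicates found
print(duplicate_stop_sequences(two_trips))  # no duplicates
```

With separate trips, each (trip_id, stop_sequence) pair is unique, which is what the check above looks for.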
According to the GTFS specification, "A trip is a sequence of two or more stops that occurs at specific time". This would indicate that each of the departures would be a separate trip and have a separate trip_id in the data set.
However, the question would indicate that each departure (or trip) should be on the same route.
It took me a while to really understand how GTFS works. The spec is a good place to start; read it carefully.
My answer is quite late, but... NO, you shouldn't.
The mechanism to "repeat" the same trip at different times on the same day is by using the frequencies table.
In your example, you would define a single TripA in both trips and stop_times tables.
On the frequencies table, you declare the start_time as "08:00:00", end_time as "11:59:59" and headway_secs as "7200" (two hours). All this means that the trips will run every 2 hours starting at 08:00:00, but no trip will start after 11:59:59 - so there will be only two trips starting at 08:00:00 and 10:00:00.
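As a rough illustration of how a consumer might expand that frequencies row into departures, here is a small sketch. The function names are made up, and it assumes that end_time is treated as exclusive (no trip starts at or after it), which matches the description above.

```python
def to_seconds(hhmmss):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def to_hhmmss(seconds):
    """Convert seconds since midnight back to 'HH:MM:SS'."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def expand_frequency(start_time, end_time, headway_secs):
    """List the trip start times implied by one frequencies row."""
    start, end = to_seconds(start_time), to_seconds(end_time)
    # range() stops before `end`, so no trip starts at or after end_time.
    return [to_hhmmss(t) for t in range(start, end, headway_secs)]

departures = expand_frequency("08:00:00", "11:59:59", 7200)
print(departures)  # ['08:00:00', '10:00:00']
```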
If you duplicate your trip by creating tripA and tripB, you have at least two problems:
all records will be duplicated on the stop_times table, making the GTFS file bigger/heavier if you have multiple stops and/or multiple trips on the same day
maintenance will be much more complicated - if one single stop is changed, you have to change it in all trip "clones"
I have thousands of trips and each trip has a starting point and destination. My goal is to store the miles traveled in each state on each trip. (If I travel from Denver to SLC, I will end up with some miles in CO and some miles in UT.)
I already have a table that has start and finish points, and I need to think about a good architecture for storing miles per state for each trip.
Idea #1:
- Create 1 table with trip_id and 48 columns -> 1 for each lower continental state.
Idea #2:
- Create a table with trip_id, state_id (+ additional table with locations), total_miles
Solution 1 seems to be more efficient when it comes to storing data (fewer rows), and I am not sure how one table would handle 2 million trips plus multiple entries per state in the new table.
Any suggestions?
Idea#2 sounds like a much better idea. Let me explain some reasons.
First, your definition of "state" might change and adding/removing columns is quite cumbersome. For instance, "DC" might be a state. Or you might expand to Canada. Or even bring in Alaska. Adding new rows to a table is trivial.
Second, for simple tasks such as adding up the total miles for the trip, a simple aggregation query is much simpler than adding up 48+ columns.
Third, what about trips that might re-enter a state more than once? Imagine someone driving from Kalamazoo (MI) to Houghton (MI) on the upper peninsula. They could very reasonably take the route through Chicago, passing through MI --> IN --> IL --> WI and then back to MI.
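A minimal sketch of Idea #2, using SQLite through Python so it runs anywhere. The table and column names are assumptions, the leg_no column is an addition to keep re-entries into a state as separate rows, and the mileages are invented for the Kalamazoo-to-Houghton example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trip_state_miles (
        trip_id INTEGER NOT NULL,
        leg_no  INTEGER NOT NULL,  -- order in which the states are crossed
        state   TEXT    NOT NULL,
        miles   REAL    NOT NULL,
        PRIMARY KEY (trip_id, leg_no)
    )
""")

# One row per state crossing; MI appears twice (made-up mileages).
legs = [
    (1, 1, "MI", 120.0),
    (1, 2, "IN", 45.0),
    (1, 3, "IL", 60.0),
    (1, 4, "WI", 280.0),
    (1, 5, "MI", 110.0),
]
conn.executemany("INSERT INTO trip_state_miles VALUES (?, ?, ?, ?)", legs)

# Total miles for the trip: one aggregate instead of adding 48 columns.
total = conn.execute(
    "SELECT SUM(miles) FROM trip_state_miles WHERE trip_id = ?", (1,)
).fetchone()[0]

# Miles per state, with both MI crossings folded together.
per_state = dict(conn.execute(
    "SELECT state, SUM(miles) FROM trip_state_miles "
    "WHERE trip_id = ? GROUP BY state", (1,)
).fetchall())

print(total, per_state)
```

Adding DC, Alaska or Canadian provinces later only means new values in the state column, not a schema change.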
Intro
I'm using a slightly modified GTFS database.
I have a first step algorithm that given two geographical locations provides:
the list of stops around departure and arrival
the list of routes that connects those list of stops
The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.
My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).
Database
The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):
stops: stop information (geolocation, name, etc)
stop_times: timetable
trips: itinerary taken by a vehicle (bus, metro, etc)
routes: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)
I have added the following tables
stop_connections: stop to stop connections (around 1 to 20)
stops_routes: lists the available routes at every stop
Here's the table row count in a city where I get slow results (Paris, France):
stops: 28k
stop_times: 12M
trips: 513k
routes: 1k
stop_connections: 365k
stops_routes: 227k
Algorithm
The first step of my algo takes two latitude/longitude points as input, and provides:
the list of stops at each location
the routes that can be used to connect those stops (with up to two connections)
The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.
This is the part that I'm trying to optimize. Here's how I'm querying the database:
My search terms (green in the picture):
one departure stop
several arrival stops (1 to 20)
allowed routes at departure, at first connection and on last trip
service ID (not relevant here, can be ignored)
Here's what I do now:
Start from a stop => get timetable => get trips => get routes; filter on allowed routes.
Connect the arrival stops of the first trip to a list of possible stops using stop_connections
Repeat from step 1 two times so that I have 3 trips/2 connections
The problem
This is working fine in some cases, but it can be very slow in others. Usually as soon as I join the timetable or the stop connections, there is a 10x increase in the returned rows. Since I'm joining these tables 8 times, there are potentially 10^8 rows to be searched by the engine.
Now I'm sure that I can get this to be more efficient.
My problem is that the number of rows increases at every join, and the arrival stop selection is made at the very end.
I mean I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip, I can filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.
Optimizations
Here's what I tried/thought of:
1. Inner join stops_routes when joining stop_connections
Only select stops at a connection that lead to the allowed routes at next trip.
This is sometimes efficient when there are a lot of connections and not all the connected stops are interesting (some connected stops might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.
2. Partition the stop_times table
Create a smaller copy of the stop_times table that contains only the timetable of the next two hours or so. Indeed, having the database engine search the whole timetable (up to 10pm for example) when my trip starts at 8am is useless. Keeping only 8am-10am is enough and much faster.
This is very efficient, because it dramatically decreases the number of rows to be searched.
I have implemented this with success, it decreased the search time by a factor of about 10x or even 100x.
3. Identify 'good' and 'bad' routes
There are usually, in a metropolitan area, large routes that are very useful when travelling long distances. But these routes aren't the best option when travelling short distances. A person who knows their own city's public transportation system will quickly tell that from this neighborhood to this other one, the best option is to take a specific route.
However, this is very difficult to do, and requires customization for every city.
I plan to make this algo completely independent of the city, so I'm not really willing to go down that road.
4. Use crowdsourcing to identify paths that work well
The first search is slow, but the information taken from it can be used to serve fast results to the next person with a similar journey.
However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.
I don't know if this is a good idea. I haven't implemented it.
Next
I'm running out of ideas. I know this is not a programming question, but rather a request for ideas on an algorithm. I hope this falls into the SO scope.
Having it on a network makes things a little bit interesting, but fundamentally, you're doing pathfinding, which is a slow process. You're running into the exponential nature of the problem, and doing so with only 3 connections.
I have a couple suggestions that you can perhaps use while doing this with mysql, and a couple that are likely not implementable within it.
Rather than partitioning the timetable, only take the next time for any given route. If you're leaving at 8 AM, you're correct, only looking at routes from 8-10 is better than looking at them all. However, if there's a route from A-B that leaves at 8:20, 8:40, 9:00, 9:15, 9:25, 9:45... there is zero reason to take them all: just take the first arrival time for any given route, since it's strictly better than the rest.
I presume you are pruning any routes that return to an already-visited location? If not, you perhaps should be: they're not useful for you. This may be somewhat difficult to do within the SQL framework.
Depending on its coverage, you could perhaps find a path using the (much smaller) routes table, and then find the best implementation of the top working paths from the trips table.
This is likely impossible within the framework of SQL, but the thing that makes most decent pathfinding algorithms fast is that they use a heuristic to search. Your search goes down every possible route -- it would be a lot faster to first look down the route that leads in the right direction. If it doesn't pan out, less likely directions are picked. The key here is that as soon as you have a result, you return it -- you effectively pruned every route you didn't yet search by the time you returned an answer.
Pre-calculated preferred routes: you suggest this would require human intervention, but I counter that you could do it computationally. Spend the time properly searching for routes from various points to various other points, and check on the statistics of how the routes worked. I would expect that you will find things allowing you to make an "anywhere over here to anywhere over there is going to use this intermediate path" table -- your problem is reduced from "find a path from A to B" to "find a path from A to C, followed by a path from D to B". Doing this has the potential to give you sub-optimal routes (as you are making an assumption from the precalculated statistics), but it may let you find that sub-optimal route much faster. On a mesh layout it will not work at all well; on a hub layout it will work excellently.
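The first suggestion (keep only the next departure per route) can be sketched in a few lines. The data layout is an assumption: (route, time) pairs with zero-padded "HH:MM" strings so that plain string comparison orders them correctly. It also assumes every trip on a route takes the same time, so the earliest departure is also the earliest arrival; otherwise compare arrival times instead.

```python
def next_departure_per_route(departures, not_before):
    """Keep only the earliest departure at or after `not_before` for each route."""
    best = {}
    for route, time in departures:
        if time >= not_before and (route not in best or time < best[route]):
            best[route] = time
    return best

departures = [
    ("A-B", "08:20"), ("A-B", "08:40"), ("A-B", "09:00"),
    ("A-C", "07:55"), ("A-C", "08:05"),
]
pruned = next_departure_per_route(departures, "08:00")
print(pruned)  # {'A-B': '08:20', 'A-C': '08:05'}
```

Five candidate rows shrink to two before the next join, which is where the row explosion would otherwise start.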
Thanks to zebediah49, I have implemented the following algorithm:
0. Lookup tables
First, I have created an ID on the trips table, that uniquely identifies it. It is based on the list of stops taken in sequence. So this ID guarantees that two trips with the same ID will take exactly the same route.
I called this ID trip_type.
I have improved my stop_connections table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.
1. Get trips running from the departure stop(s)
Limit those trips to only 1 per trip type (group by trip_type)
2. Get arrival stops from these trips
Select only the best trip if there are two trips reaching the same stop
3. Get connected stops from these arrival stops
Select only the best connection if there are >1 stops that are connected to the same stop
4. Repeat from step 1
I have split this into several subqueries and temporary tables, because that way I can easily group and filter the best stops/trips at each step. This ensures that as few searches as possible are sent to the SQL server.
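Steps 0 and 1 above can be sketched outside SQL as follows. This is only an illustration under assumptions: trip_type is derived here by hashing the ordered stop IDs (the original does not say how the ID is built), stop IDs are assumed not to contain commas, and times are zero-padded "HH:MM" strings.

```python
import hashlib

def trip_type(stop_ids):
    """Derive an ID that is identical for trips serving the same stops in the same order."""
    joined = ",".join(stop_ids)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]

def first_trip_per_type(trips, not_before):
    """trips: iterable of (trip_id, departure_time, [stop_ids]).

    Returns the IDs of the earliest usable trip for each trip type.
    """
    best = {}
    for trip_id, departure, stops in trips:
        if departure < not_before:
            continue  # already left
        key = trip_type(stops)
        if key not in best or departure < best[key][1]:
            best[key] = (trip_id, departure)
    return sorted(trip_id for trip_id, _ in best.values())

trips = [
    ("t1", "08:10", ["s1", "s2", "s3"]),
    ("t2", "08:30", ["s1", "s2", "s3"]),  # same pattern as t1, but later
    ("t3", "08:20", ["s1", "s3"]),        # express pattern
]
kept = first_trip_per_type(trips, "08:00")
print(kept)  # ['t1', 't3']
```

In SQL the same effect comes from grouping by trip_type and keeping the minimum departure time per group.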
I have stored this algorithm into an SQL procedure, that will do this in a single SQL statement:
call Get2CJourneys(dt, sd, sa, r1, r2, r3)
Where:
dt: departure time
sd: stops at departure point
sa: stops at arrival point
r1, r2, r3: allowed routes for the 1st, 2nd and 3rd trips
The procedure call returns interesting results in <600ms where my previous algorithm returned the same results in several minutes.
Expanding on @zebediah49's fourth point, you can precompute the vector (direction) traveled by each route, e.g. a route going due north has a vector of 0, due west = 90, due south = 180, due east = 270. Only return routes whose vectors are within, say, +/- 15 degrees (modulo 360) of the as-the-crow-flies direction (or +/- 30 if the +/- 15 query doesn't return any hits).
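A minimal sketch of that direction filter. Note one assumption that differs from the text: it uses the usual compass convention (0 = north, 90 = east) rather than 90 = west. Either works as long as route directions and the target direction use the same one. The function names and sample bearings are made up.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2 (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360

def angle_diff(a, b):
    """Smallest absolute difference between two bearings, handling wrap-around at 360."""
    return abs((a - b + 180) % 360 - 180)

def pick_routes(route_bearings, target):
    """Try a +/- 15 degree window first, then widen to +/- 30 if nothing matches."""
    for tolerance in (15, 30):
        hits = sorted(r for r, b in route_bearings.items()
                      if angle_diff(b, target) <= tolerance)
        if hits:
            return hits
    return []

route_bearings = {"R1": 5, "R2": 100, "R3": 350}
print(pick_routes(route_bearings, 0))    # R1 and R3 point roughly north
print(pick_routes(route_bearings, 125))  # only R2, found in the wider window
```

The wrap-around handling in angle_diff is the part that is easy to get wrong: 350 and 10 are 20 degrees apart, not 340.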
I have a table for news articles, containing among other things the author, the time posted and the word count for each article. The table is rather large, containing more than one million entries and growing by about 10,000 entries each day.
Based on this data, a statistical analysis is done, to determine the total number of words a specific author has published in a specific time-window (i.e. one for each hour of each day, one for each day, one for each month) combined with an average for a time-span. Here are two examples:
Author A published 3298 words on 2011-11-04 and 943.2 words on average for each day in the two months prior (from 2011-09-04 to 2011-11-03)
Author B published 435 words on 2012-01-21 between 1pm and 2pm and an average of 163.94 words each day between 1pm and 2pm in the 30 days before
Current practice is to start a script at the end of each defined time-window via cron-job, which calculates the count and the averages and stores it in a separate table for each time-window (i.e. one for each hourly window, one for each daily, one for each monthly etc...).
The calculation of sums and averages can easily be done in SQL, so I think Views might be a more elegant solution to this, but I don't know about the implications on performance.
Are Views an appropriate solution to the problem described above?
I think you can use materialized views for this. They are not really implemented in MySQL, but you can emulate them with tables. Look at
Views will not be equivalent to your denormalization.
If you are moving aggregate numbers somewhere else, that has a certain cost, which you are paying in order to keep the data correct, and a certain benefit, which is much less data to look through when querying.
A view will save you from having to think too hard about the query each time you run it, but it will still need to look through the larger amount of data in the original tables.
While I'm not a fan of denormalization, since you already did it, I think the view will not help.
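To show what "a materialized view emulated with a table" could look like, here is a small sketch. It uses SQLite through Python so it is self-contained; the table names, column names and sample rows are assumptions, and in MySQL the refresh would typically run from the existing cron job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (author TEXT, posted_at TEXT, word_count INTEGER);
    -- Summary table playing the role of a materialized view.
    CREATE TABLE daily_author_words (
        author TEXT, day TEXT, words INTEGER,
        PRIMARY KEY (author, day)
    );
""")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    ("A", "2011-11-02 09:00:00", 500),
    ("A", "2011-11-02 15:30:00", 300),
    ("A", "2011-11-03 10:00:00", 1000),
    ("A", "2011-11-04 08:00:00", 3298),
])

def refresh_day(day):
    """Recompute the summary rows for one day (safe to re-run)."""
    conn.execute("DELETE FROM daily_author_words WHERE day = ?", (day,))
    conn.execute("""
        INSERT INTO daily_author_words (author, day, words)
        SELECT author, date(posted_at), SUM(word_count)
        FROM articles
        WHERE date(posted_at) = ?
        GROUP BY author, date(posted_at)
    """, (day,))

for day in ("2011-11-02", "2011-11-03", "2011-11-04"):
    refresh_day(day)

# The average over the prior days now reads the small summary table only.
avg_prior = conn.execute("""
    SELECT AVG(words) FROM daily_author_words
    WHERE author = ? AND day >= ? AND day < ?
""", ("A", "2011-11-02", "2011-11-04")).fetchone()[0]
print(avg_prior)  # 900.0
```

One caveat of this sketch: AVG only sees days that have a row, so days on which an author published nothing are not counted as zero.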
I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values of each top 1500 player every hour by a php script and insert them into the MySQL as one row. So every day the MySQL will grow 36,000 rows. I will have a second script that deletes every row that is older than 28 days, otherwise the database would get insanely huge. Both scripts will run as a cronjob.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query in which all players have to be returned that didn't gain points1 during a certain time period from the latest entry. This should return a list of players that didn't gain points (for the last 24h for example).
The third is a query in which all players should be listed that lost a certain amount or more points2 in a certain time period from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes. Especially since your DB will have over 1 million rows in that one table at any given time. Since your writes are happening via cron, you should be okay as long as you insert your 1500 rows in batches rather than one round trip to the DB for every row. I'd also look into query compiling so that you save that overhead as well.
Ranhiru Cooray is correct, you should only store data like the player name once in the DB. Create a players table and use the primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing that those are more or less enumerated values that you can store in another table to normalize your design, and return in your results with appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
Your design may also be flawed in your ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done by PHP? It's the same as with invoice tables, you never store the invoice total because it is redundant. The items/pricing/etc can be used to calculate the order totals.
With all the adding/deleting, I'd be sure to run OPTIMIZE frequently and keep good backups. MySQL tables---if using MyISAM---can become corrupted easily in high writing/deleting scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.
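As a rough sketch of the normalized layout and the batched insert, here is a cut-down version with a single points column. It runs on SQLite through Python for convenience; table names, column names and sample data are assumptions, and the "no gain" query compares one earlier snapshot with the latest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE rankings (
        player_id INTEGER REFERENCES players(player_id),
        taken_at  TEXT,
        points1   INTEGER,
        PRIMARY KEY (player_id, taken_at)
    );
""")
conn.executemany("INSERT INTO players VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# One batched insert per hourly snapshot instead of one round trip per row.
snapshots = [
    (1, "2024-01-01 12:00:00", 100), (2, "2024-01-01 12:00:00", 50),
    (1, "2024-01-02 12:00:00", 100), (2, "2024-01-02 12:00:00", 80),
]
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", snapshots)

def players_without_gain(since, latest):
    """Names of players whose points1 did not increase between two snapshots."""
    rows = conn.execute("""
        SELECT p.name
        FROM rankings AS prev
        JOIN rankings AS curr ON curr.player_id = prev.player_id
        JOIN players  AS p    ON p.player_id = prev.player_id
        WHERE prev.taken_at = ? AND curr.taken_at = ?
          AND curr.points1 <= prev.points1
        ORDER BY p.name
    """, (since, latest)).fetchall()
    return [name for (name,) in rows]

idle = players_without_gain("2024-01-01 12:00:00", "2024-01-02 12:00:00")
print(idle)  # ['alice']
```

The composite primary key (player_id, taken_at) doubles as the index for the per-player history query, so the player name itself never needs indexing in the large table.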
I have created a train schedule database in MySQL. There are several thousand routes for each day. With a few exceptions, most of the routes are the same for every working day, but differ on weekends.
At this time I basically update my SQL tables at midnight each day, to get the departures for the next 24 hours. This is however very inconvenient. So I need a way to store dates in my tables so I don't have to do this every day.
I tried to create a separate table where I stored dates for each routenumber (routenumbers are reset each day), but this made my query so slow that it was impossible to use. Does this mean I would have to store my departure and arrival times as datetimes? In that case the main table containing routes would have several million entries.
Or is there another way?
My routetable looks like this:
StnCode (referenced in separate Station table)
DepTime
ArrTime
Routenumber
legNumber
How were you storing the dates? A single date/time field? That'd certainly be the most compact representation, but also the most difficult to index and scan, especially if you're doing queries of the following type:
SELECT ...
WHERE MONTH(DepTime) = 4 AND DAY(DepTime) = 19;
Such a construct would require a full table scan to tear apart each date field and extract the month/day. For such a case, it'd be better to denormalize a bit, split the datetime into separate year/month/day/hour/minute fields, and place indexes on them. It's a bit more of a hassle to maintain, but it would also speed up querying by specific time parts immensely.
Instead of storing schedules in terms of dates, you can store them against the day of the week (Sun, Mon, Tue, etc.). This eliminates storing dates for routes. You can treat the routes as predetermined, and thus fixed: there are around 8000 trains (passenger trains), the days are fixed (7), and the routes (50-1000) are numbered as published in the railway books (table 1, 1A, and so on).
This will avoid storing huge combinations of train schedules into the db since every date is translated into one of the weekdays and we are not missing any data.
You can create a table for storing days which will have at most 7 days.
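A minimal sketch of the day-of-week idea, again on SQLite through Python. The table names, the lower-case column names, station code "AAA" and route numbers are all made up, and weekdays are numbered the way Python's date.weekday() does it (0 = Monday).

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Which days of the week each route runs (0 = Monday ... 6 = Sunday).
    CREATE TABLE route_days (
        routenumber INTEGER, weekday INTEGER,
        PRIMARY KEY (routenumber, weekday)
    );
    CREATE TABLE route_legs (
        routenumber INTEGER, leg_number INTEGER,
        stn_code TEXT, dep_time TEXT, arr_time TEXT
    );
""")
# Route 101 runs Monday to Friday, route 201 on Saturday and Sunday.
conn.executemany("INSERT INTO route_days VALUES (?, ?)",
                 [(101, d) for d in range(5)] + [(201, 5), (201, 6)])
conn.executemany("INSERT INTO route_legs VALUES (?, ?, ?, ?, ?)", [
    (101, 1, "AAA", "08:00", "08:00"),
    (201, 1, "AAA", "09:30", "09:30"),
])

def departures_on(day, stn_code):
    """Departures from a station on a calendar date, resolved via its weekday."""
    return conn.execute("""
        SELECT l.routenumber, l.dep_time
        FROM route_legs AS l
        JOIN route_days AS d ON d.routenumber = l.routenumber
        WHERE d.weekday = ? AND l.stn_code = ?
        ORDER BY l.dep_time
    """, (day.weekday(), stn_code)).fetchall()

print(departures_on(date(2024, 1, 8), "AAA"))   # a Monday
print(departures_on(date(2024, 1, 13), "AAA"))  # a Saturday
```

The leg table stays the same size no matter how many dates are covered, so nothing has to be regenerated at midnight. Exceptions such as holidays would need an extra table, which this sketch leaves out.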
I would suggest modeling the database so that each station is a touch point rather than just a station ID.
You can also introduce a hub concept in the design, to group the 3-4 stations that belong to the same city.
Each station is a touch point that is supported by facilities such as boarding point, halt point, etc.
This matters because not all stations are boarding points for all trains.
Facilities are what is available at the different stations,
and not all facilities are available for all trains.
For example, Kazipet is a station that is also a junction, but for a few trains, on a few routes,
the train passes through the station and halts there without allowing new passengers to board.
It does, however, allow boarding on the reverse routes.