How to optimize my pathfinding algorithm on a GTFS network - mysql

Intro
I'm using a slightly modified GTFS database.
I have a first step algorithm that given two geographical locations provides:
the list of stops around departure and arrival
the list of routes that connects those list of stops
The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.
My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).
Database
The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):
stops: stop information (geolocation, name, etc)
stop_times: timetable
trips: itinerary taken by a vehicle (bus, metro, etc)
routes: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)
I have added the following tables
stop_connections: stop to stop connections (around 1 to 20)
stops_routes: lists the available routes at every stop
Here's the table row count in a city where I get slow results (Paris, France):
stops: 28k
stop_times: 12M
trips: 513k
routes: 1k
stop_connections: 365k
stops_routes: 227k
Algorithm
The first step of my algo takes two latitude/longitude points as input, and provides:
the list of stops at each location
the routes that can be used to connect those stops (with up to two connections)
The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.
This is the part that I'm trying to optimize. Here's how I'm querying the database:
My search terms (green in the picture):
one departure stop
several arrival stops (1 to 20)
allowed routes at departure, at first connection and on last trip
service ID (not relevant here, can be ignored)
Here's what I do now:
Start from a stop => get timetable => get trips => get routes; filter on allowed routes.
Connect the arrival stops of the first trip to a list of possible stops using stop_connections
Repeat from step 1 two times so that I have 3 trips/2 connections
The problem
This is working fine on some cases, but it can be very slow in others. Usually as soon as I join the timetable or the stop connections, there is a 10x increase of the returned rows. Since I'm joining these table 8 times, there are potentially 10^8 rows to be searched by the engine.
Now I'm sure that I can get this to be more efficient.
My problem is that the number of rows increases at every join, and the arrival stop selection is made at the very end.
I mean I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip, I can filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.
Optimizations
Here's what I tried/thought of:
1. Inner join stops_routes when joining stop_connections
Only select stops at a connection that lead to the allowed routes at next trip.
This is sometimes efficient when there is a lot of connections and not all the connected stops are interesting (some connected stop might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.
2. Partition the stop_times table
Create a smaller copy of the stop_times that contains only the timetable of the next two hours or so. Indeed, having the database engine search for the timetable (up to 10pm for example) when my trips starts at 8am is useless. Keeping only 8am-10am is enough and much faster.
This is very efficient, because it dramatically decreases the number of rows to be searched.
I have implemented this with success, it decreased the search time by a factor of about 10x or even 100x.
3. Identify 'good' and 'bad' routes
There is usually, in a metropolitan area, large routes that are very useful when travelling large distances. But these routes aren't the best option when travelling small distances. A human person who knows his own city's public transportation system will quickly tell that from this neighborhood to this other, the best option is to take a specific route.
However this is very difficult to do, and requires a customization on every city.
I plan to make this algo completely independant of the city, so I'm not really willing to go down that road
4. Use crowdsourcing to identify paths that work well
The first search is slow, but the information taken from it can be used to serve fast result to the next person with a similar journey.
However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.
I don't know if this is a good idea. I haven't implemented it.
Next
I'm running out of ideas. I know this is not a programming question, but rather a request of ideas on an algorithm. I hope this falls into the SO scope.

Having it on a network makes things a little bit interesting, but fundamentally, you're doing pathfinding, which is a slow process. You're running into the exponential nature of the problem, and doing so with only 3 connections.
I have a couple suggestions that you can perhaps use while doing this with mysql, and a couple that are likely not implementable within it.
Rather than partitioning the timetable, only take the next time for any given route. If you're leaving at 8 AM, you're correct, only looking at routes from 8-10 is better than looking at them all. However, if there's a route from A-B that leaves at 8:20, 8:40, 9:00, 9:15, 9:25, 9:45... there is zero reason to take them all: just take the first arrival time for any given route, since it's strictly better than the rest.
I presume you are pruning any routes that return to an already-visited location? If not, you perhaps should be: they're not useful for you. This may be somewhat difficult to do within the SQL framework.
Depending on its coverage, you could perhaps find a path using the (much smaller) routes table, and then find the best implementation of the top working paths from the trips table.
This is likely impossible within the framework of SQL, but the thing that makes most decent pathfinding algorithms fast is that they use a heuristic to search. Your search goes down every possible route -- it would be a lot faster to first look down the route that leads in the right direction. If it doesn't pan out, less likely directions are picked. The key here is that as soon as you have a result, you return it -- you effectively pruned every route you didn't yet search by the time you returned an answer.
Pre-calculated preferred routes: you suggest this would require human intervention, but I counter that you could do it computationally. Spend the time properly searching for routes from various points to various other points, and check on the statistics of how the routes worked. I would expect that you will find things allowing you to make a "anywhere over here to anywhere over there is going to use this intermediate path" table -- your problem is reduced from "find a path from A to B" to "find a path from A to C, followed by a path from D to B". Doing this will have the potential of causing you to find sub-optimal routes (as you are making an assumption from the precalculated statistics), but it may let you find that sub-optimal route much faster. On a mesh layout it will not work at all well; on a hub layout it will work excellently.

Thanks to zebediah49, I have implemented the following algorithm:
0. Lookup tables
First, I have created an ID on the trips table, that uniquely identifies it. It is based on the list of stops taken in sequence. So this ID guarantees that two trips with the same ID will take exactly the same route.
I called this ID trip_type.
I have improved my stop_connections table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.
1. Get trips running from the departure stop(s)
Limit those trips to only 1 per trip type (group by trip_type)
2. Get arrival stops from these trips
Select only the best trip if there are two trips reaching the same stop
3. Get connected stops from these arrival stops
Select only the best connection if there are >1 stops that are connected to the same stop
4. Repeat from step 1
I have splitted this into several subqueries and temporary tables, because I can easily group and filter the best stops/trips at each step. This ensures that the minimum searches are sent to the SQL server.
I have stored this algorithm into an SQL procedure, that will do this in a single SQL statement:
call Get2CJourneys(dt, sd, sa, r1, r2, r3)
Where:
dt: departure time
sd: stops at departure point
sa: stops at arrival point
r1, r2, r3: allowed routes for the 1st, 2nd and 3rd trips
The procedure call returns interesting results in <600ms where my previous algorithm returned the same results in several minutes.

Expanding on #zebedia49's fourth point, you can precompute the vector traveled by a route, e.g. a route going due north has a vector of 0, due west = 90, due south = 180, due east = 270. Only return routes whose vectors are within, say, +/- 15 modulo 360 degrees from the as-the-crow-flies route (or +/- 30 if the +/- 15 query doesn't return any hits).

Related

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example. For every user in a large set of users, a complex formule, based on variable amounts of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the amount of posts posted, connections accepted, messages sent, amount of sessions in a time period of one month, .. other things done etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If let's say, 500 users are active, all of them are performing actions, the database would be having a hard time I think. There would them also run a script to re-calculate the score for inactive users (to lower their score)
Once a week: If we have for example 5.000 users (for our first phase), than that would result into running the calculation formula 5.000 times and could take a long time and will increase in time when more users join.
The calculation-queries for a single variable in a the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections" which takes a few joins.
I started with "logging" every action into a table for this purpose, just the counter values and increase/decrease them with every action and running the formula with these values (a record per week). This works but can't be applied for every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that real-time on every swipe, every setting change, .. or if they have like a script that runs continiously for a small batch of users each time.
Where it all comes down to. What would be the most efficient/non-database-table-locking way to do this, with keeping the idea in mind that there will be a moment that we're having 50.000 users for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.

What is the efficiency or other difference between these two queries?

I downloaded the Yelp dataset and put it into MySQL as the datasets I have been working with have been too small to really have to consider efficiency. I am trying to unlearn or become aware of bad SQL habits which will cause problems with larger datasets.
Here are two ways of pulling exactly the same information out of the database:
USE yelp_db;
SELECT name, hours FROM business
LEFT JOIN hours
ON business.id = hours.business_id;
-- time taken 0,0025sec, 776071 rows returned
SELECT name, hours FROM
(SELECT name, id from business) AS b
LEFT JOIN
(SELECT hours, business_id from hours) AS h
ON b.id = h.business_id;
-- time taken 0,0015sec, 776071 rows returned
Here is a sample of the output:
John's Chinese BBQ Restaurant NULL
Primal Brewery Monday|16:00-22:00
Primal Brewery Tuesday|16:00-22:00
Primal Brewery Friday|12:00-23:00
The first method takes 3 lines but appears to be slightly slower than the second method which takes 5 lines.
Is either of these methods preferred in terms of efficiency or elegance and if so why?
The first method is preferred for both performance and elegance -- your results not withstanding.
I'm a little suspicious about the timings. I would expect more than a millisecond or two to return close to a million rows.
In any case, most versions of MySQL (the most recent might be exceptions) materialize subqueries. This adds an additional layer of writes and reads to the query. It can also prevent the use of indexes.
As for elegance, unnecessary subqueries do nothing for "elegance". This might be a matter of opinion, but I'm guessing it is pretty wide-spread.
Just to expand on #GordanLinoff 's excellent answer why you would see this difference.
If you ran them in the order shown simple caching of the data from the first one could explain the timing. This caching can happen many places all the way down to the disc controllers.
The only way to test with useful results is to run many iterations and average the results after clearing all the caches.

One trip, multiple stop_times

Suppose i have one bus, which departures at 08:00 and 10:00.
Since the bus uses the same stops, stop_sequence, should i split the trip to a specific stop_time, or can i use the same trip_id for multiple stop_times.
Example:
TripA - Stop_timesA (Departures at 08:00, TripA), Stop_timesB (Departures at 10:00, TripA)
Or
TripA - Stop_timesA (Departures at 08:00, TripA)
TripB - Stop_timesB (Departures at 10:00, TripB)
Thanks.
Generally, you'd use separate trips.
For example, if the two stop-times both belonged to the same trip AND had the same stop_sequence values, it'd typically be considered a validation error, since stop_sequence should be unique for each stop-time in the same trip.
Just to be clear, there is nothing wrong with having the same stop appear in the same trip more than once, especially for loop routes. However, if aren't modeling a loop route, you should use separate trips. Otherwise, you are saying that a rider can get on at the first stop-time and ride through all the intermediate stop-times and arrive at the same stop again two hours later. Maybe that's your case, but I'm guessing not.
According the specification for GTFS, "A trip is a sequence of two or more stops that occurs at specific time". This would indicate that each of the departures would be a separate trip and have a separate trip_id in the data set.
However, the question would indicate that each departure (or trip) should be on the same route.
It took me a while to really understand how GTFS really works. The spec is a good place to start and read carefully.
My answer is quite late, but... NO, you shouldn't.
The mechanism to "repeat" the same trip at different times on the same day is by using the frequencies table.
In your example, you would define a single TripA in both trips and stop_times tables.
On the frequencies table, you declare the start_time as "08:00:00", end_time as "11:59:59" and headway_secs as "7200" (two hours). All this means that the trips will run every 2 hours starting at 08:00:00, but no trip will start after 11:59:59 - so there will be only two trips starting at 08:00:00 and 10:00:00.
If you duplicate your trip by creating tripA and tripB, you have at least two problems:
all records will be duplicated on the stop_times table, making the GTFS file bigger/heavier if you have multiple stops and/or multiple trips on the same day
maintenance will be much more complicated - if one single stop is changed, you have to change it in all trip "clones"

MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated via its own algorithm based on similiarity between the 2 given profiles. User A has a rating "match rating" of 85 with User B and 79 with User C for example. B and C have a rating of 94 and so on....
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I came up with 2 solutions:
My first solution was to have a nightly batch job, that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I then can join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2 there are 45 possible combination for 10 users. For 1.000 users I suddenly have to insert 499.500 rating combinations into my rating table.
The second solution was to leave MySQL be and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10.000 users and I want to do a search for every user living in New York sorted by rating, I have to load EVERY user that is living in NY into my app (let's say 3.000), apply the algorithm and then return only the top 100 to the user. This way I have loaded 2.900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with it.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?
If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is to note which users require a change in their profiles (whenever any of the parameters on which the matching is based, changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
Other strategies would be to only run the main algorithm for records which have the greater chance of being compared: in your example, "same city". Or when updating records (but this would require to store (user_1, user_2, ranking, last_calculated), only recalculate those records with high ranking, very old, or never calculated. Lowest ranked matches aren't likely to change so much that they float to the top in a short time.
UPDATE
The problem is also operating with O(N^2) storage space.
How to reduce this space? I think I can see two approaches. One is to not put some information in the match table at all. The "match" function makes the more sense the more it is rigid and steep; having ten thousand "good matches" would mean that matching means very little. So we would still need lotsa recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to recalculate the match, and then we would need to develop some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; which could entail somehow rewriting the match between User1 and User2 to a very simple function of a subset of DataUser1, DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL capabilities and offload some calculations the the MySQL engine.
To this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only o(n), rather than o(n^2), and updates should be, as well. If someone really wants to see matches past the first 100 (and you want to let them) then you have the option of running the query in real time at that point.
I agree with everything #Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
This should work reasonably well for one user logging in. Passably for two . . . but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.

How do I make a simple bus route search Engine?

[Not:e user is asking this again at Development of railway enquiry system, how to model Trains, Stations and Stops? ]
My Problem Description:
Suppose I have a BUS-123 in ROUTE-1 it will travel through A, B, C, D, E, F, G, H and BUS-321 in ROUTE-2 through D, E, F, X, Y, Z .
if someone enters B as a source point and F as a destination point then ROUTE-1 with BUS-123 should display in the result. But if someone enters H as a source and A as destination result should not display, because returning may not always same with one that is traveled.
But if a person enters A as a source and Z as destination then BUS-123 with ROUTE-1 and BUS-321 with ROUTE-2 should display.
My Problem is:
How do I store that route information in Database? if i store in RDBMS like the following
BUS_NUMBER ROUTE_NUMBER VIA_ROUTES
BUS-123 ROUTE-1 A, B, C, D, E, F, G, H
BUS-321 ROUTE-2 D, E, F, X, Y, Z
Then how my search will work. I mean how to search it in a string.
And if I store all the VIA_ROUTES in different different columns then how it will be..? Please suggest me with your own technique. It is not urgent but I am planning to make a basic bus route search, so your comment with help is appreciated.
I'd model it as a cyclic graph. Each bus stop is represented by a vertice. Each direct connection between two stops is represented by an edge labelled with the route number; consequently, each route is a sequence of connected edges. Make the edges directed, too. Not all routes travelling from stop A to stop B will necessarily also travel from stop B to stop A in the other direction.
Probably want to populate each edge with the estimated travel time, a measure (or measures) of variance for that leg -- at 2am on a Sunday night, the variance might be low, but at 5pm on a Friday evening, it might be very high, and list of departure times as well.
Then its a matter of graph traversal and finding the "least cost" route, however you choose to define "least cost" -- Factors you might want to consider would include:
Total travel time
Total time spent waiting for the next leg to depart.
Wait time at any individual stop.
Distance?
One should note that too much wait time is bad (ever spend 40 minutes waiting for a bus in January when it's -10 F?). Too little is bad, too, as it increases the probability of missing a connection, given that buses tend to have a fairly large variability to their schedules since they are highly responsive to fluctuations in local traffic conditions.
That's how I would do it.
I don't believe I'd try to solve it directly in SQL, though.
The model is a good fit for SQL, though. You need the following entities, and then some, since you'll need to represent schedules, etc.:
Stop. A Bus stop. The vertices of the graph.
Route. A bus route.
Segment. The direct link between two stops. The edges of the graph.
RouteSegment. An associative entity representing ordered sequence of segments that composes the route.
I think the bus_numbers aren't important because you can look them up later. Maybe what you need is to create a 2d matrix with the bus_stops in a big matrix having them all and then use a graph traversing algorithm like dijkstra to find the shortest path from A to B. When you got that you can easily lookup the bus_numbers and show them to the client. Thus I think your database is already very good.
I'd have a route table and a route_part table. The latter would contain a reference to the route, plus an ordinal number for sorting, and a reference to a stop table. Thus, you can store any route.
In terms of searching, if you wish to search for a route between two stops, you could look up the two stops in the route_part table and see if they appear on the same route in any cases (bearing in mind that a route may exist in one direction and not the other).