How do I make a simple bus route search engine? - mysql

[Note: the user is asking this again at "Development of railway enquiry system, how to model Trains, Stations and Stops?"]
My Problem Description:
Suppose BUS-123 on ROUTE-1 travels through A, B, C, D, E, F, G, H, and BUS-321 on ROUTE-2 travels through D, E, F, X, Y, Z.
If someone enters B as the source and F as the destination, then ROUTE-1 with BUS-123 should be displayed in the result. But if someone enters H as the source and A as the destination, no result should be displayed, because the return journey may not follow the same route as the outbound one.
If a person enters A as the source and Z as the destination, then BUS-123 with ROUTE-1 and BUS-321 with ROUTE-2 should both be displayed.
My Problem is:
How do I store that route information in a database? Suppose I store it in an RDBMS like the following:
BUS_NUMBER ROUTE_NUMBER VIA_ROUTES
BUS-123 ROUTE-1 A, B, C, D, E, F, G, H
BUS-321 ROUTE-2 D, E, F, X, Y, Z
Then how would my search work? I mean, how would I search within a string?
And if I store each of the VIA_ROUTES stops in a separate column, how would that work? Please suggest your own technique. It is not urgent, but I am planning to make a basic bus route search, so your comments and help are appreciated.

I'd model it as a cyclic graph. Each bus stop is represented by a vertex. Each direct connection between two stops is represented by an edge labelled with the route number; consequently, each route is a sequence of connected edges. Make the edges directed, too: not all routes travelling from stop A to stop B will necessarily also travel from stop B to stop A in the other direction.
You probably want to annotate each edge with the estimated travel time, a measure (or measures) of variance for that leg (at 2am on a Sunday night the variance might be low, but at 5pm on a Friday evening it might be very high), and a list of departure times as well.
Then it's a matter of graph traversal and finding the "least cost" route, however you choose to define "least cost". Factors you might want to consider include:
Total travel time
Total time spent waiting for the next leg to depart.
Wait time at any individual stop.
Distance?
One should note that too much wait time is bad (ever spend 40 minutes waiting for a bus in January when it's -10 F?). Too little is bad, too, as it increases the probability of missing a connection, given that buses tend to have a fairly large variability to their schedules since they are highly responsive to fluctuations in local traffic conditions.
That's how I would do it.
I don't believe I'd try to solve it directly in SQL, though.
The model is a good fit for SQL, though. You need the following entities, and then some, since you'll need to represent schedules, etc.:
Stop. A Bus stop. The vertices of the graph.
Route. A bus route.
Segment. The direct link between two stops. The edges of the graph.
RouteSegment. An associative entity representing the ordered sequence of segments that compose the route.
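As a rough illustration of the directed-graph model described above, here is a minimal Python sketch. It uses only the two sample routes from the question; the names `ROUTES`, `build_graph` and `find_path` are made up for this example, and a plain breadth-first search (fewest hops) stands in for whatever "least cost" measure you eventually choose.

```python
from collections import deque

# Sample data from the question: each route is an ordered list of stops.
ROUTES = {
    "ROUTE-1": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "ROUTE-2": ["D", "E", "F", "X", "Y", "Z"],
}

def build_graph(routes):
    """Return {stop: {next_stop: set_of_route_names}} using directed edges only."""
    graph = {}
    for name, stops in routes.items():
        for u, v in zip(stops, stops[1:]):
            graph.setdefault(u, {}).setdefault(v, set()).add(name)
            graph.setdefault(v, {})  # make sure terminal stops exist as vertices
    return graph

def find_path(graph, source, target):
    """Breadth-first search: path with the fewest hops, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        stop = queue.popleft()
        if stop == target:
            path = []
            while stop is not None:
                path.append(stop)
                stop = parent[stop]
            return path[::-1]
        for nxt in graph.get(stop, {}):
            if nxt not in parent:
                parent[nxt] = stop
                queue.append(nxt)
    return None

GRAPH = build_graph(ROUTES)
print(find_path(GRAPH, "B", "F"))  # ['B', 'C', 'D', 'E', 'F']
print(find_path(GRAPH, "A", "Z"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'X', 'Y', 'Z']
print(find_path(GRAPH, "H", "A"))  # None, because edges only run one way
```

The route labels on each edge (for example `GRAPH["D"]["E"]`) tell you which buses can serve a given hop, so the bus numbers can be looked up once a path has been found.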

I think the bus_numbers aren't important because you can look them up later. Maybe what you need is to put all the bus_stops into one big 2D matrix and then use a graph traversal algorithm like Dijkstra's to find the shortest path from A to B. Once you have that, you can easily look up the bus_numbers and show them to the client. Thus I think your database is already very good.

I'd have a route table and a route_part table. The latter would contain a reference to the route, plus an ordinal number for sorting, and a reference to a stop table. Thus, you can store any route.
In terms of searching, if you wish to search for a route between two stops, you could look up the two stops in the route_part table and see if they appear on the same route in any cases (bearing in mind that a route may exist in one direction and not the other).
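To make the route / route_part idea concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The table and column names are assumptions for illustration, and the stop name is stored inline rather than as a reference to a separate stop table to keep the example short. The direction check is the `a.ordinal < b.ordinal` condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE route (id INTEGER PRIMARY KEY, route_number TEXT, bus_number TEXT);
CREATE TABLE route_part (
    route_id INTEGER REFERENCES route(id),
    ordinal  INTEGER,  -- position of the stop along the route
    stop     TEXT      -- a fuller schema would reference a stop table here
);
""")

# Sample data from the question.
sample = {
    ("ROUTE-1", "BUS-123"): "ABCDEFGH",
    ("ROUTE-2", "BUS-321"): "DEFXYZ",
}
for route_id, ((route_number, bus_number), stops) in enumerate(sample.items(), start=1):
    conn.execute("INSERT INTO route VALUES (?, ?, ?)", (route_id, route_number, bus_number))
    conn.executemany(
        "INSERT INTO route_part VALUES (?, ?, ?)",
        [(route_id, i, s) for i, s in enumerate(stops)],
    )

def direct_routes(source, destination):
    """Routes that visit `source` strictly before `destination`."""
    rows = conn.execute(
        """
        SELECT r.bus_number, r.route_number
        FROM route_part AS a
        JOIN route_part AS b ON b.route_id = a.route_id AND a.ordinal < b.ordinal
        JOIN route AS r ON r.id = a.route_id
        WHERE a.stop = ? AND b.stop = ?
        ORDER BY r.route_number
        """,
        (source, destination),
    )
    return rows.fetchall()

print(direct_routes("B", "F"))  # [('BUS-123', 'ROUTE-1')]
print(direct_routes("H", "A"))  # []
```

This only finds direct journeys; a search such as A to Z, which needs a change of bus, returns nothing here and would need a further self-join or a graph search.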


Can I find price floors and ceilings with cuda

Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction, but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable for the floor and ceiling. If the current price breaks through the floor or ceiling, or changes more than the reversal threshold, I can take the appropriate action.
My question is, is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local max / min but not the reversal points. Small fluctuations produce numerous relative max / min, but most of them are trivial because the change is not greater than the reversal size.
My question is, is there a way to find these reversal points in parallel?
One possible approach:
use thrust::unique to remove periods where the price is numerically constant
use thrust::adjacent_difference to produce 1st difference data
use thrust::adjacent_difference on 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
use these points of change in sign of slope to identify separate regions of data - build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run
Wherever the net change of the run exceeds a threshold, flag as a "reversal"
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.
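To show the data flow of the steps above, here is a sequential, pure-Python stand-in. It is not CUDA or Thrust code: each step corresponds to a data-parallel primitive (unique, adjacent difference, prefix sum, keyed reduction), and a per-run sum is used here as an assumption in place of the keyed scan to obtain the net change. The function name and sample prices are invented for the example.

```python
from itertools import accumulate, groupby

def find_reversal_runs(prices, threshold):
    """Return (run_index, net_change) for runs whose net change exceeds threshold."""
    # Step 1: drop consecutive duplicates (the job thrust::unique does).
    uniq = [p for p, _ in groupby(prices)]
    # Step 2: first differences (adjacent_difference); none are zero after step 1.
    d1 = [b - a for a, b in zip(uniq, uniq[1:])]
    # Step 3: flag every position where the sign of the slope changes.
    flags = [0] + [1 if (a > 0) != (b > 0) else 0 for a, b in zip(d1, d1[1:])]
    # Step 4: a prefix sum over the flags gives each difference a run key.
    keys = list(accumulate(flags))
    # Step 5: net change of each run (sum of the differences sharing a key).
    net = {}
    for key, diff in zip(keys, d1):
        net[key] = net.get(key, 0) + diff
    # Step 6: keep the runs whose net change exceeds the reversal threshold.
    return [(k, change) for k, change in sorted(net.items()) if abs(change) > threshold]

prices = [10, 10, 11, 12, 12, 11, 10, 8, 9, 13]
print(find_reversal_runs(prices, threshold=3))  # [(1, -4), (2, 5)]
```

In this sample the first up-run only gains 2 and is ignored, while the following down-run (-4) and up-run (+5) are flagged. As the answer notes, a move that spans several small runs would not be caught by this single-run definition.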

How to generalise over multiple dependent actions in Reinforcement Learning

I am trying to build an RL agent to price paid-for airline seats (the seats, not the ticket). The general setup is:
After choosing their flights (for n people on a booking), a customer will view a web page with the available seat types and their prices visible.
They select between zero and n seats from a seat map with a variety of different prices for different seats, to be added to their booking.
After perhaps some other steps, they pay for the booking and the agent is rewarded with the seat revenue.
I have not decided on a general architecture yet. I want to take various booking and flight information into account, so I know I will be using function approximation (most likely a neural net) to generalise over the state space.
However, I am less clear on how to set up my action space. I imagine an action would amount to a vector with a price for each different seat type. If I have, for example, 8 different seat types, and 10 different price points for each, this gives me a total of 10^8 different actions, many of which will be very similar. Additionally, each sub-action (pricing one seat type) is somewhat dependent on the others, in the sense that the price of one seat type will likely affect the demand (and hence reward contribution) for another. Hence, I doubt the problem can be decomposed into a set of sub-problems.
I'm interested if there has been any research into dealing with a problem like this. Clearly any agent I build needs some way to generalise across actions to some degree, since collecting real data on millions of actions is not possible, even just for one state.
As I see it, this comes down to two questions:
Is it possible to get an agent to understand actions in relative terms? Say for example, one set of potential prices is [10, 12, 20]. Can I get my agent to realise that there is a natural ordering there, and that the first two pricing actions are more similar to each other than to the third possible action?
Further to this, is it possible to generalise from this set of actions - could an agent be set up to understand that the set of prices [10, 13, 20] is very similar to the first set?
I haven't been able to find any literature on this, especially relating to the second question - any help would be much appreciated!
Correct me if I'm wrong, but I am going to assume this is what you are asking and will answer accordingly.
I am building an RL agent and it needs to be smart enough to understand that if I were to buy one airplane ticket, it will subsequently affect the price of other airplane tickets because there is now less supply.
Also, the RL agent must realize that actions very close to each other are relatively similar actions, such as [10, 12, 20] ≈ [10, 13, 20]
1) In order to provide memory to your RL agent, you can do this in two ways. The easy way is to feed the states as a vector of past purchased tickets, as well as the current ticket.
Example: Let's say we build the RL agent to remember at least the past 3 transactions. At the very beginning, our state vector will be [0, 0, 3], meaning that there were no previous ticket purchases (the zeros), and currently, we are purchasing ticket #3. Then, the next time step's state vector can be [0, 3, 6], telling the RL agent that previously, ticket #3 was picked, and now we're buying ticket #6. The neural network will learn that the state vector [0, 0, 6] should map to a different outcome compared to [0, 3, 6], because in the first case, ticket #6 was the first ticket purchased and there was lots of supply. But in the second case, ticket #3 was already sold, so now all the remaining tickets went up in price.
The proper and more complex way would be to use a recurrent neural network as your function approximator for your RL agent. The recurrent neural network architecture allows for certain "important" states to be remembered by the neural network. In your case, the number of tickets purchased previously is important, so the neural network will remember the previously purchased tickets, and calculate the output accordingly.
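The fixed-length history from the example above (the "easy way") can be kept with a small sliding window. This is only a sketch: the class name `PurchaseHistory` is hypothetical, and 0 is used as the "no purchase yet" padding value exactly as in the example.

```python
from collections import deque

class PurchaseHistory:
    """Keeps the last `length` ticket ids, padded with 0 for 'no purchase yet'."""

    def __init__(self, length=3):
        self._window = deque([0] * length, maxlen=length)

    def record(self, ticket_id):
        """Add the current purchase and return the state vector for the agent."""
        self._window.append(ticket_id)  # the oldest entry falls off the left
        return list(self._window)

history = PurchaseHistory()
print(history.record(3))  # [0, 0, 3]
print(history.record(6))  # [0, 3, 6]
```

The returned list is what would be fed to the function approximator as (part of) the state.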
2) Any function approximation reinforcement learning algorithm will automatically generalize sets of actions close to each other. The only RL architectures that would not do this would be tabular based approaches.
The reason is the following:
We can think of these function approximators simply as a line. Neural networks simply build a highly nonlinear continuous line (neural networks are trained using backpropagation and gradient descent, so they must be continuous), and a set of states will map to a unique set of outputs. Because it is a line, sets of states that are very similar SHOULD map to outputs that are also very close. In the most basic case, imagine y = 2x. If our input x = 1, our y = 2. And if our input x is 1.1, which is very close to 1, our output y = 2.2, which is very close to 2 because they are on a continuous line.
For tabular approach, there is simply a matrix. On the y axis, you have the states, and on the x axis, you have the actions. In this approach, the states and actions are discrete. Depending on the discretization, the difference can be massive, and if the system is poorly discretized, the actions very close to each other MAY not be generalized.
I hope this helps, please let me know if anything is unclear.

Will alpha-beta pruning remove randomness in my solution with minimax?

Existing implementation:
In my implementation of Tic-Tac-Toe with minimax, I look for all boxes where I can get the best result and choose one of them randomly, so that the same solution isn't displayed each time.
For example, if the returned list is [1, 0, 1, -1], at some point, I will randomly choose between the two highest values.
Question about Alpha-Beta Pruning:
Based on what I understood, when the algorithm finds that it is winning from one path, it would no longer need to look for other paths that might/ might not lead to a winning case.
So will this, like I feel, cause the earliest possible box that leads to the best solution to be displayed as the result and seem the same each time? For example at the time of first move, all moves lead to a draw. So will the 1st box be selected each time?
How can I bring randomness to the solution like with the minimax solution? One way that I thought about now could be to randomly pass the indices to the alpha-beta algorithm. So the result will be the first best solution in that randomly sorted list of positions.
Thanks in advance. If there is some literature on this, I'd be glad to read it.
If someone could post some good reference for alpha-beta pruning, that'll be excellent as I had a hard time understanding how to apply it.
To randomly pick among multiple best solutions (all equal) in alpha-beta pruning, you can modify your evaluation function to add a very small random number whenever you evaluate a game state. You should just make sure that the magnitude of that random number is never greater than the true difference between the evaluations of two states.
For example, if the true evaluation function for your game state can only return values -1, 0, and 1, you could add a randomly generated number in the range [0.0, 0.01] to the evaluation of every game state.
Without this, alpha-beta pruning doesn't necessarily find only one solution. Consider this example from Wikipedia. In the middle, you see that two solutions with an evaluation of 6 were found, so it can find more than one. I do actually think it will still find all moves leading to optimal solutions at the root node, but not actually find all solutions deep down in the tree. Suppose, in the example image, that the pruned node with a score of 9 in the middle actually had a score of 6. It would still get pruned there, so that particular solution wouldn't be found, but the move from the root node leading to it (the middle move at root) would still be found. So, eventually, you would be able to reach it.
Some interesting notes:
This implementation would also work in minimax, and avoid the need to store a list of multiple (equally good) solutions
In more complex games than Tic Tac Toe, where you cannot search the complete state space, adding a small random number for the max player and deducting a small random number for the min player like this may actually slightly improve your heuristic evaluation function. The reason for this is as follows. Suppose in state A you have 5 moves available, and in state B you have 10 moves available, which all result in the same heuristic evaluation score. Intuitively, the successors of state B may be slightly better, because you had more moves available; in many games, having more moves available means that you are in a better position. Because you generated 10 random numbers for the 10 successors of state B, it is also a bit more likely that the highest generated random number is among those 10 (instead of the 5 numbers generated for successors of A)
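The small-random-number idea from the answer above can be tried on a toy game tree. This is a sketch, not a Tic-Tac-Toe engine: the tree is a hypothetical nested list whose leaves are the true scores -1, 0 or 1, and the noise range [0.0, 0.01] is the one suggested in the answer.

```python
import random

INF = float("inf")

def alphabeta(node, maximizing, alpha, beta, rng, eps=0.01):
    """Alpha-beta over a nested-list tree; leaves are ints in {-1, 0, 1}."""
    if isinstance(node, int):
        # Tie-breaking noise; eps must stay below the gap between true scores (1 here).
        return node + rng.uniform(0.0, eps)
    value = -INF if maximizing else INF
    for child in node:
        score = alphabeta(child, not maximizing, alpha, beta, rng, eps)
        if maximizing:
            value = max(value, score)
            alpha = max(alpha, value)
        else:
            value = min(value, score)
            beta = min(beta, value)
        if alpha >= beta:
            break  # prune the remaining children
    return value

def best_move(children, rng):
    """Index of the root move with the highest (noisy) value for the max player."""
    best_index, best_value = None, -INF
    for index, child in enumerate(children):
        value = alphabeta(child, False, best_value, INF, rng)
        if value > best_value:
            best_index, best_value = index, value
    return best_index

# Root is a max node; each child is a min node. True minimax values: 0, 0, -1, 0.
TREE = [[0, 1], [0, 0], [-1, 1], [1, 0]]
picks = {best_move(TREE, random.Random(seed)) for seed in range(200)}
print(sorted(picks))  # only indices of the drawn moves (0, 1, 3) can appear
```

Because the noise never exceeds the gap between true scores, the losing move (index 2) is never chosen, while the equally good moves are picked in varying order.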

How to optimize my pathfinding algorithm on a GTFS network

Intro
I'm using a slightly modified GTFS database.
I have a first step algorithm that given two geographical locations provides:
the list of stops around departure and arrival
the list of routes that connects those list of stops
The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.
My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).
Database
The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):
stops: stop information (geolocation, name, etc)
stop_times: timetable
trips: itinerary taken by a vehicle (bus, metro, etc)
routes: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)
I have added the following tables
stop_connections: stop to stop connections (around 1 to 20)
stops_routes: lists the available routes at every stop
Here's the table row count in a city where I get slow results (Paris, France):
stops: 28k
stop_times: 12M
trips: 513k
routes: 1k
stop_connections: 365k
stops_routes: 227k
Algorithm
The first step of my algo takes two latitude/longitude points as input, and provides:
the list of stops at each location
the routes that can be used to connect those stops (with up to two connections)
The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.
This is the part that I'm trying to optimize. Here's how I'm querying the database:
My search terms (green in the picture):
one departure stop
several arrival stops (1 to 20)
allowed routes at departure, at first connection and on last trip
service ID (not relevant here, can be ignored)
Here's what I do now:
Start from a stop => get timetable => get trips => get routes; filter on allowed routes.
Connect the arrival stops of the first trip to a list of possible stops using stop_connections
Repeat from step 1 two times so that I have 3 trips/2 connections
The problem
This works fine in some cases, but it can be very slow in others. Usually, as soon as I join the timetable or the stop connections, there is a 10x increase in the number of returned rows. Since I'm joining these tables 8 times, there are potentially 10^8 rows to be searched by the engine.
Now I'm sure that I can get this to be more efficient.
My problem is that the number of rows increases at every join, and the arrival stop selection is made at the very end.
I mean I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip, I can filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.
Optimizations
Here's what I tried/thought of:
1. Inner join stops_routes when joining stop_connections
Only select stops at a connection that lead to the allowed routes at next trip.
This is sometimes efficient when there are a lot of connections and not all the connected stops are interesting (some connected stops might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.
2. Partition the stop_times table
Create a smaller copy of the stop_times table that contains only the timetable of the next two hours or so. Indeed, having the database engine search the whole timetable (up to 10pm for example) when my trip starts at 8am is useless. Keeping only 8am-10am is enough and much faster.
This is very efficient, because it dramatically decreases the number of rows to be searched.
I have implemented this successfully; it decreased the search time by a factor of about 10 or even 100.
3. Identify 'good' and 'bad' routes
In a metropolitan area there are usually large routes that are very useful when travelling large distances. But these routes aren't the best option when travelling small distances. A person who knows their own city's public transportation system will quickly tell you that from this neighborhood to that other one, the best option is to take a specific route.
However this is very difficult to do, and requires customization for every city.
I plan to make this algo completely independent of the city, so I'm not really willing to go down that road.
4. Use crowdsourcing to identify paths that work well
The first search is slow, but the information taken from it can be used to serve fast results to the next person with a similar journey.
However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.
I don't know if this is a good idea. I haven't implemented it.
Next
I'm running out of ideas. I know this is not a programming question, but rather a request of ideas on an algorithm. I hope this falls into the SO scope.
Having it on a network makes things a little bit interesting, but fundamentally, you're doing pathfinding, which is a slow process. You're running into the exponential nature of the problem, and doing so with only 3 connections.
I have a couple suggestions that you can perhaps use while doing this with mysql, and a couple that are likely not implementable within it.
Rather than partitioning the timetable, only take the next time for any given route. If you're leaving at 8 AM, you're correct, only looking at routes from 8-10 is better than looking at them all. However, if there's a route from A-B that leaves at 8:20, 8:40, 9:00, 9:15, 9:25, 9:45... there is zero reason to take them all: just take the first arrival time for any given route, since it's strictly better than the rest.
I presume you are pruning any routes that return to an already-visited location? If not, you perhaps should be: they're not useful for you. This may be somewhat difficult to do within the SQL framework.
Depending on its coverage, you could perhaps find a path using the (much smaller) routes table, and then find the best implementation of the top working paths from the trips table.
This is likely impossible within the framework of SQL, but the thing that makes most decent pathfinding algorithms fast is that they use a heuristic to search. Your search goes down every possible route -- it would be a lot faster to first look down the route that leads in the right direction. If it doesn't pan out, less likely directions are picked. The key here is that as soon as you have a result, you return it -- you effectively pruned every route you didn't yet search by the time you returned an answer.
Pre-calculated preferred routes: you suggest this would require human intervention, but I counter that you could do it computationally. Spend the time properly searching for routes from various points to various other points, and check on the statistics of how the routes worked. I would expect that you will find things allowing you to make an "anywhere over here to anywhere over there is going to use this intermediate path" table -- your problem is reduced from "find a path from A to B" to "find a path from A to C, followed by a path from D to B". Doing this will have the potential of causing you to find sub-optimal routes (as you are making an assumption from the precalculated statistics), but it may let you find that sub-optimal route much faster. On a mesh layout it will not work at all well; on a hub layout it will work excellently.
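The first suggestion above (keep only the next departure per route instead of every departure in the time window) could look roughly like this in application code. It is a sketch under assumptions: times are minutes since midnight, the route label "A-B" and the function name are invented, and it assumes later trips on the same route never overtake earlier ones.

```python
def next_departures(departures, earliest):
    """Keep only the first departure at or after `earliest` for each route.

    departures: iterable of (route, minutes_since_midnight) pairs.
    Returns {route: departure_time}.
    """
    best = {}
    for route, time in departures:
        if time >= earliest and (route not in best or time < best[route]):
            best[route] = time
    return best

# Route A-B leaves at 8:20, 8:40 and 9:00; route A-C at 7:50 and 8:25.
timetable = [("A-B", 500), ("A-B", 520), ("A-B", 540), ("A-C", 470), ("A-C", 505)]
print(next_departures(timetable, earliest=480))  # {'A-B': 500, 'A-C': 505}
```

In SQL the same effect is usually obtained with a `GROUP BY` on the route and `MIN()` on the departure time, which keeps the row count from multiplying at each join.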
Thanks to zebediah49, I have implemented the following algorithm:
0. Lookup tables
First, I have created an ID on the trips table, that uniquely identifies it. It is based on the list of stops taken in sequence. So this ID guarantees that two trips with the same ID will take exactly the same route.
I called this ID trip_type.
I have improved my stop_connections table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.
1. Get trips running from the departure stop(s)
Limit those trips to only 1 per trip type (group by trip_type)
2. Get arrival stops from these trips
Select only the best trip if there are two trips reaching the same stop
3. Get connected stops from these arrival stops
Select only the best connection if there are >1 stops that are connected to the same stop
4. Repeat from step 1
I have split this into several subqueries and temporary tables, because that way I can easily group and filter the best stops/trips at each step. This ensures that as few searches as possible are sent to the SQL server.
I have stored this algorithm into an SQL procedure, that will do this in a single SQL statement:
call Get2CJourneys(dt, sd, sa, r1, r2, r3)
Where:
dt: departure time
sd: stops at departure point
sa: stops at arrival point
r1, r2, r3: allowed routes for the 1st, 2nd and 3rd trips
The procedure call returns interesting results in <600ms where my previous algorithm returned the same results in several minutes.
Expanding on @zebediah49's fourth point, you can precompute the bearing travelled by a route, e.g. a route going due north has a bearing of 0, due west = 90, due south = 180, due east = 270. Only return routes whose bearings are within, say, +/- 15 degrees (modulo 360) of the as-the-crow-flies bearing from departure to arrival (or +/- 30 if the +/- 15 query doesn't return any hits).
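The only subtle part of this filter is the wrap-around at 360 degrees. A minimal sketch, with hypothetical route names and precomputed bearings, and with the 15 and 30 degree tolerances taken from the suggestion above:

```python
def angle_diff(a, b):
    """Smallest absolute difference between two bearings, in degrees (0..180)."""
    return abs((a - b + 180) % 360 - 180)

def routes_towards(route_bearings, target_bearing, tolerance):
    """Names of routes whose bearing is within `tolerance` of the target."""
    return [
        name
        for name, bearing in route_bearings.items()
        if angle_diff(bearing, target_bearing) <= tolerance
    ]

def filter_with_fallback(route_bearings, target_bearing):
    """Try +/- 15 degrees first, then widen to +/- 30 if nothing matches."""
    return (routes_towards(route_bearings, target_bearing, 15)
            or routes_towards(route_bearings, target_bearing, 30))

bearings = {"R1": 5, "R2": 350, "R3": 100, "R4": 25}
print(filter_with_fallback(bearings, 0))  # ['R1', 'R2']
```

Note that R2 (bearing 350) is only 10 degrees away from a target of 0, which a naive `abs(a - b)` comparison would miss.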

Randomly Assigning Positions

Here's my basic problem. Let's say I have 50 employees working on a certain day, and I want my program to randomly distribute them to a "position" (i.e. front desk, phones, etc.) based on what they have been trained on. The program already knows what each employee has been trained on. What is the best method programmatically to go through and assign an employee to each of the 50 positions?
P.S. I am programming this into Access using VBA, but this is more a question of process than actual code.
Hi lukewarm,
You are looking for a maximum bipartite matching. This is a problem from graph theory. It boils down to determining the maximum flow in an undirected, bipartite graph with constant edge weights of 1:
You divide all vertices in Your graph in two separate sets. The first set contains all Your workers, the second one all available positions.
Now You insert an edge from every worker to every position she/he is able to work on.
Insert two more vertices: A source and a sink. Connect the source with every worker vertex and the sink with every position vertex.
Determine the maximum flow from source to sink
Hope I could help, greetings.
EDIT: Support for randomness
Since finding the maximum bipartite matching/maximum flow is a deterministic algorithm, it would always return the same result. In order to change that You could mix/shuffle the order of the edges in the graph before applying the algorithm.
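As an illustration of the matching idea with shuffling, here is a Python sketch (not VBA). It uses the simple augmenting-path method for bipartite matching, which gives the same result size as the unit-capacity maximum-flow formulation described above; the employee names, positions and function name are invented for the example.

```python
import random

def random_assignment(trained, positions, rng=random):
    """Maximum bipartite matching with shuffled search order.

    trained: {employee: set of positions the employee is trained on}
    positions: list of distinct positions to fill
    Returns {position: employee}.
    """
    assigned = {}  # position -> employee

    def try_assign(employee, seen):
        options = [p for p in positions if p in trained[employee]]
        rng.shuffle(options)  # randomness: try this employee's positions in random order
        for position in options:
            if position in seen:
                continue
            seen.add(position)
            # Take a free position, or push the current holder to another one.
            if position not in assigned or try_assign(assigned[position], seen):
                assigned[position] = employee
                return True
        return False

    employees = list(trained)
    rng.shuffle(employees)  # randomness: process employees in random order
    for employee in employees:
        try_assign(employee, set())
    return assigned

trained = {
    "Ann": {"front desk", "phones"},
    "Bob": {"phones"},
    "Cy": {"front desk", "filing"},
}
print(random_assignment(trained, ["front desk", "phones", "filing"]))
```

In this tiny example only one full assignment exists (Ann at the front desk, Bob on phones, Cy on filing), so the result is the same on every run; when several full assignments are possible, different runs can return different ones.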
In your position table, have a sequence (1, 2, 3, 4) and a count of positions to be filled. Then look at what the person did yesterday and add 1 to the position sequence; now they're assigned to the next position. If there are already enough people for that position today, go to the next priority position.
Not random but maybe close enough.