Count the number of intersections of many random walks in a graph

I want to run k random walks, each of length lambda, from every node of a connected undirected graph. When two or more walks visit the same node at the same time step, they combine into one walk and continue as a single random walk until the lambda steps finish. I want to know how many walks will have merged by the end of the lambda steps, or at least find a good bound on this number.
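Before looking for a bound, it can help to measure the quantity empirically. Below is a minimal simulation sketch under one assumption: walks merge when they occupy the same node *after* a step (merging the k co-located walks at time 0 would make k irrelevant). The graph representation and function name are mine, not from any library.

```python
import random

def count_merged_walks(adj, k, lam, seed=0):
    """Simulate k random walks from every node of an undirected graph.

    adj: dict mapping node -> list of neighbors.
    Walks occupying the same node at the same step merge into one.
    Returns the number of surviving walks after lam steps.
    """
    rng = random.Random(seed)
    # Start k walkers on every node; only current positions matter.
    positions = [node for node in adj for _ in range(k)]
    for _ in range(lam):
        # Advance every surviving walk by one uniform random step.
        positions = [rng.choice(adj[p]) for p in positions]
        # Walks on the same node at the same time collapse into one.
        positions = list(set(positions))
    return len(positions)

# Toy example: a 4-cycle, 2 walks per node, 10 steps.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(count_merged_walks(cycle, k=2, lam=10))
```

The number of merges is then the initial walk count (k times the number of nodes) minus the returned survivor count; averaging over seeds gives an empirical estimate to compare against any candidate bound.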

Related

LSTM for predicting multiple sequences at the same time

I'm currently working on a project with a dataset that contains smartphone usage data from around 200 users over a period of 4 months. For each user, I have a dataframe consisting of app-log events (name of the app, time, location, etc.). My goal is to predict the dwell time for the next app a user is going to open. I don't want to build one model per user; instead, I'm trying to build a single model for all users combined. Now I'm struggling to find an architecture suitable for this project.
The records are not evenly spaced in time, and the length of each dataframe differs. I want to exploit the temporal dependencies while learning from multiple users at once. My input would therefore be multiple parallel sequences of app usage durations with additional features, and my output again multiple parallel sequences containing the dwell time of the next app. However, since the sequences are neither evenly spaced in time nor of the same length, this setup does not seem suitable. I just wanted to get some ideas on how to structure the data properly and what you think would be a suitable approach. I would really appreciate some ideas or reading recommendations.

Recurrent neural networks and continuous variables

I am having an issue with processing continuous data using an RNN. So far I've used MEL spectrograms as inputs for my Listen, Attend and Spell architecture, but I've decided to experiment with that input by interpolating spectrogram bin data rather than using MEL. Here is an example of how one could achieve this: https://dspguru.com/dsp/howtos/how-to-interpolate-fft-peak/
The idea is simple: information gets lost in the process of creating MEL spectrograms, some components get shifted, etc., and my goal was to see whether some kind of preprocessing could help the deep network learn. But I can't shape this newly formed input to fit an RNN. A MEL spectrogram is 2D data with a fixed number of bins, let's say 39. This new input can have an arbitrary number of components for each time step. Each component is represented by a tuple of frequency and amplitude, both of which are continuous variables. The frequency range for each time step can vary, so I can't adopt a fixed number of bins: the first index for one time step could hold a frequency value of, say, 5 Hz, while in the next time step it could very well be a few hundred Hz. Could a convolutional network be used as a tool before the RNN? Is there any other solution? Thanks!

Difference between the effect of episodes and time steps in DQN, and where the experience replay is updated

In DeepMind's DQN paper, there are two loops: one over episodes and one over time steps within each episode (one for training and one for the different time steps of a run). Am I right?
Since nothing is done in the outer loop except initialization and resetting to the conditions of the first step, what is the difference between the two?
For instance, what differences should we expect between case 1, running 1000 episodes of 400 time steps each, and case 2, running 4000 episodes of 100 time steps each?
(Is the difference that the second one has a better chance of escaping a local minimum, or something similar? Or are both the same?)
Another question: where is the experience replay updated?
For your first question: the answer is yes, there are two loops, and they do have differences.
You have to think about the true meaning of an episode. In most cases, we can consider each episode a 'game', and a 'game' needs to have an end. We need to do our best to let every game end within the length of an episode (imagine what you can learn if you can never get out of a labyrinth game). The Q values of DQN are an approximation of 'current reward' + 'discounted future rewards', and you need to know when the future ends to make a better approximation.
So assume we usually take 200 steps to finish the game; then an episode of 100 time steps is very different from an episode of 400 time steps.
As for the experience replay update: it happens at every time step. I don't quite understand what you're asking; if you explain your question in more detail, I can try to answer it.
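The two loops and the per-step replay update can be sketched schematically. This is an illustrative skeleton, not DeepMind's code: the Q-network, target network, and epsilon-greedy policy are elided, and the environment interface (`reset`, `sample_action`, `step`) is an assumption.

```python
import random
from collections import deque

def train(env, num_episodes, max_steps, buffer_size=10000, batch_size=32):
    replay = deque(maxlen=buffer_size)           # experience replay buffer
    for episode in range(num_episodes):          # outer loop: reset only
        state = env.reset()
        for t in range(max_steps):               # inner loop: the real work
            action = env.sample_action()         # epsilon-greedy in practice
            next_state, reward, done = env.step(action)
            # The replay buffer is updated here, at EVERY time step.
            replay.append((state, action, reward, next_state, done))
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                # a gradient step on the Q-network would use `batch` here
            state = next_state
            if done:                             # game over: back to outer loop
                break
    return replay

class ToyEnv:
    """Trivial stand-in environment: every episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return 0
    def sample_action(self):
        return random.randint(0, 1)
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5
```

Note that the outer loop contributes nothing to learning except `reset()`; everything else, including the replay append, lives in the inner time-step loop, which is exactly the point of the answer above.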

How to optimize my pathfinding algorithm on a GTFS network

Intro
I'm using a slightly modified GTFS database.
I have a first step algorithm that given two geographical locations provides:
the list of stops around departure and arrival
the list of routes that connects those list of stops
The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.
My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).
Database
The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):
stops: stop information (geolocation, name, etc)
stop_times: timetable
trips: itinerary taken by a vehicle (bus, metro, etc)
routes: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)
I have added the following tables
stop_connections: stop to stop connections (around 1 to 20)
stops_routes: lists the available routes at every stop
Here's the table row count in a city where I get slow results (Paris, France):
stops: 28k
stop_times: 12M
trips: 513k
routes: 1k
stop_connections: 365k
stops_routes: 227k
Algorithm
The first step of my algo takes two latitude/longitude points as input, and provides:
the list of stops at each location
the routes that can be used to connect those stops (with up to two connections)
The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.
This is the part that I'm trying to optimize. Here's how I'm querying the database:
My search terms (green in the picture):
one departure stop
several arrival stops (1 to 20)
allowed routes at departure, at first connection and on last trip
service ID (not relevant here, can be ignored)
Here's what I do now:
Start from a stop => get timetable => get trips => get routes; filter on allowed routes.
Connect the arrival stops of the first trip to a list of possible stops using stop_connections
Repeat from step 1 two times so that I have 3 trips/2 connections
The problem
This is working fine in some cases, but it can be very slow in others. Usually, as soon as I join the timetable or the stop connections, there is a 10x increase in the number of returned rows. Since I'm joining these tables 8 times, there are potentially 10^8 rows for the engine to search.
Now I'm sure that I can get this to be more efficient.
My problem is that the number of rows increases at every join, and the arrival stop selection happens at the very end.
That is, I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip can I filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.
Optimizations
Here's what I tried/thought of:
1. Inner join stops_routes when joining stop_connections
Only select stops at a connection that lead to the allowed routes at next trip.
This is sometimes efficient when there is a lot of connections and not all the connected stops are interesting (some connected stop might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.
2. Partition the stop_times table
Create a smaller copy of stop_times that contains only the timetable of the next two hours or so. Indeed, having the database engine search the timetable up to, say, 10pm when my trip starts at 8am is useless; keeping only 8am-10am is enough and much faster.
This is very efficient, because it dramatically decreases the number of rows to be searched.
I have implemented this with success, it decreased the search time by a factor of about 10x or even 100x.
3. Identify 'good' and 'bad' routes
There are usually, in a metropolitan area, large routes that are very useful when travelling long distances. But these routes aren't the best option when travelling short distances. A person who knows their city's public transportation system can quickly tell that, from one neighborhood to another, the best option is a specific route.
However, this is very difficult to automate and requires customization for every city.
I plan to make this algorithm completely independent of the city, so I'm not really willing to go down that road.
4. Use crowdsourcing to identify paths that work well
The first search is slow, but the information taken from it can be used to serve fast result to the next person with a similar journey.
However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.
I don't know if this is a good idea. I haven't implemented it.
Next
I'm running out of ideas. I know this is not a programming question, but rather a request of ideas on an algorithm. I hope this falls into the SO scope.
Having it on a network makes things a little bit interesting, but fundamentally, you're doing pathfinding, which is a slow process. You're running into the exponential nature of the problem, and doing so with only 3 connections.
I have a couple of suggestions that you can perhaps use while doing this with MySQL, and a couple that are likely not implementable within it.
Rather than partitioning the timetable, only take the next time for any given route. If you're leaving at 8 AM, you're correct, only looking at routes from 8-10 is better than looking at them all. However, if there's a route from A-B that leaves at 8:20, 8:40, 9:00, 9:15, 9:25, 9:45... there is zero reason to take them all: just take the first arrival time for any given route, since it's strictly better than the rest.
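That per-route pruning can be sketched in plain code; in SQL it would amount to a `GROUP BY route_id` with `MIN(departure_time)`. The function and field names here are illustrative, not from the GTFS schema.

```python
def next_departure_per_route(departures, leaving_at):
    """Keep only the earliest departure at or after `leaving_at` per route.

    departures: iterable of (route_id, departure_time) pairs,
    with times as "HH:MM" strings (lexicographic order == time order).
    Returns {route_id: earliest_departure_time}.
    """
    best = {}
    for route_id, dep in departures:
        if dep >= leaving_at and (route_id not in best or dep < best[route_id]):
            best[route_id] = dep
    return best

timetable = [("A-B", "08:20"), ("A-B", "08:40"), ("A-B", "09:00"),
             ("A-C", "08:05"), ("A-C", "08:50")]
print(next_departure_per_route(timetable, "08:00"))
# {'A-B': '08:20', 'A-C': '08:05'}
```

The later A-B departures are dropped entirely: taking the 08:20 dominates taking the 08:40 or 09:00 on the same route, so they can never appear in an optimal journey.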
I presume you are pruning any routes that return to an already-visited location? If not, you perhaps should be: they're not useful for you. This may be somewhat difficult to do within the SQL framework.
Depending on its coverage, you could perhaps find a path using the (much smaller) routes table, and then find the best implementation of the top working paths from the trips table.
This is likely impossible within the framework of SQL, but the thing that makes most decent pathfinding algorithms fast is that they use a heuristic to search. Your search goes down every possible route -- it would be a lot faster to first look down the route that leads in the right direction. If it doesn't pan out, less likely directions are picked. The key here is that as soon as you have a result, you return it -- you effectively pruned every route you didn't yet search by the time you returned an answer.
Pre-calculated preferred routes: you suggest this would require human intervention, but I counter that you could do it computationally. Spend the time properly searching for routes from various points to various other points, and collect statistics on how the routes worked. I would expect that you will find things allowing you to build an "anywhere over here to anywhere over there is going to use this intermediate path" table -- your problem is reduced from "find a path from A to B" to "find a path from A to C, followed by a path from D to B". Doing this has the potential of making you find sub-optimal routes (as you are making an assumption from the precalculated statistics), but it may let you find that sub-optimal route much faster. On a mesh layout it will not work at all well; on a hub layout it will work excellently.
Thanks to zebediah49, I have implemented the following algorithm:
0. Lookup tables
First, I created an ID on the trips table that uniquely identifies each trip's path. It is based on the sequence of stops taken, so two trips with the same ID are guaranteed to take exactly the same route.
I called this ID trip_type.
I have improved my stop_connections table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.
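The trip_type idea can be sketched as a derived key over the ordered stop sequence. The hashing scheme below is an assumption for illustration, not necessarily the author's implementation; any injective encoding of the ordered sequence works.

```python
import hashlib

def trip_type(stop_ids):
    """Derive a stable ID from the ordered sequence of stop IDs.

    Two trips get the same trip_type if and only if they serve
    exactly the same stops in the same order.
    """
    key = ",".join(str(s) for s in stop_ids)
    return hashlib.sha1(key.encode()).hexdigest()[:16]

# Two trips over the same physical path share an ID; an express
# variant that skips a stop gets a different one.
local = trip_type([101, 102, 103, 104])
express = trip_type([101, 103, 104])
print(local == trip_type([101, 102, 103, 104]))  # True
print(local == express)                          # False
```

Grouping by such a key is what allows step 1 below to keep only one representative trip per path, instead of re-examining every departure of an identical itinerary.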
1. Get trips running from the departure stop(s)
Limit those trips to only 1 per trip type (group by trip_type)
2. Get arrival stops from these trips
Select only the best trip if there are two trips reaching the same stop
3. Get connected stops from these arrival stops
Select only the best connection if there are >1 stops that are connected to the same stop
4. Repeat from step 1
I have split this into several subqueries and temporary tables, because I can easily group and filter the best stops/trips at each step. This ensures that the minimum number of searches is sent to the SQL server.
I have stored this algorithm in an SQL procedure that does all of this in a single SQL statement:
call Get2CJourneys(dt, sd, sa, r1, r2, r3)
Where:
dt: departure time
sd: stops at departure point
sa: stops at arrival point
r1, r2, r3: allowed routes for the 1st, 2nd and 3rd trips
The procedure call returns interesting results in under 600 ms, where my previous algorithm took several minutes to return the same results.
Expanding on zebediah49's fourth point, you can precompute the heading of a route, e.g. a route going due north has a bearing of 0, due east 90, due south 180, due west 270. Only return routes whose bearings are within, say, +/- 15 degrees (modulo 360) of the as-the-crow-flies bearing (or +/- 30 if the +/- 15 query doesn't return any hits).
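A sketch of that bearing filter, using the standard compass convention (north = 0 degrees, east = 90) and the usual initial great-circle bearing formula; the function names are mine:

```python
import math

def bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360

def within(route_bearing, target_bearing, tolerance):
    """True if the two bearings differ by at most `tolerance` degrees,
    accounting for wraparound at 360 (e.g. 350 and 10 differ by 20)."""
    diff = abs(route_bearing - target_bearing) % 360
    return min(diff, 360 - diff) <= tolerance
```

The route's own bearing would be precomputed once from its first and last stop coordinates; the query then compares it against `bearing(departure, arrival)` with `tolerance=15`, retrying with 30 on an empty result.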

Implementing Dijkstra's algorithm using CUDA in c

I am trying to implement Dijkstra's algorithm using CUDA. I found code that does this using MapReduce at this link: http://famousphil.com/blog/2011/06/a-hadoop-mapreduce-solution-to-dijkstra%E2%80%99s-algorithm/ but I want to implement something similar in CUDA using shared and global memory. Please tell me how to proceed, as I am new to CUDA. I don't know whether it is necessary to provide the input as a matrix on both the host and the device, or what operation I should perform in the kernel function.
What about something like this? (Disclaimer: this is not a MapReduce solution.)
Let's say you have a graph G with N nodes and an adjacency matrix A, with entries A[i,j] giving the cost of going from node i to node j.
This variant of Dijkstra's algorithm maintains a vector V denoting a frontier, where V[i] is the current minimum distance from the origin to node i. In Dijkstra's algorithm this information would be stored in a heap and popped off the top of the heap on every loop iteration.
Running the algorithm now starts to look a lot like matrix algebra, in that one simply takes the vector and applies the adjacency matrix to it using the following update:
V[i] <- min{ V[j] + A[j,i] | j in Nodes }
for all values of i in V. This runs as long as there are updates to V (which can be checked on the device; there is no need to copy V back and forth to check!). Also, store the transposed version of the adjacency matrix to allow sequential reads.
At most, this will run for a number of iterations corresponding to the longest non-looping path through the graph.
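As a CPU sketch of that relaxation (each V[i] update is independent, which is what would map onto CUDA threads; this is illustrative Python, not device code):

```python
INF = float("inf")

def min_plus_shortest_paths(A, origin):
    """Relaxation V[i] <- min over j of (V[j] + A[j][i]), repeated to a fixed point.

    A is an N x N cost matrix with INF where there is no edge and 0 on
    the diagonal. Each V[i] update reads only the previous V, so all N
    updates per iteration can run in parallel (one thread per i on a GPU).
    """
    n = len(A)
    V = [INF] * n
    V[origin] = 0
    updated = True
    while updated:  # at most "longest non-looping path" iterations
        updated = False
        new_V = list(V)
        for i in range(n):
            best = min(V[j] + A[j][i] for j in range(n))
            if best < new_V[i]:
                new_V[i] = best
                updated = True
        V = new_V
    return V

# 3-node example: 0->1 costs 1, 1->2 costs 2, 0->2 costs 5 directly.
A = [[0, 1, 5],
     [INF, 0, 2],
     [5, INF, 0]]
print(min_plus_shortest_paths(A, 0))  # [0, 1, 3]
```

The inner `min` over j is where the transposed adjacency matrix pays off on a GPU: thread i reads column i of A, which is row i of the transpose, giving coalesced sequential reads.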
The interesting question now becomes how to distribute this across compute blocks, but it seems obvious to shard based on row indexes.
I suggest you study these two prominent papers on efficient graph processing on GPUs. The first can be found here. It's rather straightforward and basically assigns one warp to process a vertex and its neighbors. You can find the second one here; it is more complicated. It efficiently produces the queue of next-level vertices in parallel, thus reducing load imbalance.
After studying the above articles, you'll have a better understanding of why graph processing is challenging and where the pitfalls are. Then you can write your own CUDA program.