I am developing an application that shows the shortest route using public transport (currently only buses). It should include sections where one walks to the next stop rather than taking another bus, if that is shorter.
What should the data structure for the map be? I thought of a graph with nodes for bus stops and edges weighted by distance.
Even once I have found the shortest path using an algorithm such as Dijkstra's, how do I work the walking sections into the logic?
Without a lot of extra information, it's difficult to give you a great answer to this question, but let me hit some basics. This should be enough to get you going, but then you're going to need to do additional work to develop your solution.
In general, your data structure is going to be something like nodes that represent destinations or waypoints (like a bus stop, or an address). Your relationships will be modes of transportation with associated costs. For example, you can get from point/node A to point/node B via walking, or the bus. Those are two different relationships, with different "costs" in terms of time and money.
In general, you'll want to use a "weighted shortest path" algorithm to find the best way from point A to point B. Neo4j gives you a shortest path function, but in your case you'll need to assign weights to your relationships, and then calculate the shortest path not based on the number of "hops" through the graph, but based on some overall cost metric (time, money, whatever).
Ian Robinson wrote a great post on how to do weighted shortest paths in Neo4j, so you should follow a template like that as a starting point.
You have a bunch of design questions to answer though. Do you want the shortest path in terms of time, money, effort, or some combination? The answer to that will affect your graph design, and your query strategy.
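To make the "two relationships between the same pair of nodes" idea concrete, here is a minimal sketch in plain Python rather than Neo4j. The stop names, the minutes and the helper names (`build_graph`, `cheapest_route`) are all invented for illustration, and the cost is assumed to be travel time only.

```python
import heapq

# Hypothetical network: each edge is (from_stop, to_stop, mode, minutes).
# The stop names and times are invented for illustration.
EDGES = [
    ("A", "B", "bus", 10),
    ("B", "C", "bus", 15),
    ("B", "C", "walk", 6),   # walking B -> C beats taking the second bus
    ("A", "C", "bus", 30),
]


def build_graph(edges):
    """Adjacency list: node -> list of (neighbour, mode, cost)."""
    graph = {}
    for src, dst, mode, cost in edges:
        graph.setdefault(src, []).append((dst, mode, cost))
        graph.setdefault(dst, [])
    return graph


def cheapest_route(graph, start, goal):
    """Dijkstra's algorithm; returns (total_cost, [(mode, from, to), ...])."""
    best = {start: 0}
    came_from = {}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, mode, step in graph[node]:
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                came_from[nxt] = (node, mode)
                heapq.heappush(queue, (new_cost, nxt))
    if goal not in best:
        return None
    legs, node = [], goal
    while node != start:
        prev, mode = came_from[node]
        legs.append((mode, prev, node))
        node = prev
    return best[goal], legs[::-1]


total, legs = cheapest_route(build_graph(EDGES), "A", "C")
print(total, legs)
```

Because walking and bus legs are just edges with different weights, the walking section needs no special handling in the search itself. If you later want to mix time and money, the weight would become some combination of the two, which is exactly the design question above.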
I use the Google Maps Distance Matrix API to estimate the distance between two points. The problem is that it uses traffic as a parameter to suggest a route, which ends up giving a distance greater than the average. And since my application is mainly used in smaller cities, drivers end up following other routes of their own anyway.
So I would like the Matrix API call to return just the distance of the shortest route.
This is not possible in Google Maps API. I can see that the feature request has already been filed, but unfortunately rejected by Google:
https://issuetracker.google.com/issues/35826943
Here is the answer of Google rep:
The Google Maps API team has reviewed the request for returning the shortest routes for driving directions, as opposed to the fastest routes. After much consideration, we have decided to keep the existing behavior and not implement shortest routes.
We understand there is demand for this feature; however, we believe that using the shortest route rather than the fastest route is not a good idea in practice. When the shortest route is not the fastest, it is likely to be a lower quality route in terms of time, fuel efficiency and sometimes even personal safety. We think these factors are more important to the majority of drivers on the road.
Our primary goal with driving directions is to save time for drivers. Therefore, we do not plan to offer the shortest routes in Google Maps or the Directions API.
There are a few workarounds that could potentially yield shorter routes, but we have found that they have significant drawbacks. We explain these below to clarify why we recommend against them.
Request routes in both directions.
Directions from A to B may not yield a feasible route from B to A due to situations like one-way streets, turn restrictions and different locations of highway exits. Requesting routes in both directions and taking the shortest route may yield a route that is not usable in the one direction.
Request alternative routes.
Adding alternatives=true to the request and picking the shortest route can yield a shorter route than that returned by default.
However, these alternative routes are not generally stable (may change over time as short-term road conditions change) nor guaranteed to include the shortest route. This means that the shortest route may still not be available, and also the shortest route found by this approach may change over time, giving an impression of instability.
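For completeness, the "request alternative routes" workaround amounts to something like the sketch below. It assumes you already have the parsed JSON of a Directions API response requested with alternatives=true, and that each route carries `legs` with a `distance.value` in metres, which is how I understand the response format. The sample data and the helper names are invented, and the stability caveats quoted above still apply.

```python
def route_length_m(route):
    """Total length of one route in metres (sum over its legs)."""
    return sum(leg["distance"]["value"] for leg in route["legs"])


def shortest_route(response):
    """Return the shortest route in a parsed Directions response, or None."""
    routes = response.get("routes", [])
    return min(routes, key=route_length_m) if routes else None


# Hand-written sample shaped like a Directions response requested with
# alternatives=true; the summaries and numbers are invented.
sample = {
    "status": "OK",
    "routes": [
        {"summary": "Ring road", "legs": [{"distance": {"value": 8200},
                                           "duration": {"value": 540}}]},
        {"summary": "Main St", "legs": [{"distance": {"value": 6100},
                                         "duration": {"value": 660}}]},
    ],
}

best = shortest_route(sample)
print(best["summary"], route_length_m(best))
```

Note that this picks the shortest of whatever alternatives happen to be returned, which is not necessarily the shortest route that exists.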
I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they take the user's location? Do they share or sell data with third parties?
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
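As a rough illustration of one-vs.-rest with naive Bayes, here is a dependency-free sketch. The four "policies", the label names and the class and function names are all invented. A real project would label far more documents and would probably use an existing library rather than this hand-rolled classifier.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BinaryNaiveBayes:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, flags):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(flags)
        for doc, flag in zip(docs, flags):
            self.counts[flag].update(tokenize(doc))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, words, flag):
        total = sum(self.counts[flag].values()) + len(self.vocab)
        prior = (self.n_docs[flag] + 1) / (sum(self.n_docs.values()) + 2)
        score = math.log(prior)
        for w in words:
            if w in self.vocab:  # words never seen in training are ignored
                score += math.log((self.counts[flag][w] + 1) / total)
        return score

    def predict(self, doc):
        words = tokenize(doc)
        return self._log_score(words, True) > self._log_score(words, False)


def train_one_vs_rest(docs, label_sets):
    """One binary classifier per label (the one-vs.-rest construction)."""
    labels = sorted(set().union(*label_sets))
    return {
        label: BinaryNaiveBayes().fit(docs, [label in s for s in label_sets])
        for label in labels
    }


def predict_labels(models, doc):
    return {label for label, model in models.items() if model.predict(doc)}


# Tiny invented training set; real use needs many hand-labelled policies.
docs = [
    "we collect your location to show nearby stores",
    "we share your data with third parties and partners",
    "we collect your location and share it with third parties",
    "we only store your email address",
]
label_sets = [{"location"}, {"sharing"}, {"location", "sharing"}, set()]

models = train_one_vs_rest(docs, label_sets)
print(predict_labels(models, "your location may be collected"))
```

Each label gets its own independent yes/no model, so a document can come back with zero, one or several labels.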
A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
I would approach this as a machine learning problem where you are trying to classify things in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether that document uses that info or not. Choose your features, train your classifier, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to stuff like SSN or location would finish it nicely.
Use the machine learning algorithm of your choice. Naive Bayes is very easy to implement and use, and would work OK as a first stab at the problem.
I'm working on a transportation model, and am about to do a travel time matrix between 5,000 points. Is there a free, semi-reliable way to calculate the travel times between all my nodes?
I think Google Maps has a limit on the number of queries I can make.
EDIT
I'd like to use an API such as Google Maps or a similar one, as they include data such as road directions, number of lanes, posted speed, type of road, etc.
EDIT 2
Please be advised that OpenStreetMap data is incomplete and not available for all jurisdictions outside the US.
Google Directions API restricts you to 2500 calls per day. Additionally, terms of service stipulate that you must only use the service "in conjunction with displaying the results on a Google map".
You may be interested in OpenTripPlanner, an in-development project which can do multi-modal routing, and Graphserver on which OpenTripPlanner is built.
One approach would be to use OpenStreetMap data with Graphserver to generate Shortest Path Trees from each node.
As that's 12,497,500 unordered pairs (about 25 million if travel times differ by direction), I'm pretty sure you'll hit some sort of limit if you attempt to use Google Maps for all of them. How accurate do your results need to be, and how far are you travelling?
I might try to generate a crude map with travel speeds on it (e.g. mark off interstates as fast, yadda yadda) then use some software to calculate how long it would take from point to point. One could visualize it as an electromagnetic fields problem, where you're trying to calculate the resistance from point to point over a plane with varying resistance (interstates are wires, lakes are open circuits...).
If you really need all these routes accurately calculated and stored in your database, it sounds like (and I would believe) that you are going to have to spend the money to obtain this. As you can imagine, this is expensive to develop and there should be remuneration.
I would, however, probe a bit about your problem:
Do you really need all 12.5 million or so distances in a database? What if you asked Google for them as you needed them, and then cached them (if allowed)? I've had web applications like this where, because of the slow traffic ramp-up pattern, I was able to leverage free services early on to vet the idea.
Do you really need all 5000 points? Or could you pick the top 100 and have a more tractable problem?
Perhaps there is some hybrid where you store distances between big cities and do more estimates for shorter distances.
Again, I really don't know what your problem is, but maybe thinking a bit outside the box will help you find an easier solution.
You might have to go for some heuristics here. Maybe you can estimate travel time based on a few factors like geometric distance and some features about the start and end points (urban vs rural areas, country, ...). You could get a few distances, try to fit your parameters on a subset of them and see how well you're able to predict the other ones. My prediction would be, for example, that travel times approach linear dependence from distance as distance grows larger, in many cases.
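A bare-bones version of that fitting idea might look like the following. It uses straight-line (haversine) distance as the only feature and a plain least-squares line. The sample pairs are invented, and `estimate_minutes` is a hypothetical helper rather than anything from a routing library.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented sample: (straight-line km, measured minutes) for a few pairs
# whose real travel time was looked up from a routing service.
known = [(10, 17), (50, 45), (100, 80), (200, 150)]
intercept, slope = fit_line([d for d, _ in known], [t for _, t in known])


def estimate_minutes(a, b):
    """Predicted travel time between two (lat, lon) points."""
    return intercept + slope * haversine_km(a, b)


print(round(intercept, 2), round(slope, 2))
```

To check how well this predicts, fit on part of your retrieved travel times and compare the estimates against the rest. Extra features such as urban vs rural would turn this into a multi-variable regression.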
I know it's messy, but hey, you're trying to estimate 12.5 million data points (or whatever the amount is) :)
You might also be able to incrementally add knowledge from already-retrieved "real" travel times by finding close points to the ones you're looking for:
get closest points StartApprox, EndApprox to starting and end position such that you have a travel time between StartApprox and EndApprox
compute distances StartError, EndError between start and StartApprox, end and EndApprox
if StartError+EndError>Distance(StartApprox, EndApprox) * 0.10 (or whatever your threshold) -> compute distance via API (and store it), else use known travel time plus overhead time based on StartError+EndError
(if you have 100 addresses in NY and 100 in SF, all the values are going to be more or less the same (ie the difference between them is probably lower than the uncertainty involved in these predictions) and such an approach would keep you from issuing 10000 queries where 1 would do)
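Here is one way the three steps above could be sketched. Everything named here is an assumption: `lookup` stands in for the real routing API call, `dist` is a planar distance used only to keep the example short, and picking the stored pair with the smallest combined endpoint error is just one reading of "closest points".

```python
import math


def dist(p, q):
    """Planar distance; stands in for whatever distance measure you use."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def travel_time(start, end, known, lookup, threshold=0.10, overhead_per_unit=1.0):
    """known maps (start_point, end_point) -> travel time already retrieved.

    lookup(start, end) is a hypothetical callable wrapping the real routing API.
    """
    if known:
        # Stored pair whose endpoints are jointly closest to the requested ones.
        s_apx, e_apx = min(
            known, key=lambda k: dist(start, k[0]) + dist(end, k[1]))
        error = dist(start, s_apx) + dist(end, e_apx)
        if error <= threshold * dist(s_apx, e_apx):
            # Close enough: reuse the known time plus an overhead for the gap.
            return known[(s_apx, e_apx)] + overhead_per_unit * error
    # No usable neighbour: query the API and remember the answer.
    known[(start, end)] = lookup(start, end)
    return known[(start, end)]


calls = []


def fake_lookup(a, b):
    """Stand-in for a real API call; records that it was used."""
    calls.append((a, b))
    return 2.0 * dist(a, b)


known = {}
t1 = travel_time((0, 0), (100, 0), known, fake_lookup)   # goes to the "API"
t2 = travel_time((1, 0), (101, 0), known, fake_lookup)   # reuses the stored pair
print(t1, t2, len(calls))
```

The linear scan over `known` is fine for a sketch. With millions of stored pairs you would want a spatial index instead.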
Many GIS software packages have routing algorithms, if you have the data... Transportation data can be fairly spendy.
There are some other choices of sources for planning routes. Is this something to be done repeatedly, or a one-time process? Can this be broken up into smaller sub-sets of points? Perhaps you can use multiple routing sources and break up the data points into segments small enough for each routing engine.
Here are some other choices from a quick Google search:
Wikipedia
Route66
Truck Miles
I've got a list of objects (probably not more than 100), where each object has a distance to all the other objects. This distance is merely the added absolute difference between all the fields these objects share. There might be few (one) or many (dozens) of fields, thus the dimensionality of the distance is not important.
I'd like to display these points in a 2D graph such that objects which have a small distance appear close together. I'm hoping this will convey clearly how many sub-groups there are in the entire list. Obviously the axes of this graph are meaningless (I'm not even sure "graph" is the correct word to use).
What would be a good algorithm to convert a network of distances into a 2D point distribution? Ideally, I'd like a small change to the distance network to result in a small change in the graphic, so that incremental progress can be viewed as a smooth change over time.
I've made a small example of the sort of result I'm looking for:
Example Graphic http://en.wiki.mcneel.com/content/upload/images/GraphExample.png
Any ideas greatly appreciated,
David
Edit:
It actually seems to have worked. I treat the entire set of values as a 2D particle cloud, construct inverse square repulsion forces between all particles and linear attraction forces based on inverse distance. It's not a stable algorithm; the result tends to spin violently whenever an additional iteration is performed, but it does always seem to generate a good separation into visual clusters:
http://en.wiki.mcneel.com/content/upload/images/ParticleCloudSolution.png
I can post the C# code if anyone is interested (there's quite a lot of it sadly)
Graphviz contains implementations of several different approaches to solving this problem; consider using its spring model graph layout tools as a basis for your solution. Alternatively, its site contains a good collection of source material on the related theory.
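To give a feel for what a spring model does, here is a small standalone sketch. It is not Graphviz's implementation and not the particle system from the question. It treats every pair of objects as a spring whose rest length is their given distance and nudges points until the springs relax. The function name, the parameters and the two-group example are invented.

```python
import math
import random


def spring_layout(target, iterations=1000, rate=0.5, seed=0):
    """Place len(target) points in 2D so pairwise distances approach target[i][j]."""
    rng = random.Random(seed)
    n = len(target)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy) or 1e-9
                # Positive when the pair is too far apart (pull together),
                # negative when too close (push apart).
                stretch = (d - target[i][j]) / d
                fx += stretch * dx
                fy += stretch * dy
            # Average the spring forces so the step size does not grow with n.
            pos[i][0] += rate * fx / (n - 1)
            pos[i][1] += rate * fy / (n - 1)
    return pos


# Invented example: two groups of three objects, close within a group
# (distance 1) and far between groups (distance 10).
groups = [0, 0, 0, 1, 1, 1]
target = [[0 if i == j else (1 if groups[i] == groups[j] else 10)
           for j in range(6)] for i in range(6)]
layout = spring_layout(target)
for point in layout:
    print(round(point[0], 2), round(point[1], 2))
```

Because a general distance table cannot always be embedded exactly in 2D, the result is a compromise rather than a faithful reproduction of every distance.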
The previous answers are probably helpful, but unfortunately given your description of the problem, it isn't guaranteed to have a solution, and in fact most of the time it won't.
I think you need to read up on cluster analysis quite a bit, because there are algorithms to sort your points into clusters based on a relatedness metric, and then you can use Graphviz or something like that to draw the results. http://en.wikipedia.org/wiki/Cluster_analysis
One I quite like is a 'minimum-cut partitioning algorithm', see here: http://en.wikipedia.org/wiki/Cut_(graph_theory)
You might want to Google around for terms such as:
automatic graph layout; and
force-based algorithms.
GraphViz does implement some of these algorithms, not sure if it includes any that are useful to you.
One cautionary note -- for some algorithms small changes to your graph content can result in very large changes to the graph.
I've always been intrigued by Map Routing, but I've never found any good introductory (or even advanced!) level tutorials on it. Does anybody have any pointers, hints, etc?
Update: I'm primarily looking for pointers as to how a map system is implemented (data structures, algorithms, etc).
Take a look at the OpenStreetMap project to see how this sort of thing is being tackled in a truly free software project using only user-supplied and licensed data. They also have a wiki containing material you might find interesting.
A few years back the guys involved were pretty easy-going and answered lots of questions I had, so I see no reason why they wouldn't still be a nice bunch.
A* is actually far closer to production mapping algorithms. It requires quite a bit less exploration than Dijkstra's original algorithm.
By Map Routing, you mean finding the shortest path along a street network?
Dijkstra's shortest-path algorithm is the best known. Wikipedia has a decent intro: http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
There's a Java applet here where you can see it in action: http://www.dgp.toronto.edu/people/JamesStewart/270/9798s/Laffra/DijkstraApplet.html and Google will lead you to source code in just about any language.
Any real implementation for generating driving routes will include quite a bit of data on the street network that describes the costs associated with traversing links and nodes: road network hierarchy, average speed, intersection priority, traffic signal linking, banned turns, etc.
Barry Brumitt, one of the engineers of Google maps route finding feature, wrote a post on the topic that may be of interest:
The road to better path-finding
11/06/2007 03:47:00 PM
Instead of learning the API of each map service provider (like Gmaps or Ymaps), it's good to learn Mapstraction.
"Mapstraction is a library that provides a common API for various javascript mapping APIs"
I would suggest you go to the URL and learn a general API. There is a good amount of How-Tos too.
I've yet to find a good tutorial on routing, but there is lots of code to read:
There are GPL routing applications that use OpenStreetMap data, e.g. Gosmore, which works on Windows (+ mobile) and Linux. There are a number of interesting applications using the same data, but Gosmore has some cool uses, e.g. interfacing with websites.
The biggest problem with routing is bad data, and you never get good enough data. So if you want to try it keep your test very local so you can control the data better.
From a conceptual point of view, imagine dropping a stone into a pond and watching the ripples. The routes would represent the pond and the stone your starting position.
Of course the algorithm would have to search some proportion of n^2 paths as the distance n increases. You would take your starting position and check all available paths from that point. Then recursively call for the points at the end of those paths and so on.
You can increase performance by not doubling back on a path, by not re-checking the routes at a point if it has already been covered, and by giving up on paths that are taking too long.
An alternative way is to use the ant pheromone approach, where ants crawl randomly from a start point and leave a scent trail, which builds up the more ants cross over a given path. If you send (enough) ants from both the start point and the end points then eventually the path with the strongest scent will be the shortest. This is because the shortest path will have been visited more times in a given time period, given that the ants walk at a uniform pace.
EDIT # Spikie
As a further explanation of how to implement the pond algorithm - potential data structures needed are highlighted:
You'll need to store the map as a network. This is simply a set of nodes and edges between them. A set of nodes constitutes a route. An edge joins two nodes (possibly both the same node), and has an associated cost such as distance or time to traverse the edge. An edge can be either bi-directional or uni-directional. It is probably simplest to just have uni-directional ones and double up for two-way travel between nodes (i.e. one edge from A to B and a different one for B to A).
By way of example imagine three railway stations arranged in an equilateral triangle pointing upwards. There are also a further three stations each halfway between them. Edges join all adjacent stations together, the final diagram will have an inverted triangle sitting inside the larger triangle.
Label nodes starting from bottom left, going left to right and up, as A,B,C,D,E,F (F at the top).
Assume the edges can be traversed in either direction. Each edge has a cost of 1 km.
Ok, so we wish to route from the bottom left A to the top station F. There are many possible routes, including those that double back on themselves, e.g. ABCEBDEF.
We have a routine say, NextNode, that accepts a node and a cost and calls itself for each node it can travel to.
Clearly if we let this routine run it will eventually discover all routes, including ones that are potentially infinite in length (eg ABABABAB etc). We stop this from happening by checking against the cost. Whenever we visit a node that hasn't been visited before, we put both the cost and the node we came from against that node. If a node has been visited before we check against the existing cost and if we're cheaper then we update the node and carry on (recursing). If we're more expensive, then we skip the node. If all nodes are skipped then we exit the routine.
If we hit our target node then we exit the routine too.
This way all viable routes are checked, but crucially only those with the lowest cost. By the end of the process each node will have the lowest cost for getting to that node, including our target node.
To get the route we work backwards from our target node. Since we stored the node we came from along with the cost, we just hop backwards building up the route. For our example we would end up with something like:
Node A - (Total) Cost 0 - From Node None
Node B - Cost 1 - From Node A
Node C - Cost 2 - From Node B
Node D - Cost 1 - From Node A
Node E - Cost 2 - From Node D / Cost 2 - From Node B (this is an exception as there is equal cost)
Node F - Cost 2 - From Node D
So the shortest route is ADF.
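The walkthrough above can be sketched in a few lines of Python. The edge list encodes the six-station triangle as described. The function names (`explore`, `route_to`) and the exact recursion order are my own choices, so the intermediate visits may differ from a hand trace even though the final table is the same.

```python
# The six-station example: A B C along the bottom, D E in the middle, F on top.
# Every edge costs 1 km and can be travelled in both directions.
LINKS = ["AB", "BC", "AD", "BD", "BE", "CE", "DE", "DF", "EF"]

graph = {}
for a, b in LINKS:
    graph.setdefault(a, []).append((b, 1))
    graph.setdefault(b, []).append((a, 1))


def explore(graph, start, target):
    """Return {node: (lowest cost found, node we came from)}."""
    best = {}

    def next_node(node, cost, came_from):
        if node in best and best[node][0] <= cost:
            return                      # already reached at least as cheaply
        best[node] = (cost, came_from)
        if node == target:
            return                      # no need to travel on from the target
        for neighbour, edge_cost in graph[node]:
            next_node(neighbour, cost + edge_cost, node)

    next_node(start, 0, None)
    return best


def route_to(best, target):
    """Walk the stored 'came from' links backwards from the target."""
    route = []
    node = target
    while node is not None:
        route.append(node)
        node = best[node][1]
    return "".join(reversed(route))


best = explore(graph, "A", "F")
route = route_to(best, "F")
print(route, best["F"][0])
```

On a large network this depth-first recursion re-visits nodes many times and can hit Python's recursion limit. Processing nodes in order of cost with a priority queue, as Dijkstra's algorithm does, avoids both problems.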
From my experience of working in this field, A* does the job very well. It is (as mentioned above) faster than Dijkstra's algorithm, but is still simple enough for an ordinarily competent programmer to implement and understand.
Building the route network is the hardest part, but that can be broken down into a series of simple steps: get all the roads; sort the points into order; make groups of identical points on different roads into intersections (nodes); add arcs in both directions where nodes connect (or in one direction only for a one-way road).
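As a rough sketch of those steps, assume each road already arrives as an ordered list of points with a one-way flag (the input format, names and coordinates are invented). Nodes are created at road ends and at any point shared by more than one road, and arcs carry the accumulated length between nodes.

```python
import math
from collections import Counter

# Hypothetical input: each road is an ordered list of (x, y) points plus a
# one-way flag. Real data would come from a map source.
ROADS = [
    {"points": [(0, 0), (1, 0), (2, 0)], "one_way": False},
    {"points": [(1, 0), (1, 1)], "one_way": True},
]


def build_network(roads):
    """Return {node: [(neighbour, length), ...]} with nodes at road ends
    and at points shared by more than one road."""
    uses = Counter(p for road in roads for p in set(road["points"]))
    arcs = {}

    def add_arc(a, b, length):
        arcs.setdefault(a, []).append((b, length))
        arcs.setdefault(b, [])

    for road in roads:
        pts = road["points"]
        last_node, length = pts[0], 0.0
        for prev, cur in zip(pts, pts[1:]):
            length += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
            if uses[cur] > 1 or cur == pts[-1]:
                add_arc(last_node, cur, length)
                if not road["one_way"]:
                    add_arc(cur, last_node, length)
                last_node, length = cur, 0.0
    return arcs


network = build_network(ROADS)
for node, out in network.items():
    print(node, out)
```

Real data rarely has exactly identical coordinates at intersections, so in practice the grouping step usually needs some snapping tolerance.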
The A* algorithm itself is well documented on Wikipedia. The key place to optimise is the selection of the best node from the open list, for which you need a high-performance priority queue. If you're using C++ you can use the STL priority_queue adapter.
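For readers who prefer code to the Wikipedia description, here is a compact sketch in Python, where `heapq` plays the role of the priority queue. The five nodes, their coordinates and the function names are invented. The heuristic is straight-line distance to the goal, which never overestimates as long as edge costs are road lengths.

```python
import heapq
import itertools
import math

# Hypothetical road nodes with planar coordinates (km); names are invented.
COORDS = {"A": (0, 0), "B": (2, 0), "C": (2, 2), "D": (0, 2), "E": (4, 2)}
ROADS = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]


def straight_line(a, b):
    (x1, y1), (x2, y2) = COORDS[a], COORDS[b]
    return math.hypot(x2 - x1, y2 - y1)


GRAPH = {}
for a, b in ROADS:                       # two-way roads: one arc per direction
    GRAPH.setdefault(a, []).append((b, straight_line(a, b)))
    GRAPH.setdefault(b, []).append((a, straight_line(a, b)))


def a_star(start, goal):
    """Return (cost, path); the open list is a heap keyed on cost + heuristic."""
    tie = itertools.count()              # tie-breaker so the heap never compares nodes
    open_list = [(straight_line(start, goal), next(tie), 0.0, start, None)]
    came_from = {}
    while open_list:
        _, _, cost, node, parent = heapq.heappop(open_list)
        if node in came_from:
            continue                     # already expanded via a path at least as cheap
        came_from[node] = parent
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return cost, path[::-1]
        for nxt, length in GRAPH[node]:
            if nxt not in came_from:
                new_cost = cost + length
                estimate = new_cost + straight_line(nxt, goal)
                heapq.heappush(open_list, (estimate, next(tie), new_cost, nxt, node))
    return None


cost, path = a_star("A", "E")
print(cost, path)
```

With the heuristic set to zero this degenerates into Dijkstra's algorithm, which is a handy way to check an implementation.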
Customising the algorithm to route over different parts of the network (e.g., pedestrian, car, public transport, etc.) or to favour speed, distance or other criteria is quite easy. You do that by writing filters that control, when building the network, which route segments are available and which weight is assigned to each one.
Another thought occurs to me regarding the cost of each traversal, though it would increase the time and processing power required to compute.
Example: There are 3 ways I can take (where I live) to go from point A to B, according to Google Maps. Garmin units offer each of these 3 paths in the Quickest route calculation. After traversing each of these routes many times and averaging (obviously there will be errors depending on the time of day, amount of caffeine, etc.), I feel the algorithms could take into account the number of bends in the road for a higher level of accuracy, e.g. a straight road of 1 mile will be quicker than a 1 mile road with sharp bends in it.
Not a practical suggestion but certainly one I use to improve the result set of my daily commute.