Route planning from Pt. A to a list of addresses - google-maps

I wonder if it's possible in Google Maps to plot the quickest
route from a specific address, Pt. A, to a list of destinations,
i.e. Pt. B, Pt. C, Pt. D, etc. And if that's possible, is it available
through the API? I'll probably need it in the app I'm developing.
Thanks, and apologies if this has been asked before!

You may want to check out this project:
Google Maps Fastest Roundtrip Solver
It is available under a GPL license.

The problem you've described is an example of the Traveling Salesman Problem. This is a famous problem because it's an example of the kind of problem that can't be solved efficiently with any known algorithm. That is, you can't come up with the absolute best answer efficiently, because the number of possible routes grows factorially. The number of possible orderings is n!, i.e. 5 x 4 x 3 x 2 x 1 when n = 5. Not a big deal in this case, when you are trying to solve for 5 cities (120 combinations), but even getting up only as far as 10 raises the number of possible combos to 3,628,800. Once you get to 100 nodes, you're counting your CPU time in years. This is why the "Fastest Roundtrip Solver" listed above only guarantees "optimal" solutions up to 15 points.
Having said all that, it can't be solved efficiently (a "solution" in this case means the one correct answer, as Gebweb says, the "optimal" answer), but you can come up with a pretty good answer, as long as you don't get hung up on it being the absolute, provably best one. If you look in the code, you'll notice that Gebweb's Fastest Roundtrip page switches to Ant Colony Optimization (not technically an algorithm, but rather a heuristic) once you get past 15 points. There's no sense in my repeating what he says better; look at his behind-the-scenes page.
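To make the scale of the problem concrete, here is a small Python sketch (my own illustration, not code from Gebweb's solver, which uses Ant Colony Optimization past 15 points): it brute-forces every ordering for a handful of points, and shows a simple nearest-neighbour heuristic you could fall back on when the factorial blow-up makes exhaustive search impractical. Distances are plain Euclidean here; with the Maps API you would plug in real driving times instead.

```python
import itertools, math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_length(points, order):
    # Total length of the open route visiting points in the given order
    return sum(dist(points[order[i]], points[order[i + 1]]) for i in range(len(order) - 1))

def brute_force(points):
    """Exact answer starting from point 0 -- checks all (n-1)! orderings, hopeless past ~12 points."""
    best = None
    for perm in itertools.permutations(range(1, len(points))):
        order = (0,) + perm
        length = route_length(points, order)
        if best is None or length < best[0]:
            best = (length, order)
    return best

def nearest_neighbour(points):
    """Greedy heuristic: always drive to the closest unvisited point. Fast, but not guaranteed optimal."""
    unvisited = set(range(1, len(points)))
    order, current = [0], 0
    while unvisited:
        current = min(unvisited, key=lambda i: dist(points[current], points[i]))
        order.append(current)
        unvisited.remove(current)
    return route_length(points, order), tuple(order)

pts = [(0, 0), (2, 3), (5, 1), (1, 7), (6, 6)]   # Pt. A plus four destinations
print(brute_force(pts))        # exact: only 4! = 24 orderings to check here
print(nearest_neighbour(pts))  # approximate, a tiny fraction of the work for large n
```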
Anyway, Daniel is right, this should do what you want, but I couldn't help pointing out that this is a more complex problem than it seems.

Trilateration different approaches and issues

Although there exist several posts about (multi)lateration, I would like to summarize some approaches and present some issues/questions to better clarify the approach.
It seems that there are two ways to detect the target location: a geometric/analytic approach (solving the equations directly with some trick) and a fitting approach that converts the non-linear system to a linear one.
With respect to the first one, I would like to ask a few questions.
Suppose we have perfect range measurements and consider the 2D case; the exact solution is the unique point at the intersection of the three circles. Can anyone point to a geometric solution for this case? I found this approach: https://math.stackexchange.com/questions/884807/find-x-location-using-3-known-x-y-location-using-trilateration
but it seems to fail when two anchors have the same y coordinate, as we can get a division by 0. Moreover, can this be extended to 3D?
The same solution can be extracted using the second approach:
form Ax = b and then recover x = A^-1 b, or use least squares, x = (A^T A)^-1 A^T b.
Please see http://www3.nd.edu/~cpoellab/teaching/cse40815/Chapter10.pdf
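For concreteness, here is a minimal numpy sketch of that linearised formulation (my own illustration of the Ax = b / least-squares idea, not code from that chapter). Subtracting the first circle equation from the others removes the quadratic terms; numpy.linalg.lstsq then returns the least-squares estimate, which is also why this approach still produces an answer when noisy ranges mean the circles have no common intersection point. The same code works in 3D if you pass (n, 3) anchors with n >= 4.

```python
import numpy as np

def trilaterate(anchors, ranges):
    """Least-squares (multi)lateration via linearisation.

    anchors: (n, d) array of known positions (d = 2 or 3, n >= d + 1)
    ranges:  (n,)   measured distances from the unknown point to each anchor

    Subtracting the first circle/sphere equation from the others gives a
    linear system A p = b in the unknown position p.
    """
    anchors = np.asarray(anchors, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (ranges[0] ** 2 - ranges[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solution
    return p

anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = np.array([3.0, 4.0])
ranges = [np.linalg.norm(true_pos - np.array(a)) for a in anchors]
print(trilaterate(anchors, ranges))   # ~[3. 4.]
```

In C/C++, a small least-squares solve with something like Eigen or LAPACK would play the same role.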
What about the case when the three circles have no common intersection? It seems that the second approach still finds a solution. Is this normal? How can this be explained?
What about the first approach when the range measurements are noisy? Does it find an approximate solution, or does it fail?
Considering 3D, it seems that at least 4 anchors are needed to provide a unique solution; with 3 anchors there can be 2 solutions. Can anyone provide the equations to find the two solutions? This can be useful because, even with two solutions, we may discard one by checking whether its values agree with our scenario, e.g. the GPS case, where we pick the solution located on the Earth. The least-squares approach, in contrast, would always provide a single solution, possibly the wrong one.
Do you know of any existing C/C++ library that implements some of these techniques, and maybe some more complex fitting functions such as non-linear ones?
Thank you
Regards

Project Euler 298 - there must be a correct answer? (only pastebinned code)

Project Euler has a paging file problem (though it's disguised in other words).
I tested my code (pastebinned so as not to spoil it for anyone) against the sample data and got the same memory contents and score as the problem. However, my scores are nowhere near consistently grouped. It asks for the expected difference in scores after 50 turns. A random sampling of scores:
1.50000000
1.78000000
1.64000000
1.64000000
1.80000000
2.02000000
2.06000000
1.56000000
1.66000000
2.04000000
I've tried a few of those as answers, but none of them have been accepted... I know some people have succeeded, so I'm really confused - what the heck am I missing?
Your problem is likely that you are not applying the definition of expected value.
You will have to run the simulation many times and, for each score difference, maintain the frequency of that occurrence, then take the weighted mean to get the expected value.
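As a hedged sketch of that weighted mean (illustrative only; run_one_game() is a placeholder for your own 50-turn simulation returning |L - R|):

```python
from collections import Counter
import random

def run_one_game() -> int:
    """Placeholder: your simulation should return |L - R| after 50 turns."""
    return random.choice([0, 1, 1, 2, 2, 2, 3])   # dummy score differences

def estimated_expected_value(trials: int = 1_000_000) -> float:
    freq = Counter(run_one_game() for _ in range(trials))
    # expected value = sum over outcomes of outcome * relative frequency
    return sum(diff * count for diff, count in freq.items()) / trials

print(round(estimated_expected_value(), 8))
```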
Of course, given that it is a Project Euler problem, there is probably a mathematical formula which can be used readily.
Yep, there is a correct answer. To be honest, Monte Carlo can theoretically home in on the expected value, given the law of large numbers. However, you won't want to try it here, because in practice each run of the simulation gives a slightly different result when rounded to eight decimal places (and I think that setting is exactly what deprives anybody of any chance of even thinking of using Monte Carlo). If you are lucky, one run will deliver the answer after lots of trials, given that you have submitted all the previous ones and failed. I think the captcha is the second way Project Euler makes you give up on any brute-force approach.
Well, I agree with Moron: you have to figure out "expected value" first. The principle of this problem is that you have to find a way to enumerate every possible "essential" outcome after 50 rounds. Each outcome has its own |L - R| and its own probability; sum the products and you will have the answer. Needless to say, a brute-force approach fails in most cases, and especially in this one. Fortunately, we have dynamic programming (DP), which is fast!
Basically, DP saves the computation results of each round as states and uses them in the next. Thus it avoids repeating the same computation over and over again. The difficult part of this problem is to find a way to represent a state, that is to say, how you would like to save your intermediate results. If you have solved problem 290 with DP, you can get some hints there about how to understand the problem and formulate a state.
Actually, that isn't even the hardest part. The hardest mental step is realizing that some memory states of the two players are numerically different but substantially equivalent, for example L:12345 R:12345 vs. L:23456 R:23456, or even vs. L:98765 R:98765. That is because the calls are random. That is also why I wrote possible "essential" outcomes: you can collapse such states into one, and only by doing so can your program finish in reasonable time.
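To illustrate that equivalence without spoiling anything, here is a small sketch (my own, and only one possible way to do it): because the next call is uniformly random over the letter set, consistently renaming letters across both memories changes nothing, so states can be reduced to a canonical form before being used as DP keys.

```python
def canonical(larry, robin):
    """Relabel letters by order of first appearance across both memories.

    States that differ only by a consistent renaming of letters collapse to
    the same key, which is what keeps the DP state space small.
    """
    relabel = {}
    def key(memory):
        out = []
        for letter in memory:
            if letter not in relabel:
                relabel[letter] = len(relabel)   # next unused label
            out.append(relabel[letter])
        return tuple(out)
    return key(larry), key(robin)

# Numerically different, substantially equivalent:
assert canonical((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)) == \
       canonical((9, 8, 7, 6, 5), (9, 8, 7, 6, 5))
```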
I would run your simulation a whole bunch of times and then take a weighted average of the |L - R| value over all the runs. That should get you closer to the expected value.
Just submitting one run as an answer is really unlikely to work. Imagine it was the expected value of a die roll: roll one die, score a 6, and submit that as the expected value.

How to correct the user input (kind of Google "Did you mean?")

I have the following requirements:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to build something like Google's "Did you mean", which will list all the possible matching values from my datastore. There is a similar but not the same question here; it did not answer my question.
My question: -
1) I think it is not advisable to store this data in an RDBMS, because then I won't have a usable filter in the SQL queries and will have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types Phranky, how do I match Franky? Do I have to loop through all the names?
I came across Levenshtein distance, which would be a good technique for finding the possible strings. But again, my question is: do I have to compute it against all 1 million values in my data store?
3) I know Google does it by watching user behavior, but I want to do it without watching user behavior, i.e. by using, say, distance algorithms, because the former method would require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
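If you want to see what that precomputation might look like, here is a simplified Soundex in Python (it follows the usual rules but skips some edge cases, so treat it as a sketch rather than a reference implementation):

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus three digits, zero-padded."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    digits, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":             # h and w do not separate equal codes
            prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163 -> same key, so an indexed lookup matches
```

You would store the four-character key in an indexed column next to each name and query on the key instead of scanning. One caveat: Soundex keeps the literal first letter, so the Phranky/Franky example from the question still yields different keys (P652 vs F652), which is one reason other answers here prefer Metaphone.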
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use that to calculate probable matches. (It's based on the Levenshtein distance.)
(Update: after having read Ben S's answer, I think using an existing solution, possibly aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
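A minimal sketch of that heuristic in Python (purely illustrative: the log format, the 30-second window, and the distance threshold are all assumptions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def possible_refinements(events, window_s=30, max_dist=2):
    """events: one user's chronologically sorted dicts like
    {"query": str, "ts": float, "results": int, "clicked": bool}."""
    suggestions = {}
    for first, second in zip(events, events[1:]):
        if (second["ts"] - first["ts"] <= window_s                     # short time window
                and (first["results"] == 0 or not first["clicked"])    # first attempt went nowhere
                and second["results"] > 0 and second["clicked"]        # second attempt worked
                and 0 < levenshtein(first["query"], second["query"]) <= max_dist):
            suggestions[first["query"]] = second["query"]   # candidate "did you mean"
    return suggestions
```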
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR '96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, mistyping a q in place of an e is more likely than mistyping an l in place of an e.
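A condensed sketch along the lines of what Norvig describes (not his exact code; "corpus.txt" stands in for whatever text you train on, whether Gutenberg books or your own name list with observed frequencies):

```python
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
TOTAL = sum(WORDS.values())

def P(word: str) -> float:
    """Probability of a word, estimated from corpus counts."""
    return WORDS[word] / TOTAL

def edits1(word: str):
    """Everything one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correction(word: str) -> str:
    """Most probable known word within edit distance 2 of the input."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=P)

print(correction("someting"))   # "something", given a reasonable corpus
```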
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own, rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, the words have to share a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false-positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address, if you so choose):
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith; however, if I typed smith it would not make sense to suggest sith. Determine an algorithm that measures the relative likelihood of a word and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
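To make the layering concrete, here is a hedged sketch (the particular sieves are only examples mirroring the ones mentioned above, and the levenshtein() helper is the same as in the earlier sketch; tune both to your data):

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(query: str, names, max_dist: int = 2):
    q = query.lower()
    # Sieve 1 (cheap, O(n)): lengths can't differ by more than the allowed edit distance.
    pool = [n for n in names if abs(len(n) - len(q)) <= max_dist]
    # Sieve 2 (cheap): same first letter, then same last letter.
    pool = [n for n in pool if n.lower()[0] == q[0]]
    pool = [n for n in pool if n.lower()[-1] == q[-1]]
    # Heavy step last: exact edit distance only on the small surviving set.
    scored = sorted((levenshtein(q, n.lower()), n) for n in pool)
    return [n for d, n in scored if d <= max_dist]

names = ["Franky", "Frank", "Frankie", "Smith", "Sean", "Shawn"]
print(suggest("Franly", names))   # ['Franky']
```

Note how the last-letter sieve already culls "Frank" and "Frankie" here; each sieve trades recall for speed, so choose them to match the kinds of errors you actually expect.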

Use of LOC to determine project size

How many lines of code (LOC) does it take to be considered a large project? How about for just one person writing it?
I know this metric is questionable, but there is a significant difference, for a single developer, between 1k and 10k LOC. I typically use whitespace for readability, especially for SQL statements, and I try to reduce the LOC count for maintenance purposes, following as many best practices as I can.
For example, I created a unified diff of the code I modified today, and it was over 1k LOC (including comments and blank lines). Is "modified LOC" a better metric? I have ~2k LOC in total, so it's surprising I modified 1k. I guess rewriting counts as both a deletion and an addition, which doubles the stats.
A slightly less useless metric - time of compilation.
If your project takes more than... say, 30 minutes to compile, it's large :)
Using Steve Yegge as the benchmark at the upper range of the scale, let's say that 500k lines of code is (over?) the maximum a single developer can maintain.
More seriously though; I think once you hit 100k LOC you are probably going to want to start looking for re-factorings before extensions to the code.
Note however that one way around this limit is obviously to compartmentalise the code more. If the sum-total of all code consists of two or three large libraries and an application, then combined this may well be more than you could maintain as a single code-base, but as long as each library is nicely self-contained you aren't going to exceed the capacity to understand each part of the solution.
Maybe another measurement for this would be the COCOMO measure - even though it is probably as useless as LOC.
A single developer could only do organic projects - "small" teams with "good" experience working with "less than rigid" requirements.
In this case, the effort applied, in man-months, is calculated as
2.4 * (kLOC)^1.05
That said, 1 kLOC would need about 2.4 man-months. You can use several cost drivers to refine that, based on product, hardware, personnel, and project attributes.
But all we have done now is projected LOC to a time measurement. Here you again have to decide whether a 2-month or 20-month project is considered large.
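For illustration, the basic organic-mode formula quoted above in a few lines of Python (nominal coefficients only; the product, hardware, personnel, and project cost drivers are left out):

```python
def cocomo_organic_effort(kloc: float) -> float:
    """Basic COCOMO, organic mode: effort in man-months = 2.4 * (kLOC)^1.05."""
    return 2.4 * kloc ** 1.05

for size in (1, 10, 100, 500):
    print(f"{size:>4} kLOC -> {cocomo_organic_effort(size):7.1f} man-months")
# roughly 2.4, 26.9, 302 and 1638 man-months respectively
```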
But as you said, LOC probably is not the right measure to use. Keywords: software metrics, function points, evidence-based scheduling, the planning game.
In my opinion it also depends on the design of your code - I've worked on projects in the 1-10k LOC range that were so poorly designed that they felt like really large projects.
But is LOC really an interesting measure for code? ;-)

"Proper" way to give clients or managers a reality check on software estimates

Looking back at my past projects I often encounter this one:
A client or a manager presents a task to me and asks for an estimate. I give an estimate, say 24 hours. They also ask a business analyst whose experience, from what I've heard, is mostly non-technical, and who gives an estimate of, say, 16 hours. In the end, they go with the value given by the analyst, even though, aside from providing an estimate on my side, I've explained to them the feasibility of the task on the technical side. They treat the analyst's estimate as a "fact of life", even though it is only an estimate and the true value lies in the actual task itself. Worse, I see a pattern: they tend to be biased toward the lower value (if I presented a lower estimate than the analyst, they would quickly take mine), regardless of the feasibility of the task. If you have read Peopleware, they are the type of people who, given a set of work hours, will do anything and everything in their power to shorten it, even though that is not really possible.
Do you have specific negotiation skills and tactics that you used before to avoid this?
If I can help it, I would almost never give a number like "24 hours". Doing so makes several implicit assumptions:
The estimate is accurate to within an hour.
All of the figures in the number are significant figures.
The estimate is not sensitive to conditions that may arise between the time you give the estimate and the time the work is complete.
In most cases these are demonstrably wrong. To avoid falling into the trap posed by (1), quote ranges to reflect how uncertain you are about the accuracy of the estimate: "3 weeks, plus or minus 3 days". This also takes care of (2).
To close the loophole of (3), state your assumptions explicitly: "3 weeks, plus or minus 3 days, assuming Alice and Bob finish the Frozzbozz component".
IMO, being explicit about your assumptions this way will show a greater depth of thought than the analyst's POV. I'd much rather pay attention to someone who's thought about this more intensely than someone who just pulled a number out of the air, and that will certainly count for plus points on your side of the negotiation.
Do you not have a work breakdown structure that validates your estimate?
If your manager/customer does not trust your estimate, you should be able to easily prove it beyond the ability of an analyst.
Nothing makes your estimate intrinsically better than his beyond the breakdown that shows it to be true. Something like this for example:
Gather Feature Requirements (2 hours)
Design Feature (4 hours)
Build Feature
    1 easy form (4 hours)
    1 easy business component (4 hours)
    1 easy stored procedure (2 hours)
Test Feature
    3 easy unit tests (4 hours)
    1 regression test (4 hours)
Deploy Feature
    1 easy deployment (4 hours)
==========
Total: 28 hours
Then you say "Okay, I came up with 28 hours, show me where I am wrong. Show me how you can do it in 16."
Sadly, Scott Adams has a lot to contribute to this debate:
Dilbert: "In a perfect world the project would take eight months. But based on past projects in this company, I applied a 1.5 incompetence multiplier. And then I applied an LWF of 6.3."
Pointy-Haired Boss: "LWF?"
Alice: "Lying Weasel Factor."
You can "control" clients a little easier than managers since the only power they really have is to not give the work to you (that solves your incorrect estimates problem pretty quickly).
But you just need to point out that it's not the analyst doing the work, it's you. And nobody is better at judging your times than you are.
It's a fact of life that people paying for the work (including managers) will focus on the lower figure. Many times I've submitted proper estimates with lower (e.g., $10,000) and upper bounds (e.g., $11,000) and had emails back saying that the clients were quite happy that I'd quoted $10,000 for the work.
Then, for some reason, they take umbrage when I bill them $10,500. You have to make it clear up front that estimates are, well, estimates, not guarantees. Otherwise they wouldn't be paying time-and-materials but fixed-price (and the fixed price would be considerably higher to cover the fact that the risk is now yours, not theirs).
In addition, you should include all assumptions and risks in any quotes you give. This will both cover you and demonstrate that your estimate is to be taken more seriously than some back-of-an-envelope calculation.
One thing you can do to try to fix this over time, and improve your estimating skills as well, is to track all of the estimates you make, and match those up with the actual time taken. If you can go back to your boss with a list of the last twenty estimates from both you and the business analyst, and the time each actually took, it will be readily apparent whose estimates you should trust.
Under no circumstances give a single figure; give a best, a worst, and a most likely. If you respond correctly, then the next question should be "How do I get a more accurate number?", to which the answer should be more detailed requirements and/or design, depending on where you are in the lifecycle.
Then you give another, more refined range of best, most likely, and worst. This continues until you are done.
This is known as the cone of uncertainty. I have lost count of the number of times I have drawn it on a whiteboard when talking estimates with clients.
Do you have specific negotiation skills and tactics that you used before to avoid this?
Don't work for such people.
Seriously.
Changing their behavior is beyond your control.