I wanted to understand what programmers generally mean when they use the term "brute force" in their work.
Many programming problems are a search of a data space, e.g. a walk of a list, tree, graph, etc. In solving the problem, all of the data is searched or walked.
If one wants to make the code faster, they will start to notice patterns that can be used to remove unnecessary parts of the search space.
When code searches the entire space that is "brute force". When optimizations are used to make the search more efficient that is not "brute force".
In other words, when you first start writing code for an unknown problem you will start with brute force, and then as you learn tricks (find optimizations) it will no longer be brute force.
As an example, suppose one needs to find the first entry equal to 1 in a list. The brute force method would search the entire list even after finding the first 1. But if one knows that only the first 1 is needed, then as soon as it is found, searching the remainder of the list is unnecessary.
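A minimal sketch of that difference in Python (the list contents and function names here are made up for illustration):

def first_one_brute_force(values):
    # Brute force: keeps walking the whole list even after a hit.
    found = None
    for index, value in enumerate(values):
        if value == 1 and found is None:
            found = index
    return found

def first_one_early_exit(values):
    # Optimized: stop as soon as the first 1 is found.
    for index, value in enumerate(values):
        if value == 1:
            return index
    return None

data = [7, 3, 1, 9, 1, 4]
print(first_one_brute_force(data), first_one_early_exit(data))  # 2 2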
A brute force approach to a problem is analogous to...
Forcing the door instead of picking the lock;
Hammering in a screw;
Executing all the suspects instead of determining who is guilty;
Fishing with dynamite;
Drawing with a grid;
Decorating ostentatiously instead of tastefully;
Victory through overwhelming numbers; or ...
Applying vast computational resources to a problem instead of finding an efficient solution.
Brute force is the application of a naïve algorithm to a problem, relying on computational resources instead of algorithmic efficiency to make the problem tractable.
@Guy Coder is correct that it often comes up when searching a data set, but it can be applied to other types of problems as well.
For example, suppose you needed to reverse the order of a linked list. Consider these approaches:
1. Adjust the pointers between the elements so that they are linked in the reverse order. That can be done in linear time with a fixed amount of memory (see the sketch after this list).
2. Create a new list by walking the original list and pre-pending a copy of each element, then throw away the old list. That can also be done in linear time (though the constant will be higher), but it also uses memory in proportion to the length of the list.
3. Systematically create all possible linked lists until you discover one that's the reverse of the original list. It's an exhaustive search over an unbounded solution space (which is different from an exhaustive search of a finite data set). Since there are an infinite number of linked lists to try, no matter how much computing resource you can apply, it might never finish.
4. Like 3, but revised to generate and test all possible linked lists of the same length as the original list. This bounds the solution space to O(n!), which can be huge but is finite.
5. Start coding without an understanding of how to solve the problem, and iterate away until something works.
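Here is a sketch of approach 1 in Python; the minimal Node class is a stand-in for whatever list representation you have, not code from the original answer:

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse(head):
    # Re-point one link per step: O(n) time, O(1) extra memory.
    prev = None
    while head is not None:
        head.next, prev, head = prev, head, head.next
    return prev

# Build 1 -> 2 -> 3, reverse it, and print 3 2 1.
node = reverse(Node(1, Node(2, Node(3))))
while node:
    print(node.value)
    node = node.next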
"Brute force" is technical slang rather than jargon with a precise definition. For different people, it will carry different connotations. Which of these solutions you consider to be "brute force" depends on what the term connotes for you.
For me, and many of the programmers and software engineers I've worked with, "brute force" carries these connotations:
The application of brute force is an intentional engineering decision. Brute force might be selected because:
The brute force method is easy to get correct
To create a reference implementation to check the results of a more efficient algorithm
The more efficient algorithm is not known
The more efficient algorithm is hard to implement correctly
The size of the problem is small enough that there is not much difference between brute force and the clever algorithm
A brute force solution must solve the problem. An implementation that attempts an exhaustive search of an unbounded space is not a general solution to the problem.
"Brute force" is usually a superlative. We say, "I used the brute force solution" rather than "a brute force solution." The implication is that, for a given problem, there's one straight-forward, direct, and obvious algorithm most programmers would recognize (or come up with) as the brute force solution for a given problem.
For those like me who feel the term has all of these connotations, only approach #4 is brute force. Those who disagree aren't wrong. For them, the term carries different connotations.
Although there exist several posts about (multi)lateration, I would like to summarize some approaches and present some issues/questions to clarify the approach better.
It seems that there are two ways to determine the target location: a geometric/analytic approach (solving the equations directly with some trick) and a fitting approach (converting the non-linear system into a linear one).
With respect to the first one, I would like to ask a few questions.
Suppose we have perfect range measurements and consider the 2D case, so the exact solution is the unique point where the three circles intersect. Can anyone point to a geometric solution for this first case? I found this approach: https://math.stackexchange.com/questions/884807/find-x-location-using-3-known-x-y-location-using-trilateration
but it seems to fail when two of the points have the same y coordinate, as we can get a division by 0. Moreover, can this be extended to 3D?
The same solution can be obtained using the second approach, forming a linear system Ax = b and then recovering x = A^-1 b, or using least squares: x = (A^T A)^-1 A^T b.
Please see http://www3.nd.edu/~cpoellab/teaching/cse40815/Chapter10.pdf
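For what it's worth, here is a small Python sketch of that linearized least-squares idea (it assumes NumPy is available; the anchor positions and ranges in the example are made up). Subtracting the first circle equation from the others removes the quadratic terms, leaving a linear system that can be solved even when the circles do not intersect exactly:

import numpy as np

def trilaterate_2d(anchors, ranges):
    # anchors: list of (x, y); ranges: measured distances to the unknown point.
    anchors = np.asarray(anchors, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    x1, y1 = anchors[0]
    r1 = ranges[0]
    A = 2.0 * (anchors[1:] - anchors[0])          # rows: [2(xi - x1), 2(yi - y1)]
    b = (r1**2 - ranges[1:]**2
         + np.sum(anchors[1:]**2, axis=1)
         - (x1**2 + y1**2))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)     # p = (A^T A)^-1 A^T b
    return p

# Example: the unknown point is near (2, 3).
print(trilaterate_2d([(0, 0), (10, 0), (0, 10)], [3.6056, 8.5440, 7.2801]))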
What about the case when the three circles have no common intersection point? It seems that the second approach still finds a solution. Is this normal? How can it be explained?
What about the first approach when the range measurements are noisy? Does it find an approximate solution, or does it fail?
Considering 3D, it seems that at least 4 anchors are needed to provide a unique solution. However, with 3 anchors it can provide 2 solutions. Can any of you provide the equations to find the two solutions? This can be useful because, even with two solutions, we may discard one by checking whether the values agree with our scenario, e.g., the GPS case, where we pick the solution located on the Earth. The second, least-squares approach would instead always provide a single solution, possibly the wrong one.
Do you know of any existing C/C++ library that implements some of these techniques, and maybe some more complex fitting functions (non-linear, etc.)?
Thank you
Regards
This was an interview question that someone asked me and I didn't really have a good answer. I was wondering if someone could possibly help me understand the solution to this:
"You have a stream of billion tweets coming in. How will you figure out the top 10 hashtags ? "
Thanks
Create a map, with a hashtag as the key and a counter as a value.
Increment the counter of each tag in each tweet you receive.
Examine the value of the counters to find the top 10 (see the sketch below).
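A minimal Python sketch of this straightforward approach, assuming the hashtags can be split out of each tweet (the sample data is made up):

from collections import Counter
import heapq

def top_hashtags(tweets, k=10):
    counts = Counter()
    for tweet in tweets:
        for token in tweet.split():
            if token.startswith("#"):
                counts[token] += 1
    # nlargest avoids sorting every counter just to get the top k.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

print(top_hashtags(["#python rocks", "love #python and #golang", "#golang"]))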
Your phrasing of the question doesn't include any constraints that would prohibit this straightforward solution. In an interview situation, I would have asked clarifying questions to elicit these constraints.
Under constraints like, "it has to run in linear time," and, "it has to use a constant amount of memory," much more interesting answers emerge.
I am not sure if there is a constant memory solution to the problem as posed, but I know one for a related (and often more useful) problem: identifying elements that constitute a given fraction of results. I gave it as an answer to a similar question.
(I say, "more useful", because if the total fraction of a given item falls below a threshold, it's more likely to be noise than true "Top 10" material.)
You probably can't analyze all the tweets, so you just analyze a random sample. Find the top 10 from that sample and you can find the top 10 (to some degree of certainty, depending on the sample size and quality of the sample).
I don't think they were looking for an actual solution here, but more probing your thought process on how you might solve a (practically) impossible problem.
I have the following requirement:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make a kind of Google "Did you mean?". This will list all the possible values from my datastore. There is a similar but not the same question here. This did not answer my question.
My questions:
1) I think it is not advisable to store this data in an RDBMS, because then I can't filter effectively in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose there is a name Franky in the dataset.
If a user types Phranky, how do I match it to Franky? Do I have to loop through all the names?
I came across the Levenshtein distance, which would be a good technique to find the possible strings. But again, my question is: do I have to compute it against all 1 million values from my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, say, distance algorithms, because the former method will require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the Soundex values for each name and store them in the database, then index that column to avoid having to scan the table.
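In case it helps, here is a small Python sketch of American Soundex for generating that precomputed key (the names in the example are illustrative). Note that Soundex keeps the first letter, so Phranky (P652) and Franky (F652) would not collide; the Metaphone family mentioned in another answer handles that case better:

def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if ch in "hw":
            continue                 # h and w do not reset the previous code
        if code and code != prev:
            digits.append(code)
        prev = code                  # vowels reset prev, so repeats recount
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))  # R163 R163 A261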
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches. (It's based on the Levenshtein distance.)
(Update: after having read Ben S's answer, I think using an existing solution, possibly Aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
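A rough sketch of that heuristic in Python, assuming a log of (timestamp, query, clicked-result count) events; the field names and thresholds are illustrative, and difflib's similarity ratio stands in for an explicit Levenshtein threshold:

from difflib import SequenceMatcher

def possible_refinement(first, second, max_gap_seconds=60, min_similarity=0.8):
    # True if `second` looks like a corrected version of `first`.
    close_in_time = (second["timestamp"] - first["timestamp"]) <= max_gap_seconds
    first_failed = first["clicks"] == 0
    second_worked = second["clicks"] > 0
    similar = SequenceMatcher(None, first["query"], second["query"]).ratio() >= min_similarity
    return close_in_time and first_failed and second_worked and similar

a = {"timestamp": 100, "query": "someting", "clicks": 0}
b = {"timestamp": 130, "query": "something", "clicks": 3}
print(possible_refinement(a, b))  # True: store "someting" -> "something"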
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names and selecting an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, transposing an e and a q is more likely than transposing an e and an l.
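A condensed sketch of that approach (Norvig's essay has the complete version, including edit distance 2 and proper probabilities; the tiny inline corpus here is only a stand-in so the sketch runs):

import re
from collections import Counter

SAMPLE_CORPUS = "something is something that something does when nothing else happens"
WORDS = Counter(re.findall(r"[a-z]+", SAMPLE_CORPUS.lower()))

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from `word`.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    # Keep only candidates seen in the corpus, then pick the most frequent one.
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correction("someting"))  # "something"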
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, candidates have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, however if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word and only suggest words that are more likely.
My experience with loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(n log n), and an O(n^2) algorithm, perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
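A hedged Python sketch of this layered-sieve idea; the specific filters and names are illustrative, not a prescribed pipeline:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def suggest(query, names, max_distance=2):
    q = query.lower()
    # Cheap O(n) sieves first: first letter, then plausible length.
    candidates = [n for n in names if n.lower().startswith(q[0])]
    candidates = [n for n in candidates if abs(len(n) - len(q)) <= max_distance]
    # The expensive edit-distance check runs only on what is left.
    return [n for n in candidates if levenshtein(q, n.lower()) <= max_distance]

print(suggest("Smiht", ["Smith", "Smythe", "Jones"]))  # ['Smith']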
Possible Duplicate:
How many parameters are too many?
I was just writing a function that took in several values and it got me thinking. When is the number of arguments to a function / method too many? When (if) does it signal a flawed design? Do you design / refactor the function to take in structs, arrays, pointers, etc., to decrease the number of arguments? Do you refactor the data coming in just to decrease the number of arguments? It seems that this could be a little less applicable in OOP designs, though. Just curious to see how others view the issue.
EDIT: For reference the function I just wrote took in 5 parameters. I use the definition of several that my AP Econ teacher gave me. More than 2; less than 7.
I don't know, but I know it when I see it.
According to Steve McConnell in Code Complete, you should
Limit the number of a routine's parameters to about seven
If you have to ask then that's probably too many.
I generally believe that if the parameters are functionally related (e.g., coordinates or color components), they should be encapsulated as a class, for good measure.
Not that I always follow this myself ;)
Robert C. Martin (Uncle Bob) recommends 3 as a maximum in Clean Code: A Handbook of Agile Software Craftsmanship
I don't have the book with me at the moment but his reasoning has to do with one, two and, to a lesser extent, three argument functions reading well and clearly showing the purpose of the function.
This of course goes hand in hand with his recommendation of very short, well named functions that adhere to the Single Responsibility Principle.
Quick answer: When you have to stop and ask that question, you've got too many.
Personally I like to keep the number under six. If more is needed, then the solution depends on the problem. One approach is to use "setter" functions to give the values to an object that will eventually perform the function you desire. Another option is to use a struct, as you mentioned. Either way, you can't really go wrong.
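For what it's worth, a small Python sketch of the struct/parameter-object idea; the names are illustrative, not from this answer:

from dataclasses import dataclass

@dataclass
class RenderOptions:
    # Groups related parameters into a single value with sensible defaults.
    width: int = 800
    height: int = 600
    dpi: int = 96
    antialias: bool = True

def render(scene, options=None):
    # One parameter object instead of four or five loose arguments.
    options = options or RenderOptions()
    print(f"rendering {scene} at {options.width}x{options.height} ({options.dpi} dpi)")

render("teapot", RenderOptions(width=1920, height=1080))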
Well it would most certainly depend on what your function is doing as far as how many would be considered "too many". Having said that, it is certainly possible to have a function with a lot of different parameters that are options on how to handle certain cases inside the function, and having overloads to those functions with sane default values for those options.
With the pervasiveness of Intellisense (or equivalent in other IDEs) and tooltips showing the comments from the XML Documentation in Visual Studio, I don't really think that there's a firm answer to this question.
Too many parameters is a "code smell".
You can split the work into multiple methods or use a class to group variables that have something in common.
Putting a number on "too many" is very subjective and depends on your organization and the language you use. A rule of thumb is that if you can't read the signature of your method and get an idea of what it is doing, then you might have too much information in it. Personally, I try not to go over 5 parameters.
For me it is 5.
It is hard to manage (remember names, order, etc.) beyond that. Plus, if I get that far, I usually have versions with default values that call this one.
It depends on the function as well. If your function requires heavy user intervention or many variables, I wouldn't go past the 7-8 range. As for the average number of parameters to go with, 5-6 is the sweet spot in my opinion. If you are using more than that, you might want to consider class objects as parameters or other, smaller functions.
It varies from person to person. Personally, when I have trouble immediately understanding what a function call is doing by reading the invocation in code, it is time to refactor to take the strain off of my gray cells.
I've heard that 7 figure as well, but I somehow feel that it stems from a time when all you could pass were primitive values.
Nowadays you can pass a reference to an object that encapsulates some complex state (and behaviour). Using 7 of those would definitely be too much.
My personal goal is to avoid using more than 4.
It depends strongly on the types of the arguments. If they are all integers then 2 can be too many. (how do I remember which order?) If any argument accepts null, then the number drops drastically.
The real answer comes from asking yourself:
how easy is it to understand calls when I'm reading code?
how easy is it to remember the correct arguments and argument order when writing code?
It also depends on the programming language. In C, it's really not rare to see functions with 7 parameters. However, in C#, I have rarely seen more than 5 parameters, and I personally usually use fewer than 3.
// In C
draw_dot(x, y, size, red, green, blue, alpha);
// In C#
Point point = new Point(x, y);
Color color = Color.FromArgb(alpha, red, green, blue);
Tool.DrawDot(point, color);
I would say a maximum of 4. Anything above that, I think, should be placed within a class.