What do programmers mean when they use the term "brute force approach to the problem"? - language-agnostic

I wanted to understand what programmers generally mean when they use the term "brute force" in their work.

Many programming problems are a search of a data space, e.g. a walk of a list, tree, graph, etc. In solving the problem, all of the data is searched or walked.
If one wants to make the code faster, one will start to notice patterns that can be used to remove unnecessary parts of the search space.
When code searches the entire space that is "brute force". When optimizations are used to make the search more efficient that is not "brute force".
In other words, when you first start writing code for an unknown problem you will start with brute force, and then as you learn tricks (find optimizations) it will no longer be brute force.
As an example, suppose one needs to find the first entry equal to 1 in a list. The brute force method would search the entire list even after finding the first 1. But if one knows that only the first 1 is needed, then as soon as it is found, searching the remainder of the list is unnecessary.
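For illustration, a minimal sketch of both versions in Python (the function names are made up for the example):

def find_first_one_brute_force(values):
    # Brute force: walk the entire list, even after a match is found.
    found_index = -1
    for i in range(len(values)):
        if values[i] == 1 and found_index == -1:
            found_index = i
    return found_index

def find_first_one_early_exit(values):
    # Optimized: stop as soon as the first 1 is found.
    for i, value in enumerate(values):
        if value == 1:
            return i
    return -1

Both return 2 for [0, 5, 1, 7, 1]; the second just stops looking sooner.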

A brute force approach to a problem is analogous to...
Forcing the door instead of picking the lock;
Hammering in a screw;
Executing all the suspects instead of determining who is guilty;
Fishing with dynamite;
Drawing with a grid;
Decorating ostentatiously instead of tastefully;
Victory through overwhelming numbers; or ...
Applying vast computational resources to a problem instead of finding an efficient solution.

Brute force is the application of a naïve algorithm to a problem, relying on computational resources instead of algorithmic efficiency to make the problem tractable.
@Guy Coder is correct that it often comes up when searching a data set, but it can be applied to other types of problems as well.
For example, suppose you needed to reverse the order of a linked list. Consider these approaches:
1. Adjust the pointers between the elements so that they are linked in the reverse order. That can be done in linear time with a fixed amount of memory.
2. Create a new list by walking the original list and prepending a copy of each element, then throw away the old list. That can also be done in linear time (though the constant will be higher), but it also uses memory in proportion to the length of the list. (Both of these are sketched below.)
3. Systematically create all possible linked lists until you discover one that's the reverse of the original list. It's an exhaustive search over an unbounded solution space (which is different from an exhaustive search of a finite data set). Since there are an infinite number of linked lists to try, no matter how much computing resource you can apply, it might never finish.
4. Like 3, but revised to generate and test all possible linked lists of the same length as the original list. This bounds the solution space to O(n!), which can be huge but is finite.
5. Start coding without an understanding of how to solve the problem, and iterate away until something works.
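For concreteness, here is a rough Python sketch of approaches 1 and 2, assuming a minimal singly linked Node class (not part of the original answer):

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse_in_place(head):
    # Approach 1: re-link the existing nodes. Linear time, fixed memory.
    prev = None
    while head is not None:
        head.next, prev, head = prev, head, head.next
    return prev

def reverse_by_copying(head):
    # Approach 2: build a new list by prepending copies. Linear time,
    # but memory proportional to the length of the list.
    new_head = None
    while head is not None:
        new_head = Node(head.value, new_head)
        head = head.next
    return new_head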
"Brute force" is technical slang rather than jargon with a precise definition. For different people, it will carry different connotations. Which of these solutions you consider to be "brute force" depends on what the term connotes for you.
For me, and many of the programmers and software engineers I've worked with, "brute force" carries these connotations:
The application of brute force is an intentional engineering decision. Brute force might be selected because:
The brute force method is easy to get correct
To create a reference implementation to check the results of a more efficient algorithm
The more efficient algorithm is not known
The more efficient algorithm is hard to implement correctly
The size of the problem is small enough that there is not much difference between brute force and the clever algorithm
A brute force solution must solve the problem. An implementation that attempts an exhaustive search of an unbounded space is not a general solution to the problem.
"Brute force" is usually a superlative. We say, "I used the brute force solution" rather than "a brute force solution." The implication is that, for a given problem, there's one straight-forward, direct, and obvious algorithm most programmers would recognize (or come up with) as the brute force solution for a given problem.
For those like me who feel the term has all of these connotations, only approach #2 is brute force. Those who disagree aren't wrong. For them, the term carries different connotations.

Related

Hashing Function vs. Loop Search

I have an array of structures, ~100 unique elements, and the structure is not large. Due to legacy code, to find an element in this array I use a hash function to find a likely starting point to loop from until I find the element I want.
My question is this: is the hash function (and resulting hash table) overkill?
I know that for large tables hashing is essential for good response time, but for a table this size?
More succinctly, is there a table size below which writing a hash function is unnecessary?
Language agnostic answers please.
Thanks,
A hash lookup trades better scalability for a bigger up-front computation cost. There is no inherent cutoff table size, as it depends on the cost of your hash function. Roughly speaking, if calculating your hash function has the same cost as one hundred equality comparisons, then you could only theoretically benefit from the hash map at some point above one hundred items. The only way to get specific answers for your case is to measure the performance.
My guess though, is that a hash map for 100 items for performance reasons is overkill.
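If you do want to measure rather than guess, a throwaway micro-benchmark along these lines can settle it for your data (all names hypothetical; Python used for brevity):

import random
import string
import timeit

# Hypothetical stand-in for the array of ~100 structures, keyed by string.
keys = [''.join(random.choices(string.ascii_lowercase, k=8)) for _ in range(100)]
records = [(k, object()) for k in keys]
table = dict(records)               # the hash-table alternative
probe = random.choice(keys)

def linear_search():
    for k, v in records:
        if k == probe:
            return v

def hash_lookup():
    return table[probe]

print('linear:', timeit.timeit(linear_search, number=100_000))
print('hash:  ', timeit.timeit(hash_lookup, number=100_000))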
The standard, obvious answer would be/is to write the simplest code that can do the job. Ensure that your interface to that code is as clean as possible so you can replace it when/if needed. Later, if you find that code takes an unacceptable amount of time, replace it with something that improves performance.
On a theoretical basis, however, it's impossible to guess at the upper limit on the number of items for which a linear search will provide acceptable performance for your task. It's also impossible to guess at the number of items for which a hash table will provide better performance than a linear search.
The main point, however, is that it's rarely necessary to try to figure out (especially on a poorly-defined theoretical basis) what data structure would be best for a given situation. In most cases, you just need to make an acceptable decision, and implement it so you can change your mind later if it turns out to be unacceptable after all.
When creating the array (or after it's created), sort your array of unique elements by their key value. Then use binary search rather than a hash or linear search. You get a simple implementation, no extra memory usage, and good performance.
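As a minimal sketch of that suggestion in Python, using the standard bisect module (the record contents are made up):

import bisect

# Sort once, at creation time, by key value.
records = sorted([('delta', 4), ('alpha', 1), ('charlie', 3), ('bravo', 2)])
keys = [k for k, _ in records]    # parallel list of keys for bisect

def lookup(key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    raise KeyError(key)

For ~100 elements this is at most about 7 comparisons per lookup, with no hash computation and no extra memory beyond the key list.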

What's the term to describe an algorithm that has the same complexity as the underlying problem?

Whilst researching data structures for a project a few months back, I came across a term that I quite liked, that could be used as follows:
This [Algorithm/Solution/Data structure] is ?????ally-optimal
Meaning that the time (or space, depending on context) complexity of the solution being referred to is the same as the fundamental complexity of the problem it solves.
For example, if we ignore quantum computation and accept that the problem of sorting is O(n log n) time in the general case, then with respect to time complexity heap sort is ?????ally-optimal because its complexity is also O(n log n), whereas bubble sort is not ?????ally-optimal because O(n^2) is worse than O(n log n).
I have no idea where I read it, I've so far failed to find it with google, and not being able to remember it has been bothering me ever since!
Maybe you are talking about an asymptotically optimal algorithm:
In computer science, an algorithm is said to be asymptotically optimal if, roughly speaking, for large inputs it performs at worst a constant factor (independent of the input size) worse than the best possible algorithm.
Are you thinking of "computationally optimal"? Probably "asymptotically optimal," like another answer said. It seems what you're describing is big-theta:
If a problem has been proven to require at least f(x) work, the problem is called Omega(f(x)); an algorithm whose worst case is g(x) is big-O(g(x)). When f(x) == g(x), that is to say when the worst case of the solution matches the lower bound of the problem, the algorithm is big-theta(f(x)). So heapsort, e.g., is theta(n log n).

How to correct user input (kind of like Google's "Did you mean?")

I have the following requirement:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make a kind of Google "Did you mean?". This will list all the possible values from my datastore. There is a similar, but not the same, question here; it did not answer my question.
My questions:
1) I think it is not advisable to store the data in an RDBMS, because then I won't be able to filter in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types Phranky, how do I match it to Franky? Do I have to loop through all the names?
I came across the Levenshtein distance, which would be a good technique to find the possible strings. But again, my question is: do I have to run it against all 1 million values in my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say, distance algorithms, because the former method requires a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
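As a rough illustration, here is a simplified Soundex encoder in Python (it skips some details of the full algorithm, such as the special treatment of H and W). Note that classic Soundex keys on the first letter, so Phranky (P652) and Franky (F652) still get different codes; that weakness is one reason people reach for Metaphone-style algorithms instead:

CODES = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
         **dict.fromkeys('dt', '3'), 'l': '4',
         **dict.fromkeys('mn', '5'), 'r': '6'}

def soundex(name):
    name = name.lower()
    encoded = name[0].upper()
    prev = CODES.get(name[0], '')
    for ch in name[1:]:
        code = CODES.get(ch, '')
        if code and code != prev:   # skip uncoded letters and runs of one code
            encoded += code
        prev = code
    return (encoded + '000')[:4]    # pad or truncate to four characters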
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on the Levenshtein distance).
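For reference, here is the textbook dynamic-programming implementation of the Levenshtein distance mentioned above. It is O(len(a) * len(b)) per pair, which is exactly why you don't want to run it against a million names unfiltered:

def levenshtein(a, b):
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # delete ca
                               current[j - 1] + 1,              # insert cb
                               previous[j - 1] + (ca != cb)))   # substitute
        previous = current
    return previous[-1]

For example, levenshtein('Phranky', 'Franky') is 2, and levenshtein('Sean', 'Shawn') is 3, matching the figures above.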
(Update: after having read Ben S's answer, I think using an existing solution, possibly Aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd populate the Solr index with the list of names and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure, they can model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary, and even though it is a common misspelling, the correct spelling is far more common. When you find similar words, you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus, and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, mistyping a q for an e is much more likely than mistyping an l for an e.
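Here is a condensed sketch of that approach, closely following Norvig's published essay; big.txt stands in for the corpus of Project Gutenberg books, and the probability of a word is just its relative frequency:

import re
from collections import Counter

# big.txt is assumed to be a local corpus file, as in Norvig's essay.
WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    # Prefer the word itself, then known words one edit away, then two.
    candidates = (known([word]) or known(edits1(word)) or
                  {e2 for e1 in edits1(word) for e2 in known(edits1(e1))} or
                  [word])
    return max(candidates, key=WORDS.get)

correct('someting') should come back as 'something', assuming the corpus contains it.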
For people who are recommending Soundex: it is very out of date. Metaphone (simpler) or Double Metaphone (more complex) are much better. If it really is name data, they should work fine, provided the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all they have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false positive rate) can bring the complexity way down (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith; however, if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word, and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
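A sketch of that layering in Python, where the sieves mirror the ones suggested above and levenshtein is the function sketched earlier on this page:

def suggest(query, names, max_distance=2):
    q = query.lower()
    # Sieve 1: the length gap is a lower bound on edit distance (cheap cull).
    pool = (n for n in names if abs(len(n) - len(q)) <= max_distance)
    # Sieve 2: cull everything that doesn't end in the correct letter.
    pool = [n for n in pool if n and n.lower()[-1] == q[-1]]
    # Heavy layer: run the expensive edit distance only on the survivors.
    return [n for n in pool if levenshtein(n.lower(), q) <= max_distance]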

How does the verbosity of identifiers affect the performance of a programmer?

I always wondered: Are there any hard facts which would indicate that either shorter or longer identifiers are better?
Example:
clrscr()
opposed to
ClearScreen()
Short identifiers should be faster to read because there are fewer characters, but longer identifiers often better resemble natural language and therefore should also be faster to read.
Are there other aspects which suggest either a short or a verbose style?
EDIT: Just to clarify: I didn't ask: "What would you do in this case?". I asked for reasons to prefer one over the other, i.e. this is not a poll question.
Please, if you can, add some reason on why one would prefer one style over the other.
I'd go for clarity over verbosity or brevity.
ClearScreen()
is easier to parse than
clrscr()
in my opinion, but
ClearScreenBeforeRerenderingPage()
is just noise.
Abbreviations put a much greater burden on the reader. They are ambiguous; they are indirect; they are harder to discriminate. They burden the writer, too, for s/he must always be asking, "was that Cmd for Command, or Cmnd... or Cm?". They clash - a given abbreviation rule could produce the same abbreviation for two (or more) different words.
Because they are ambiguous, the reader must take time to think about what word is intended; if the word itself is present, the reader need only think about its meaning.
Because they are indirect, they are analogous to a pointer - just as there's a little processing time consumed by every pointer dereference, there's a little (human) processing time consumed, and additional memory occupied, by every abbreviation.
Certainly .NET developers should be following the .NET Naming Guidelines.
This suggests that the full names should be used, not abbreviations:
Do not use abbreviations or contractions as parts of identifier names. For example, use GetWindow instead of GetWin.
Personally I like to try to follow the wisdom of Clean Code by Uncle Bob.
To me it suggests that code should read like prose. By using descriptive names and ensuring that methods have a single responsibility, we can write code that accurately describes the programmer's intent, obviating the need for comments (in most cases).
He goes on to make the observation that when we write code, we often spend 90% of the time reading the surrounding code and dependent code. Therefore by writing code that is inherently readable we can be far more productive in our writing of code.
I remember a talk from Larry Wall some time ago where he talked about the verbosity of identifiers when you have non-native English speakers in your team.
ClearScreenBeforeRerenderingPage()
parses fine if you're a native English speaker. However he suggests (and experience shows) that:
Clear_Screen_Before_Rerendering_Page()
is much better.
The effect is exacerbated when you don't use both upper and lower case.
My basic rule is that every line of code is written only once, but read 10, 100, or 1000 times. According to this, the effort of typing is totally irrelevant. All that counts is the effort to read something.
Of course, this alone still leaves enough room for subjective opinions (is clrscrn readable enough?), but at least it narrows the field somewhat.
Please go directly to the relevant chapter of Steve McConnell's "Code Complete" (sanitised Amazon link).
Specifically, chapter 11, "The Power of Variable Names".
My personal preference is to have verbose public identifiers and short private ones.
By public I mean class names, method names, global variables and constants, packages, namespaces - in short anything that can be accessed from a large number of places and by large number of people.
By private I mean local variables, private members, sometimes parameters - stuff that is only accessed from inside a limited local scope and by a single developer.
Also consider where it's going to be used; ClearScreen() is likely to appear on its own.
However, you cringe when new programmers, who have learned that identifiers must be easily readable, produce code like:
screenCoordinateVertical = gradientOfLine * screenCoordinateHorizontal + screenCoordinateOrigin;
instead of
y = m*x + C;
Every developer should know how to touch type. Adding a few extra characters is not a productivity issue unless you're a hunt and peck typist.
With that said, I agree with the previous comments about balance. As with so many answers here, "it depends". But I favor verbosity and clarity. Taking out vowels is for old DBAs.
Always using full words in identifiers also helps to maintain consistency.
With abbreviations, there is always the question of whether the word was abbreviated, and if so, how.
For example, right now I'm looking at code which has index abbreviated as ndx or idx in various places.
For very local context it is OK to abbreviate, but then I'd use only the first letter of each word to guarantee consistency. E.g. sb for StringBuilder.
As a programmer I do much more thinking while programming than typing, so the extra time spent typing a longer identifier is of no relevance. And today my IDE supports me: I only have to type a few letters and the IDE lets me choose from the legal identifiers. So the productivity argument against longer identifiers is even weaker today than it was a few years ago.
On the other side, you gain readability if you choose more meaningful identifiers. Since you will read source code more often than you write it, this is very important. Another point is that abbreviations are often ambiguous: do you abbreviate number as no, or as num? That makes errors more probable, as you may choose the wrong identifier.
I think you'll find precious few hard facts, but a lot of opinion, on this subject.
The Wikipedia page on this topic links to a paper on a cost/benefit analysis of identifier naming issues (External Links section), but no language I know of bases its official or accepted naming conventions on the basis of a "scientific" study.
Looking at code in a social context, you should follow the naming conventions imposed by:
The project
The company
The programming language
.. in that order.
It's really all about finding a balance between the two, that is easy enough to read while at the same time not overly verbose. Many people have a personal dislike for Java or Win32's elaborate function/class names, but many others dislike very short identifiers as well.
Most modern IDEs (and even older ones) have an auto-complete feature, so it doesn't really take more time to type a long identifier (once it is declared, of course). So I'd go for clarity over brevity; it makes the program much easier to read and more self-explanatory.
Nothing wrong with a short identifier as long as it's obvious what it means.
Personally I'd lean toward the latter, because I prefer to be verbose (though I try not to be as verbose as MS and their CoMarshalInterThreadInterfaceInStream function), but as long as your clear-screen function is not called F() I don't see a problem :)
Naming conventions and coding style are often discussed topics.
That said, the naming conventions are always very subjective -- to people and platform.
Bottom line is always -- let things make sense (yes, very subjective).
Wikibook search -- Naming identifiers
Also, one has to think of where the identifier is used; the well-known i as an iteration counter is a valid example:
for (int i = 0; i < 10; i++) {
    doSomething();
}
If the context is simple the identifier should reflect this accordingly.
No one has mentioned the negative impact on readability of identifiers that are too long. Once you start making identifiers that are 20, 30, 40 or more characters long you cannot write a reasonable statement on one line of text that is readable. Lines of code should be limited to about 80 characters. Anything longer is impossible to read. That is why newspapers are printed in columns. Even this webpage keeps the text column narrow so that it can be read without having to scan back and forth.

Do you use particular conventions for naming complementary variables?

I often find myself trying to come up with good names for complementary pairs of variables; where two variables denote opposing concepts, two participants in some sort of duologue, and so on.
This might be better explained by a counter-example - I maintain an app that prints two graphics as part of a print advertisement. They're stored in the database as TopLogo and LowerLogo, which I have to stop and double-check every time I use them, because I'm expecting top to complement bottom, and lower to complement upper.
There's some obvious examples that I think work well:
client / server
source / target for copying/moving data or files from one variable to another
minimum / maximum
but there are some concepts that just don't lend themselves to such neat naming schemes. For example, when paging through records, does 'last' mean 'final' or 'previous'? I recently saw some code that used firstPage, previousPage, nextPage and finalPage to avoid the ambiguous lastPage completely, which I thought was very neat, hence this question.
Do you have any particularly neat variable name pairs you'd care to share with us? (Bonus points if they're the same length, which makes the code so much neater in monospaced fonts.)
Like with all kinds of code style conventions, consistency is what you should strive for.
I would have the development team agree on "standard" pairs of prefixes for common scenarios like "source/destination" or "from/to" and then stick with them for the whole project. As long as every developer is aware of what is meant with a particular prefix in the codebase, it is easier to avoid misunderstandings.
Exceptions to the rule should be clarified in the documentation if the variable is part of a public API, or in comments within the code if its visibility is restricted to a single class or method.
In my databases you'll find many valid-state temporal ("history") tables containing a pair of columns named start_date and end_date. No bonus points for me, then, because I'd rather use the commonly used 'end' than try to come up with an intuitive alternative with the same number of characters as the word 'start'.
I tend to prefer these generic terms even when more context-specific terms may be viable, e.g. preferring employee_start_date over employee_hire_date (what if their employment started for a reason other than being formally hired, e.g. their company was the subject of an acquisition). That said, I'd prefer person_birth_date over person_start_date :)
While one does try to be semantically coherent in obvious cases -- e.g., maximum goes with minimum, and not "lowest" -- in well-structured OO code (which isn't all code, I know) the problem disappears with a good IDE. Classes are short, methods are short, and variables are few in each method. So it doesn't matter what you call the variable pairs so long as they're clear. Your code might not look professional, but real quality is in the code, not in the look of your code.
The problem further disappears if there is good JavaDoc or whatever the documentation system is, and if you have good class names to go with them. For instance, if you have an instance of a Connection class and it has a method called setDestination, that's okay, but if you know that setDestination takes one parameter called destination and it's of the Server class, you're cool... even though you might prefer to call it target, aimHere, placeToSendTheData, or whatever (and the corresponding names, source, comingFromHere, and placeToGetTheDataFrom). Plus the doc system says what the thing is for, and that is priceless.
This next thing might sound stupid and I'm sure I'll get voted down here on StackOverflow, but unique non-professional sounding variable names have a great advantage: I know that my variables have names like placeWeWantTheDataToGo (and the IDE takes care of typing it), but the "serious" guys who do the JDK would never use such silly names. So I know immediately that the variable is one of mine. Incidentally, when I worked with developers in Spain and Italy, they write code with Spanish variable names (not always, but usually). This causes the same effect: we can quickly see that the Conexion class is ours, but the Connection class is not.
[Also, instead of typing your variable names, assign them a constant String somewhere in your code and use that, so if they called it lower or downer instead of low, you're still okay.]
Yes, I do try to name complementary sets of variables systematically so that the symmetry is clear. It is not always easy; sometimes, not even possible. Well, not possible using the rules I lay down for myself - which means I usually try to have the names the same length. The 'top' and 'lower' example would drive me batty (assuming I'm not batty already, which is far from certain); I'd probably use 'upper' and 'lower' because those are the same length; 'top' and 'bottom' would frustrate me too because of the difference in length.