"Stop words" list for English? [closed] - language-agnostic

I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".
Where can I find some lists of these uninteresting words?
Is a list of these words the same as a list of the most frequently used words in English?
update: these are apparently called "stop words" and not "skip words".

The magic word to put into Google is "stop words". This turns up a reasonable-looking list.
MySQL also has a built-in list of stop words, but it is far too comprehensive for my tastes. For example, at our university library we had problems because "third" in "third world" was considered a stop word.

These are called stop words; check this sample.

Depending on the subdomain of English you are working in, you may have to (or wish to) compile your own stop word list. Some generic stop words can be meaningful in a domain; e.g. the word "are" could actually be an abbreviation/acronym in some field. Conversely, depending on your application, you may want to ignore some domain-specific words that you would not ignore in general English. For example, if you are analyzing a corpus of hospital reports, you may wish to ignore words like "history" and "symptoms", since they would be found in every report and may not be useful (from a plain-vanilla inverted-index perspective).
Otherwise, the lists returned by Google should be fine. The Porter Stemmer uses this and the Lucene search engine implementation uses this.

Get statistics about word frequency in large text corpora. Ignore all words with a frequency greater than some number.
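A minimal sketch of that suggestion in Python; the corpus path and the threshold are placeholders to tune for your corpus:

```python
from collections import Counter

# Count word frequencies in a large text corpus and treat everything
# above a chosen frequency threshold as a stop word.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(words)

THRESHOLD = 10_000  # tune for the size of your corpus
stop_words = {word for word, count in counts.items() if count > THRESHOLD}
```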

I think I used the stop word list for German from here when I built a search application with lucene.net a while ago. The site contains a list for English, too, and the lists on the site are apparently the ones that the Lucene project uses as defaults, too.

Typically these words will appear in documents with the highest frequency.
Assuming you have a global list of words:
{ Word Count }
With the list of words, if you ordered them from the highest count to the lowest, you would get a graph (count on the y axis, word rank on the x axis) that looks like an inverse log function. All of the stop words would be at the left, and the cut-off point for the "stop words" would be where the steepest drop (the largest first derivative in magnitude) occurs.
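A minimal sketch of that cut-off idea, assuming the counts are already sorted from most to least frequent; note that on raw Zipfian counts the largest absolute drop tends to sit near the very top, so in practice you may prefer to work on log counts or relative drops, as done here:

```python
import math

def stop_word_cutoff(sorted_counts):
    """Return the index of the largest drop between neighbouring counts;
    words before that index are treated as stop words.  Working on log
    counts softens the huge absolute drops at the head of the curve."""
    logs = [math.log(c) for c in sorted_counts]
    drops = [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]
    return drops.index(max(drops)) + 1

# Example: counts for words sorted from most to least frequent.
counts = [9500, 9100, 8800, 8700, 8500, 310, 290, 250, 120, 80]
print(stop_word_cutoff(counts))  # 5 -> the first five words look like stop words
```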
This solution is better than a dictionary approach because:
It is a universal approach that is not bound to a particular language
It learns which words are deemed to be "stop words"
It will produce better results for collections that are very similar, and produce unique word listings for the items in the collections
The stop words can be recalculated at a later time (with this there can be caching and a statistical determination that the stop words may have changed from when they were calculated)
This can also eliminate time based or informal words and names (such as slang, or if you had a bunch of documents that had a company name as a header)
The dictionary approach is better because:
The lookup time is much faster
The results are precached
It's simple
Someone else came up with the stop words.

Related

SQL - Rank or Order by "human" Relevance

Looking to implement a ranking/order-by feature that ranks products by what we as humans regard as relevant, not what a computer regards as relevant. Currently I have this SQL statement:
select MATCH(productName) AGAINST('xyz' IN NATURAL LANGUAGE MODE) AS relevant...
... ORDER BY relevant DESC
This seems to work well with regard to how many times a 'keyword' appears within the recordset, but it's very yay or nay, if you know what I mean.
However, when searching for "computer console" (in the unlikely event), I would like to see "PlayStation", "Xbox" and "Nintendo", although I never actually typed these keywords into the search field.
Searching for "ladder", I would personally expect to see ladders for height access, not the board game "snakes and ladders" or clothing with a ladder pattern.
Same with "iron": I would not expect "Iron Man bedding" to appear on the first page.
Is there an industry-standard way of achieving such a thing, or does anyone have any ideas how this could be accomplished, e.g. a secondary table with keywords/search terms matched to product_id?
Regards
This may not be exactly the same situation as yours but it may help you.
I designed a relevancy-based search results system for a large content management system I developed at my work.
Content comprises a title, the content and a hidden keywords field (words that should be used for search but are not included in the title or content). [There are lots more fields, but these three will do to demonstrate the concept.]
When content is added it gets indexed: some non-alphanumeric characters are removed, each word is stemmed (i.e. educate, education, educator, educates, etc. all get indexed as the same word), some words are converted to others based on internal rules, and then they all get stored in an index.
When a search is done the system does the same as above to each keyword (removes unwanted characters, stemming, conversion based on internal rules).
The system then gets a list of content that has each of the parsed search keywords anywhere in any of those fields.
My code then parses each of the matching results: first it looks for all of the keywords occurring consecutively in one of the fields; if it doesn't find the whole search phrase, it then iteratively looks for smaller groups of keywords until found (i.e. if 4 search keywords are entered it tries all 4 first, then 3, then 2, then 1 if they aren't all found together).
Based on how many of the keywords were found consecutively, the system applies a score to the search result. Higher scores are given based on whether the keyword(s) were found in the title, content or keywords field [this took some fine tuning] and also on how close they were found to the start of the field.
The results are then given to the client based on this score.
The system works very well in our situation, particularly the grouped keywords part makes for good results.
You could use a similar system in your situation. A search for "ladder" would order a product like "Ladder - extra large" before "Snakes and Ladders Game".
For "computer console" you could add terms like these to a hidden keywords field.
Note that parsing the list for relevancy takes a bit of server resources so this type of system would only be suitable where you have sufficient infrastructure available or where the list of content is not large.
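For illustration, a rough Python sketch of the consecutive-keyword scoring described above. The field weights and the position bonus are invented numbers, and the real system also applies stemming and its internal conversion rules before this step:

```python
def consecutive_match_score(keywords, field_text, field_weight):
    """Score a field by the longest run of consecutive search keywords it
    contains: try all N keywords in a row first, then N-1, then N-2, ..."""
    words = field_text.lower().split()
    keywords = [k.lower() for k in keywords]
    for run in range(len(keywords), 0, -1):
        for start in range(len(keywords) - run + 1):
            group = keywords[start:start + run]
            for pos in range(len(words) - run + 1):
                if words[pos:pos + run] == group:
                    # Longer runs score higher; matches nearer the start
                    # of the field get a small extra bonus.
                    return field_weight * run + 1.0 / (pos + 1)
    return 0.0

def score_document(keywords, doc):
    # Hypothetical weights: title counts most, then keywords, then content.
    return (consecutive_match_score(keywords, doc["title"], 3.0)
            + consecutive_match_score(keywords, doc["keywords"], 2.0)
            + consecutive_match_score(keywords, doc["content"], 1.0))

doc = {"title": "Ladder - extra large", "keywords": "step ladder aluminium",
       "content": "A sturdy ladder for height access."}
print(score_document(["ladder"], doc))
```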

MySQL: Best way to do a backwards full text search?

I am trying to do basically a reverse full text search but have no clue of the best way to go about doing it.
Basically I have a table of key phrases laid out like this:
id - phrase
1 - "hello world"
2 - "goodbye world"
3 - "this is my world"
I then have a set string, such as "Welcome to the hello world group". I want to find the ID of all rows in my table that have an exact match for their phrase. Meaning "o the" would not match, because the actual words are "to the". Also "ello" would not match, because the word is "hello".
Using Full Text Search, this can easily be achieved by doing a search of:
AGAINST ('"hello world"' IN BOOLEAN MODE);
Problem is, I don't believe I can use a full text search, since a full text search would find all rows that contain a given phrase. I want all phrases (from a known set of phrases) that match a single string.
I know how to do this using RegEx with the following, however it is way too slow. On a table with 400,000 key phrases it took over 40 seconds:
WHERE "the data I know I want to search goes here" REGEXP CONCAT('[[:<:]]', phrases, '[[:>:]]')
What I need is a more optimized way to do this. How would I go about doing this as a full text search, even if I have to temporarily add it to a table, without actually doing a LOOP that individually checks each keyword?
I really appreciate the feedback as this is really causing my site to lag on adding new data.
If you are willing to consider a solution that reads the phrases out of the database and constructs a separate data structure used for optimized phrase detection, there are two main techniques that solve the problem. Which one is best for you depends on a number of factors, in particular:
How frequently the phrase list is updated
Whether and how you tokenise the text before running the phrase detection
How long the target strings are
Option 1: Hash table of the phrases. This means you simply insert each of the phrases as a key into a hash table (aka dictionary or hash map in many programming languages). The phrase id becomes the value. Updates are fast and easy, but detecting the phrases in a given string can be hard: firstly, you need to tokenise the string and be sure that phrases only occur between token boundaries. Secondly, you need to make a lookup in the hash not only for every token, but also for every pair, triple, quadruple etc. of consecutive tokens. This still works well if the target strings are generally short. You can also maintain a copy of the hash table on disk, e.g. using Berkeley DB. There are ready-to-use modules in the standard library of most programming languages for this.
Option 2: Search trie (or, slightly more advanced, a minimised search trie or a finite state machine). This can be implemented in very space-efficient ways but is generally larger than a hash table (although 400k entries will not be a problem at all). The big advantage during phrase detection is that you need not cut out tokens (or candidate phrases between token boundaries) before making look-ups. Instead you perform a longest-match look-up at each candidate start position in the text. Storing on disk is possible, although in most programming languages there won't be a standard-library module for this. Updates are quite easy in a trie, but can get difficult (and potentially time-consuming) in a minimised trie or FST.
Both options allow the data structure to be maintained on disk (or a copy of it to be stored on disk, while the actual look-ups happen in memory). But you won't get transaction safety or fault tolerance (which I understand you are not looking for).
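For illustration, a minimal Python sketch of option 1 (the hash table of phrases), assuming whitespace tokenisation and ignoring punctuation; the example data mirrors the table in the question:

```python
def build_phrase_index(rows):
    """Map each phrase's token tuple to its id; `rows` is an iterable of
    (id, phrase) pairs, e.g. read out of the phrases table."""
    return {tuple(phrase.lower().split()): pid for pid, phrase in rows}

def find_phrases(text, index):
    """Return the ids of all indexed phrases that occur in `text` on
    token boundaries, by looking up every run of consecutive tokens."""
    tokens = text.lower().split()
    longest = max(len(key) for key in index)
    hits = set()
    for start in range(len(tokens)):
        for length in range(1, longest + 1):
            key = tuple(tokens[start:start + length])
            if key in index:
                hits.add(index[key])
    return hits

index = build_phrase_index([(1, "hello world"), (2, "goodbye world"),
                            (3, "this is my world")])
print(find_phrases("Welcome to the hello world group", index))  # {1}
```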
You can use a search engine, for example Solr. You can set specific search filters against the text, search for words only, and it will be blindingly fast.
Or, a second idea: you can create your own table that stores every word together with the id of its phrase, and search that table matching words only. It will be faster because you can index single words more effectively than whole phrases.

Machine learning of word structure [closed]

I am working on a system that can create made-up fantasy words based on a variety of user input, such as syllable templates or a modified Backus-Naur Form. One new mode, though, is planned to be machine learning. Here, the user does not explicitly define any rules, but pastes some text, and the system learns the structure of the given words and creates similar words.
My current naïve approach would be to create a table of letter-neighborhood probabilities (including a special end-of-word "letter") and fill it by scanning the input in letter pairs (using whitespace and punctuation as word boundaries). Creating a word would then mean looking up the probabilities of every letter that can follow the current letter, randomly choosing one according to those probabilities, appending it, and repeating until end-of-word is encountered.
But I am looking for more sophisticated approaches that (probably?) provide better results. I do not know much about machine learning, so pointers to topics, techniques or algorithms are appreciated.
I think that for independent words (and especially names), a simple Markov chain system (which you seem to describe when talking about using letter pairs) can perform really well. Feed it a lexicon and throw it a seed to generate a new name based on what it learned. You may want to tweak the prefix length of the Markov chain to get nice-sounding results (as pointed out in a comment to your question, 2 letters are much better than one).
I once tried it with elvish and orcish name dictionaries and got very satisfying results.
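A minimal sketch of such a letter-level Markov chain in Python, using a prefix length (order) of 2 as suggested; the training names are just placeholders:

```python
import random
from collections import defaultdict

def train(names, order=2):
    """Build a transition table: the last `order` letters -> possible next letters."""
    table = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"  # ^ = start padding, $ = end of word
        for i in range(len(padded) - order):
            table[padded[i:i + order]].append(padded[i + order])
    return table

def generate(table, order=2, max_len=12):
    state, out = "^" * order, ""
    while len(out) < max_len:
        nxt = random.choice(table[state])
        if nxt == "$":          # reached a learned end-of-word
            break
        out += nxt
        state = (state + nxt)[-order:]
    return out.capitalize()

table = train(["legolas", "elrond", "galadriel", "thranduil", "celeborn"])
print(generate(table))
```

Feeding it a larger lexicon and raising the order makes the output sound closer to the source words, at the cost of copying longer verbatim fragments.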

Setting Up an Easily Searchable MySQL Database for Word Searches

I have appx. 2TB of text that I want to turn into a searchable database, where I will usually be searching to see if 2-4 word expressions exist in the database (for instance I might do a search to see if the phrase "these are four words", or "three consecutive words" appears anywhere in the text).
These searches will happen very often, so it is very important that I set up the database to use as little processing as possible. I'd also want to minimize the overhead as much as possible so I can lower the number of database servers I'll need.
Does anybody have any suggestions as to how I should set up this database?
For instance I was thinking of doing a linked list that was organized |id|word1|word2| (with all three being keys), so for the expression "these are four words", I'd first search "these are", then I'd search "are four", check to see if any matches for "these are" are 1 id lower than "are four", and then do the same thing for "four words". But I think there has to be a more efficient way of doing it.
EDIT: The ONLY thing I will be using this database for is doing these 2-4 word exact match searches, and it is meant for internal use. All I want this database to be able to do is let me know if a 2-4 word expression exists somewhere in all of my files of information, and nothing more.
Does anybody have any suggestions as to how I should set up this database?
Personally, I'd first rule out the possibility of using MySQL's full-text search and every open-source full-text search engine (there's a list of open-source search engines on Wikipedia). I'd also rule out using Google Custom Search. Heck, I'd even consider a commercial product before I'd try rolling my own.
At the very least, studying their code might give you some ideas about index structure.
If you're thinking of building a linked list in SQL, well, you might want to build a tiny test before you get too far into it. I don't think it will be practical, but I could be wrong.
It takes a lot of work to do full-text search really well. (Think about proximity searches: find "there are" within 3 words of "many ways to fail".) Reinventing this wheel might not be the best use of your time.
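In that spirit, here is a tiny test of the n-gram-table idea from the question, sketched in Python with SQLite standing in for MySQL. Every 2- to 4-word shingle is stored once, so a phrase search becomes a single indexed equality lookup; keep in mind the storage cost of doing this over 2 TB of text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shingles (ngram TEXT PRIMARY KEY)")

def index_document(text, conn, sizes=(2, 3, 4)):
    """Store every 2-, 3- and 4-word shingle of the document."""
    words = text.lower().split()
    for n in sizes:
        for i in range(len(words) - n + 1):
            conn.execute("INSERT OR IGNORE INTO shingles VALUES (?)",
                         (" ".join(words[i:i + n]),))

def phrase_exists(phrase, conn):
    """Exact 2-4 word phrase lookup via the primary-key index."""
    row = conn.execute("SELECT 1 FROM shingles WHERE ngram = ?",
                       (phrase.lower(),)).fetchone()
    return row is not None

index_document("these are four words in a longer sentence", conn)
print(phrase_exists("these are four words", conn))  # True
print(phrase_exists("are four word", conn))         # False
```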

How to correct the user input (Kind of google "did you mean?")

I have the following requirement: -
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make a kind of Google "Did you mean", which will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question.
My question: -
1) I think it is not advisable to store the data in an RDBMS, because then I won't be able to filter in the SQL queries and will have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types "Phranky", how do I match "Franky"? Do I have to loop through all the names?
I came across Levenshtein Distance, which will be a good technique to find the possible strings. But again, my question is do I have to operate on all 1 million values from my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using something like distance algorithms, because the former method would require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios: -
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
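For illustration, a simplified Soundex implementation in Python (it skips some corner cases of the full rules) that could be used to pre-generate the codes you then store and index:

```python
def soundex(name):
    """Simplified Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    if not name:
        return ""
    name = name.lower()
    digits, prev = "", codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits += code
        if ch not in "hw":        # h and w do not separate duplicate codes
            prev = code
    return (name[0].upper() + digits + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -> same code, so they match
```

Note that Soundex keys on the first letter, so "Phranky" and "Franky" still get different codes; Metaphone (mentioned in a later answer) handles that case better.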
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on the Levenshtein distance).
(Update: after having read Ben S's answer, I think using an existing solution, possibly aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
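A minimal sketch of that heuristic, assuming you log queries with timestamps and click/result information; difflib's similarity ratio stands in here for the small-Levenshtein-distance check, and the thresholds are invented:

```python
import difflib

def is_refinement(first_query, second_query, seconds_apart,
                  first_was_useful, second_was_useful,
                  max_gap=60, min_similarity=0.75):
    """Two similar queries in a short window, where the first yielded nothing
    useful and the second did, are treated as a correction pair."""
    similar = difflib.SequenceMatcher(None, first_query.lower(),
                                      second_query.lower()).ratio() >= min_similarity
    return (seconds_apart <= max_gap
            and not first_was_useful
            and second_was_useful
            and similar)

print(is_refinement("someting", "something", 8, False, True))  # True
```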
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR '96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, substituting a q for an e is more likely than substituting an l for an e.
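A condensed sketch of that approach, closely paraphrasing Norvig's published Python code; corpus.txt is a placeholder for whatever text collection you count words from:

```python
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
TOTAL = sum(WORDS.values())

def P(word):
    """Estimated probability of `word` in the corpus."""
    return WORDS[word] / TOTAL

def edits1(word):
    """All strings one simple edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable known word within edit distance 2 of `word`."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=P)
```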
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all they have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false-positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, however if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(n log n), and an O(n^2) algorithm, perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
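A minimal Python sketch of that layered sieving, assuming we are handling typos rather than phonetic variants (the first-letter cull would of course drop pairs like Phranky/Franky, which is the other scenario above); the sieves and thresholds are illustrative:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (the expensive step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(query, names, max_distance=2):
    """Apply cheap sieves first and only run the edit-distance
    calculation on whatever survives them."""
    q = query.lower()
    pool = (n for n in names if n and n.lower()[0] == q[0])           # O(n) sieve
    pool = [n for n in pool if abs(len(n) - len(q)) <= max_distance]  # O(n) sieve
    scored = [(levenshtein(q, n.lower()), n) for n in pool]           # expensive step
    return [n for d, n in sorted(scored) if d <= max_distance]

print(suggest("Frankey", ["Franky", "Frank", "Frannie", "Bob"]))  # ['Franky', 'Frank']
```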