Efficient implementation for performing word count - mysql

I'd like to ask you for some advice on my research for my thesis.
I am building an application where I'll have 1000 articles of 200-300 words each, plus a "word frequency list" of 30.000 words, each one rated according to its usage across an English corpus, e.g. "of" - 20168 times, "the" - 6464684 times, "acquaintance" - 15 times and so forth.
Now I want to query the database with lists of words and I want an article returned that contains most of these words, the most times.
E.g.: my list: different, contemporary, persistency.
Article 1 contains contemporary 1x
article 2 contains contemporary 3x
So the returned article would be no 2.
Questions
Should I create any relations among words and articles in the database? I mean, for a thousand articles of 300 words each (well, not unique) that would be quite a list. Or would an index suffice?
MySQL vs Oracle? With MySQL I'd use Solr to index; I know that Oracle has a tool for indexing, but nothing more about it.
Is Oracle with such functionality available for free? And is it easy to handle? I've never worked with it, but if the setup is easy, I would go for it.
Thank you very much!

I would recommend using Hadoop to perform a WordCount operation. This will scale later (you're a researcher!) and it is efficient. Moreover, creating relations among words and articles in the database doesn't look like a neat solution.
If you choose Hadoop, it will provide the functionality of MapReduce. It works like this:
The input text files are divided amongst multiple physical machines.
Each machine performs the word count algorithm on its share.
Results are collected from all machines and then combined to give the final output.
You don't have to worry about implementing this functionality yourself; here is a tutorial.
A WordCount job can also run locally on one machine.
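To make the three steps above concrete, here is a minimal sketch of the map / shuffle / reduce flow in plain Python. It runs on one machine and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample articles are made up for illustration.

```python
from collections import defaultdict
import re


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in re.findall(r"[a-z']+", document.lower())]


def shuffle(pairs):
    """Group the emitted counts by word, as the framework would between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    pairs = []
    for document in documents:  # on a cluster, each machine would map its own share
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample data
articles = ["Contemporary art is different.",
            "Contemporary, contemporary, contemporary!"]
counts = word_count(articles)
print(counts["contemporary"])  # 4
```

For the ranking in the question you would keep the counts per article rather than summing over all of them, but the map and reduce steps stay the same.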

Related

Which Database engine for large dataset

I'm working on an analysis assignment; we got a partial data-set from the university library containing almost 300.000.000 rows.
Each row contains:
ID
Date
Owner
Deadline
Checkout_date
Checkin_date
I put all this inside a MySQL table and started querying it for my analysis assignment; however, simple queries (SELECT * FROM table WHERE ID = something) were taking 9-10 minutes to complete. So I created an index for all the columns, which made it noticeably faster: ~30 sec.
So I started reading similar issues, and people recommended switching to a "Wide column store" or "Search engine" instead of "Relational".
So my question is, what would be the best database engine to use for this data?
Using a search engine to search is IMO the best option.
Elasticsearch of course!
Disclaimer: I work at elastic. :)
The answer is, of course, "it depends". In your example, you're counting the number of records in the database with a given ID. I find it hard to believe that it would take 30 seconds in MySQL, unless you're on some sluggish laptop.
MySQL has powered an incredible number of systems because it is full-featured, stable, and has pretty good performance. It's bad (or has been bad) at some things, like text search, clustering, etc.
Systems like Elasticsearch are good with globs of text, but still may not be a good fit for your system, depending on usage. From your schema, you have one text field ("owner"), and you wouldn't need Elasticsearch's text searching capabilities on a field like that (who ever needed to stem a user name?). Elasticsearch is also used widely for log files, which also don't need a text engine. It is, however, good with blocks of text and with clustering.
If this is a class assignment, I'd stick with MySQL.

Efficient way to find 200,000 product names in 20 million articles?

We have two (MySQL) databases, one with about 200.000 products (like "Samsung Galaxy S4", db-size 200 MB) and one with about 10 million articles (plain text, db-size 20GB) which can contain zero, one or many of the product names from the product database. Now we want to find product names in the article texts and store them as facets of the articles while indexing them in Elasticsearch. Using regular expressions to find the products is pretty slow. We looked at Apache OpenNLP and Stanford Named Entity Recognizer; for both we have to train our own models. There are some projects on GitHub for integrating those NER tools into Elasticsearch, but they don't seem to be ready for production use.
Products and articles are added every day, so we have to run a complete recognition every day. Is NER the way to go? Or any other ideas? We don't have to understand the grammar etc. of the text, we only have to find the product name strings as fast as possible. We can't do the calculation in real time because that's way too slow, so we have to pre-calculate the connection between articles and products and store them as facets, so we can query them pretty fast in our application.
So what's your recommendation to find so many product names in so many articles?
One of the issues you'll run into the most is consistency... new articles and new product names are always coming in and you'll have an "eventual consistency" problem. So there are three approaches that come to mind that I have used to tackle this kind of problem.
1. As suggested, use a full text search in MySQL: basically create a loop over your products table, and for each product name do a MATCH AGAINST query and insert the product key and article key into a tie table. This is fast; I used to run a system in SQL Server with over 90000 items being searched against 1B sentences. If you had a multithreaded Java program that chunked up the categories and executed the full text query, you may be surprised how fast this will be. Also, this can hammer your DB server.
2. Use regex. Put all the products in a collection in memory, and regex find with that list against every document. This CAN be fast if you have your docs in something like Hadoop, where it can be parallelized. You could run the job at night, and have it populate a MySQL table... This approach means you will have to start storing your docs in HDFS or some NoSQL solution, or import from MySQL to Hadoop daily etc.
3. You can try doing it "at index time", so when a record is indexed in Elasticsearch the extraction will happen then and your facets will be built. I have only used Solr for stuff like this... The problem here is that when you add new products you will have to process in batch again anyway, because the previously indexed docs will not have had the new products extracted from them.
So there may be better options, but the one that scales infinitely (if you can afford the machines) is option 2... the Hadoop job... but this means a big change.
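As a rough illustration of the in-memory regex approach, the sketch below compiles all product names into one alternation and runs it over a document. The helper names and sample data are hypothetical, and with 200.000 names Python's `re` module may well be too slow without the parallelization mentioned above, so treat this as a starting point only.

```python
import re


def build_product_pattern(product_names):
    # Longest names first so "Samsung Galaxy S4 Mini" wins over "Samsung Galaxy S4".
    ordered = sorted(product_names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_products(pattern, text):
    # Returns the distinct matches as they are spelled in the text.
    return sorted({match.group(0) for match in pattern.finditer(text)})


# Hypothetical sample data
products = ["Samsung Galaxy S4", "Samsung Galaxy S4 Mini", "iPhone 5"]
pattern = build_product_pattern(products)
article = "The Samsung Galaxy S4 Mini is smaller than the iPhone 5."
print(find_products(pattern, article))
```

The pattern has to be rebuilt whenever the product table changes, which is the same batch-reprocessing cost described for the other options.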
These are just my thoughts, so I hope others come up with more clever ideas
EDIT:
As for using NER, I have used NER extensively, mainly OpenNLP, and the problem with this is that what it extracts will not be normalized, or to put it another way, it may extract pieces and parts of a product name, and you will be left dealing with things like fuzzy string matching to align the NER Results to the table of products. OpenNLP 1.6 trunk has a component called the EntityLinker, which is designed for this type of thing (linking NER results to authoritative databases). Also, NER/NLP will not solve the consistency problem, because every time you change your NER model, you will have to reprocess.
I'd suggest a preprocessing step: tokenization. If you do so for the product list and for the incoming articles, then you won't need a per-product search: the product list becomes an automaton where each transition is a given token.
That gives us a trie that you'll use to match products against texts. Searching will look like:
    products = []
    availableNodes = [dictionary.root]
    for token in text:
        # a product name may start at any token, so the root is always a candidate
        nextAvailableNodes = [dictionary.root]
        for node in availableNodes:
            childNode = node.getChildren(token)
            if childNode:
                # record the match as soon as a complete product name is reached
                if childNode.productName:
                    products.append(childNode.productName)
                nextAvailableNodes.append(childNode)
        availableNodes = nextAvailableNodes
As far as I can tell, this algorithm is quite efficient and it allows you to fine-tune the node.getChildren() function (e.g. to address capitalization or diacritics issues). Loading the product list as a trie may take some time; in that case you could cache it as a binary file.
This simple method can easily be distributed using Hadoop or another MapReduce approach, either over texts or over the product list; see for instance this article (but you'll probably need more recent / accurate ones).
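A self-contained version of the token trie could look like the sketch below. The `TrieNode` class, the `tokenize` normalization (lowercasing, whitespace split, punctuation stripping) and the sample product names are assumptions made for the example, not part of the original description.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # token -> TrieNode
        self.product_name = None  # set on the node that ends a product name


def tokenize(text):
    # Assumed normalization: lowercase, split on whitespace, strip punctuation.
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t]


def build_trie(product_names):
    root = TrieNode()
    for name in product_names:
        node = root
        for token in tokenize(name):
            node = node.children.setdefault(token, TrieNode())
        node.product_name = name
    return root


def find_products(root, text):
    found = []
    active = [root]
    for token in tokenize(text):
        next_active = [root]  # a product name may start at any token
        for node in active:
            child = node.children.get(token)
            if child is not None:
                if child.product_name:
                    found.append(child.product_name)
                next_active.append(child)
        active = next_active
    return found


# Hypothetical sample data
root = build_trie(["Samsung Galaxy S4", "iPhone 5"])
print(find_products(root, "I swapped my Samsung Galaxy S4 for an iPhone 5."))
```

Each article token is looked at once per active node, so the cost depends on the article length and the number of partially matched names, not on the size of the product list.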

Searching in 1000 articles containing 300.000 words altogether

I am building a database and I'm not sure if I need any special indexing tool, or just mysql index would suffice.
In my DB I will have about 1000 articles, each containing about 300 words. I will need to search for articles that contain most of the words from my query (e.g.: "walk, walked, school, studying" - I want to find articles that contain these words most times).
The articles will be HTML.
The application will be used by a few people (10) at a time = no extra requirements for superfast response, I just want it returned in reasonable time, like 1 sec.
So, do I need any extra tool for indexing (Apache Lucene/SOLR) or will mysql index do?
I can't say I'm a MySQL expert, as I deal more with T-SQL. However, I'd say that just searching through the articles may take a while if they also include HTML, as you have to take into account the tags, which may or may not be malformed depending on how the HTML is saved.
Personally, in the article table I'd have an extra column which would contain either the plain text version of the article, or some sort of result of a weighted algorithm which puts in the most common 30 words of the article, so that you have a much neater and more streamlined search field to use.
But for 1000 articles this seems very much overkill, and MySQL should do just fine if all you're after is a < 1s response time.
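The plain-text idea can be sketched in Python with the standard library's `html.parser`: strip the tags once, then count how often the query words occur in each article. The function names and the two sample articles are hypothetical, and note that this simple extractor also keeps the contents of script and style elements.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def score(html, query_words):
    # Total number of occurrences of the query words in the article text.
    words = Counter(re.findall(r"[a-z']+", html_to_text(html).lower()))
    return sum(words[w] for w in query_words)


def best_article(articles, query_words):
    return max(articles, key=lambda article_id: score(articles[article_id], query_words))


# Hypothetical sample data
articles = {
    1: "<p>I <b>walked</b> to school.</p>",
    2: "<p>She was studying; he walked and walked to school.</p>",
}
query = ["walk", "walked", "school", "studying"]
print(best_article(articles, query))  # 2
```

At 1000 articles of about 300 words, doing this in application code on every query is likely to stay well under a second; storing the stripped text in the extra column just avoids repeating the HTML parsing.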

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000 and 10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
Answering this question in all its aspects calls for a very broad answer. There may well be specifics that make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example for several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly come down to missing recovery options, lack of ACID transactions, possible complications etc. If you decide to use Solr, you need to populate your index from another source like an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations, which would otherwise require complex joins to be constructed.
Solr/Lucene provides mind-blowing text-analysis / stemming / full text search scoring / fuzziness functions. Things you just cannot do with MySQL. In fact, full text search in MySQL was limited to MyISAM (before MySQL 5.6 added it to InnoDB), and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy etc. is very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process them. Well, you can of course index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations and do it both ways to get n:n, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example for some product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors or something). But you reach the end of possibilities with HAS / HAS NOT. You can almost not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40000 hits, you can display that you would only get 3 hits if you refined your search to the combination of this field having this value and that field having that value. Stuff that needs additional queries in MySQL is done efficiently and conveniently.
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind-blowingly fast because of its inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching, you should definitely go for Solr/Lucene unless you have good reasons to do otherwise, like very complex, multi-dimensional range filter queries.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
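To show why the index structure matters for speed, here is a toy inverted index in Python. It is only meant to illustrate the idea (term to documents lookup instead of scanning every document); it is not how Lucene is implemented, and the function names and sample documents are made up.

```python
from collections import defaultdict
import re


def build_inverted_index(documents):
    """Map each term to {doc_id: occurrences} so a lookup never scans the documents."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by the total number of query-term occurrences."""
    scores = defaultdict(int)
    for term in re.findall(r"\w+", query.lower()):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sample data
docs = {
    1: "Solr is a search server",
    2: "MySQL is a relational database",
    3: "Search, search and search again",
}
index = build_inverted_index(docs)
print(search(index, "search server"))  # [3, 1]
```

A query only touches the posting lists of its own terms, which is why lookups stay fast as the number of documents grows. Real engines add analysis (stemming, stop words) and much better scoring on top of this.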
Use Solr if:
You do not want to stress your database.
You want real full text search.
You need lightning fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I have previously worked with one of India's best online real estate classifieds portals, which was using Solr for search (and was previously using MySQL). The migration reduced the search times manifold.
Solr can be easily integrated with MySQL:
Solr Full Dataimport can be used for importing data from MySQL tables into Solr collections.
Solr Delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.

How to correct the user input (Kind of google "did you mean?")

I have the following requirement: -
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make kind of Google "Did you mean". This will list all the possible values from my datastore. There is a similar but not same question here. This did not answer my question.
My question: -
1) I think it is not advisable to store this data in an RDBMS, because then I can't filter in the SQL queries and I have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types Phranky, how do I match Franky? Do I have to loop through all the names?
I came across Levenshtein distance, which would be a good technique to find the possible strings. But again, my question is: do I have to operate on all 1 million values from my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say distance algorithms. Because the former method will require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios: -
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2, which is high for a four-letter name and unlikely to be a simple typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
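A minimal sketch of that idea in Python: compute the (American) Soundex code for every name once, and keep a lookup from code to names. In a database, the code would go into its own indexed column. The `soundex` and `build_index` helpers and the sample names are written for this example.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    previous = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != previous:
            encoded.append(code)
        previous = code  # a vowel resets this, so the same code may repeat after it
    return (first.upper() + "".join(encoded) + "000")[:4]


def build_index(names):
    """Pre-compute the code for every name once; lookups are then a dict access."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


# Hypothetical sample data
index = build_index(["Sean", "Shawn", "Franky", "Robert", "Rupert"])
print(index[soundex("Shaun")])    # ['Sean', 'Shawn']
print(index[soundex("Phranky")])  # []
```

The second lookup shows a limitation: Soundex keeps the first letter, so "Phranky" (P652) does not land in the same bucket as "Franky" (F652).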
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use that to calculate probable matches. (It's based on the Levenshtein distance.)
(Update: after having read Ben S's answer, using an existing solution, possibly aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names and selecting an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, transposing an e and a q is more likely than transposing an e and an l.
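The approach described in the last three paragraphs can be condensed into the sketch below. It follows the structure of Norvig's published corrector (which also generates transpositions), but it is a simplified rendition: the tiny corpus is made up, and a real dictionary would be trained on a large text collection as described above.

```python
from collections import Counter
import re

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(corpus_text):
    """Count word occurrences; the counts stand in for word probabilities."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words, counts):
    return {w for w in words if w in counts}


def correct(word, counts):
    # Prefer the word itself, then distance 1, then distance 2, else give up.
    candidates = (known([word], counts)
                  or known(edits1(word), counts)
                  or known({e2 for e1 in edits1(word) for e2 in edits1(e1)}, counts)
                  or {word})
    return max(candidates, key=lambda w: counts[w])


# Hypothetical toy corpus
counts = train("something is wrong something else is right sometimes something works")
print(correct("someting", counts))  # something
```

Keyboard distance or context could be folded in by replacing the `key` function in `correct` with a weighted score.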
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, two names have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false positive rate) can take down the complexity a lot (in the practical case, to something like O(30^2 * k^2), where k is << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2, which is high for a four-letter name and unlikely to be a simple typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith; however, if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word, and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important learning - perform as many indexing/sieve layers as you need and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform edit distance calculation on the smallest possible dataset as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
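Here is a small Python sketch of that layering: two cheap sieves (first letter, then length difference) run before the quadratic Levenshtein computation. The `suggest` function, its `max_distance` threshold and the sample names are assumptions for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def suggest(query, names, max_distance=2):
    query = query.lower()
    if not query:
        return []
    # Cheap sieves first: same first letter, then similar length.
    candidates = [n for n in names if n and n[0].lower() == query[0]]
    candidates = [n for n in candidates if abs(len(n) - len(query)) <= max_distance]
    # The expensive edit distance only runs on what survives.
    scored = [(levenshtein(query, n.lower()), n) for n in candidates]
    return [n for distance, n in sorted(scored) if distance <= max_distance]


# Hypothetical sample data
names = ["Smith", "Smyth", "Jones", "Sith", "Samantha"]
print(suggest("Smiht", names))  # ['Smith']
print(levenshtein("sean", "shawn"))  # 2
```

The first-letter sieve is the kind of trade-off to decide per dataset: it is very cheap, but it would also discard "Franky" for the query "Phranky", so a phonetic key may be a better first layer for names.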