Best solution to count occurrences of words in a database - MySQL

I'm going to scrape a forum's new-threads page for each word appearing in the titles of the threads, to build a sort of popularity trend (like Google Trends). I've found a way to scrape it, but I don't know how I should store the words in the database for optimal performance. I thought of two different ways:
1) Store each new word in its own row, and if the word already exists, add one to its "occurrences" field.
2) Store each word in its own row, no matter what.
Are there any other solutions to this problem?

If you are going through the trouble of scraping, you should keep multiple levels of information.
First, keep track of each forum title that you encounter, along with the date of the posting (and of your finding it) as well as other information. You can put a full-text index on the forum title, which will give you nice capabilities for finding similar versions of the same word ("database" and "databases").
Second, store each word separately in a table along with the date and time of the posting (or of your finding it) and a link back to the postings table. The value of Google Trends is not that it keeps an all-time gross count of each word; it is that you can break the counts down over time.
Then, do the aggregation in a query. If you have performance issues, you can partition the data by date, so most queries will only read a subset of the data. If the summaries are heavily used, you can consider summarizing on a batch basis, say once per night.
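A minimal sketch of that layout and the aggregation query, in Python with mysql-connector-python; the table and column names (postings, title_words) and connection details are illustrative assumptions, not a prescribed schema:

```python
# Two-level layout: full postings (with a FULLTEXT index on the title)
# plus one row per word, linked back to the posting. Names are illustrative.
import mysql.connector  # assumes mysql-connector-python and a running MySQL

conn = mysql.connector.connect(user="app", password="secret", database="trends")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS postings (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        posted_at DATETIME NOT NULL,
        scraped_at DATETIME NOT NULL,
        FULLTEXT KEY ft_title (title)  -- helps match "database" vs. "databases"
    )""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS title_words (
        word VARCHAR(64) NOT NULL,
        posted_at DATETIME NOT NULL,
        posting_id INT NOT NULL,
        KEY idx_word_date (word, posted_at),
        FOREIGN KEY (posting_id) REFERENCES postings (id)
    )""")

# The aggregation happens at query time: occurrences per word per day.
cur.execute("""
    SELECT word, DATE(posted_at) AS day, COUNT(*) AS occurrences
    FROM title_words
    GROUP BY word, DATE(posted_at)
    ORDER BY day, occurrences DESC
""")
for word, day, occurrences in cur:
    print(word, day, occurrences)
```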
Finally, how are you going to deal with different versions of a word appearing over time? With misspellings? With multiple appearances of the same word in one title?

Idea #1 is the most compact, and should generally be the fastest. Check out INSERT ... ON DUPLICATE KEY UPDATE, using a unique key on the word and the date.
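As a rough sketch of idea #1 (assuming a hypothetical word_counts table and mysql-connector-python; the upsert itself is the point here):

```python
# Idea #1 as a single upsert per word: insert a new (word, day) row, or bump
# the counter if the unique key already exists. Names are illustrative.
import datetime
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="trends")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS word_counts (
        word VARCHAR(64) NOT NULL,
        day DATE NOT NULL,
        occurrences INT NOT NULL DEFAULT 1,
        UNIQUE KEY uk_word_day (word, day)  -- the key the upsert relies on
    )""")

def count_word(word, day=None):
    day = day or datetime.date.today()
    cur.execute("""
        INSERT INTO word_counts (word, day) VALUES (%s, %s)
        ON DUPLICATE KEY UPDATE occurrences = occurrences + 1
    """, (word, day))
    conn.commit()

for w in "how to count words in mysql".split():
    count_word(w)
```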
Idea #2 becomes important if you're storing data other than just the word, like the ID of the forum thread, etc.
Good luck.

Related

Creating more relevant results from LDA topic modeling?

I am doing a project for my degree and I have an actual client from another college. They want me to do all this stuff with topic modeling on an SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.
What he wants right now is for me to generate 10 or more topics and record the 10 most common words overall from the LDA results; then, if a word is very frequent in every topic, remove it from the resulting word cloud, and if its frequency varies, remove it only from the topics where it is infrequent and keep it in the more relevant topics.
He also wants me to compare the frequency of each topic against the SQL files of other years, and he wants the computer to generate a sensible name for each topic.
I have topic models per year and overall, but of course the topics do not appear exactly the same way in each year. My biggest concern is the first thing he wants, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want; I am probably searching with the wrong terms.
Thank you!
Show some of the code you use so we can give you more useful tips. Also use the nlp tag; the tags you used are quite specific and not followed by many people, so your question might be hard for the relevant users to find.
By the whole word-removal thing, do you mean stop words too? Or did you already remove those? Stop words are very common words ("the", "it", "me", etc.) which often appear high in most-frequent-word lists but do not really carry any meaning for finding topics.
First, remove the stop words to make the most-common-words list more useful.
Then, as he requested, look at which of the (more common) words are common in ALL the topics (in the case of abstracts I can imagine this is stuff like "hypothesis", "research", "paper", "results", etc.: words that are abstract-specific but not useful for distinguishing topics across different abstracts) and remove those. For this kind of analysis, as well as the initial LDA, it probably makes sense to use the data from all years so the model has a large amount of data in which to recognize patterns. But you should try the variations and see whether the per-year or overall versions get you nicer results.
After you have your global word lists per topic, go back to the original data (split up by year) and count how often the combined words from each topic occur per year. If you view this over the years, you can probably see trends, such as topics that are popular now or in the last few years but weren't relevant if you go back far enough.
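A rough sketch of those first steps with Gensim and NLTK (the preprocessing, parameters, and sample data are illustrative; run nltk.download("stopwords") once beforehand):

```python
# Remove stop words, fit an LDA model, then collect each topic's top words
# so that words common to ALL topics can be spotted and removed.
from gensim import corpora, models
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
abstracts = ["We present the results of a study on ...",
             "This paper proposes a method for ..."]  # your abstracts here

texts = [[w for w in doc.lower().split() if w.isalpha() and w not in stop]
         for doc in abstracts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)

# Top 10 words per topic; anything appearing in every topic's list is a
# candidate for removal from the word cloud, as described above.
top_words = [{w for w, _ in lda.show_topic(t, topn=10)}
             for t in range(lda.num_topics)]
print(set.intersection(*top_words))
```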
The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.
The "easy" way would be e.g. just use the most frequent word in each topic as label but the results will probably be underwhelming.
A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.

MySQL fulltext search with a minimum number of matches per record

I've been puzzling over this for a little while and I'm looking for feedback as to the best way to implement this.
Essentially I want to be able to search a database of documents for a keyword; for any record that matches the keyword, I need to be able to specify how many times that keyword must occur.
The solution I'm about to start work on uses a regex within the query to accomplish this; however, from trawling Stack Overflow I can see that this is likely to be slow, and the API needs to do this very quickly, since we're talking thousands of requests per minute.
Is there a faster way than using regex, or shall I pull out the VISA card and invest in some high-end hardware?
To be clear, I'm not tied to MySQL; I need to search a LOT of documents and match only those in which the keyword occurs at least X times.
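One approach worth benchmarking before buying hardware: let a FULLTEXT index narrow the candidate rows cheaply, then count occurrences with a string-length trick instead of a regex. A sketch, assuming a hypothetical documents(id, body) table with a FULLTEXT index on body; note the count is substring-based, so "cat" would also match inside "category" unless you post-filter in code:

```python
# MATCH ... AGAINST uses the full-text index to pre-filter; the
# CHAR_LENGTH/REPLACE expression then counts keyword occurrences per row.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="docs")
cur = conn.cursor()

def docs_with_min_occurrences(keyword, min_count):
    cur.execute("""
        SELECT id
        FROM documents
        WHERE MATCH (body) AGAINST (%s)  -- coarse filtering via the index
          AND (CHAR_LENGTH(body)
               - CHAR_LENGTH(REPLACE(LOWER(body), LOWER(%s), '')))
              / CHAR_LENGTH(%s) >= %s
    """, (keyword, keyword, keyword, min_count))
    return [row[0] for row in cur.fetchall()]

print(docs_with_min_occurrences("database", 3))
```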

Database Design: How should I store 'word difficulty' in MySQL?

I made a vocabulary app for Android that has a list of ~5000 words stored in a local database (SQLite), and I want to find out which words are more difficult than others.
To find out, I'm thinking of adding a very simple feature that puts two random words on the screen and asks the user to choose the more difficult one. Then another pair of random words is shown, and this process can be repeated for as long as the user wants. The more users who participate in this 'more difficult word' game, the better the app should, in theory, be able to distinguish difficult words from easy words.
Since the difficulty would be based on input from all users, I know I need to keep track of it online so that every copy of the app can fetch it from the database on my website (which is MySQL). I'm not sure what would be the most efficient way to keep track of the difficulty, but I came up with two possible solutions:
1) Add a difficulty column that holds integer values to the words table. Then for every pair of words that a user looks at and ranks, the word that he/she chooses as more difficult would have its difficulty increased by one, and the word not chosen would have its difficulty decreased by one. I could simply order by that integer value to get the most difficult ones.
2) Create a difficulty table with two columns, more and less, that hold words (or IDs of the words, to save space) based on the results of each selection a user makes. I'm still unsure how I would get the most difficult words; some combination of GROUP BY and ORDER BY?
The benefit of my second solution is that I can know how many times each word has been seen (the number of rows in the more column that contain the word plus the number of rows in the less column that contain it). That helps with statistics, like if I wanted to find out which word has the highest ratio of more / less. But it would also take up much more space than my first suggested solution, and I don't know how well it would scale.
Which do you think is the better solution, or what other ones should I consider?
Did you try Sphinx for this? I guess a full-text search engine like Sphinx would solve this with great performance.
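For option 2, the "combination of GROUP BY and ORDER BY" could look like this sketch (assuming a hypothetical comparisons table with more_id and less_id columns); it also yields the times-seen count and the more/less ratio mentioned above:

```python
# Derive difficulty from the pairwise table at query time: each comparison
# contributes one "win" row and one "loss" row per word, then we aggregate.
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="vocab")
cur = conn.cursor()
cur.execute("""
    SELECT word_id,
           SUM(won)            AS times_chosen_more_difficult,
           COUNT(*)            AS times_seen,
           SUM(won) / COUNT(*) AS difficulty_ratio
    FROM (
        SELECT more_id AS word_id, 1 AS won FROM comparisons
        UNION ALL
        SELECT less_id AS word_id, 0 AS won FROM comparisons
    ) AS appearances
    GROUP BY word_id
    ORDER BY difficulty_ratio DESC
""")
for word_id, wins, seen, ratio in cur:
    print(word_id, wins, seen, ratio)
```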

Information Architecture & Retrieval - Prioritizing queries

I've got an application that displays data to the user based on a relevancy score. There are 5 to 7 different types of information that I can display (e.g. Users Tags, Friends Tags, Recommended Tags, Popular Tags, etc.). Each information type would be a separate SQL query.
Then I have an algorithm that ranks each type by how relevant it is. The algorithm is based on several factors, including how long it's been since an action was taken on a particular type, how important one information type is relative to another, how often one type has been shown, etc.
Once they are ranked, I show them to the users in a feed, similar to Facebook.
My question is a simple one: I need the data before I can run it through the ranking algorithm, so what's the most efficient way to pull only the data I need from the database?
Currently I pull the top 5 instances of each information type, and then rank those. Each piece of data gets a relevancy score, and if I don't have enough results that reach a certain relevancy threshold, I go back to the database for the next 5 of each.
The problem with this approach is that I risk pulling too many items of one story type that I never use, and I have to keep going back to the database if I don't get what I need the first time.
I have thought about a massive SQL query that incorporates all info types and the algorithm, which could work, but that would really be a huge query, and then I'm having MySQL do a lot of processing; I'm of the general mindset that MySQL should do the data retrieval and my programming language (PHP) should do the processing.
There has to be a better way! I'm sure there is a scholarly article somewhere, but I haven't been able to find it.
Thanks Stack Overflow
I'm assuming that by information type you mean Users Tags, Friends Tags, etc. Rather than fetching your data again and again against a particular fixed threshold, I would recommend changing your algorithm a little: try assigning weights to each information type. That way, even if you only get a few records of a low-priority type, you don't have to fetch it again.
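A rough sketch of that weighting idea (the names, weights, and fetch_top placeholder are all made up for illustration): fetch the top few rows of each type once, score them with a per-type weight, and rank the combined pool instead of re-querying against a threshold.

```python
# Weighted ranking in application code: one fetch per information type,
# then sort the combined pool by base relevancy scaled by the type weight.
WEIGHTS = {"user_tags": 1.0, "friend_tags": 0.8,
           "recommended": 0.6, "popular": 0.4}

def fetch_top(info_type, n):
    # Placeholder for the per-type SQL query ("top n instances of each type").
    return [{"type": info_type, "base_score": s} for s in range(n, 0, -1)]

pool = [item for t in WEIGHTS for item in fetch_top(t, 5)]
for item in pool:
    item["score"] = item["base_score"] * WEIGHTS[item["type"]]

feed = sorted(pool, key=lambda i: i["score"], reverse=True)
print(feed[:5])
```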

How did Twitter improve their search by using Lucene?

Recently, a Twitter engineer posted a very impressive blog post about using Lucene instead of MySQL for their search architecture.
So I'm curious: why did they choose Lucene, and why did MySQL not meet their requirements? More generally, what is the performance (or rather, scalability) bottleneck for the DBMS here?
Any ideas are appreciated!
Thanks in advance
Vance
Think about a Lucene index as something like the index you have in the back of some large reference books: for every important term that appears in the book, it lists all the pages in which it appears. So if you want to find all the places in the book where a term appears, you go to the index and get a list of pages.
What Lucene does is take documents, break them into their individual words (a process called "tokenization"), and then, for each word/token, record in its index that the word appears in that document.
Think of the index like a hashtable (it's not really one, but it's the same idea): the keys are the words/tokens, and for each key there's a bucket with a list of references to documents (URIs, filenames) that contain that word. It doesn't store the document itself, just a reference to it. When you do a search in Lucene, you provide a keyword and get back the list of documents in its index that contain that keyword.
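The idea in a few lines of Python (a toy inverted index for illustration only; Lucene's real data structures are far more sophisticated):

```python
# Toy inverted index: token -> set of ids of documents containing it.
from collections import defaultdict

docs = {1: "mysql full text search",
        2: "lucene builds an inverted index for search"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():  # "tokenization"
        index[token].add(doc_id)        # store a reference, not the document

print(index["search"])  # {1, 2}: every document containing the keyword
```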
MySQL and other RDBMSs are optimized for storing and retrieving records: collections of pre-defined, ordered columns. When you place an index on a column, it treats the entire content of the column as a single unit. If that column is a piece of text, it does not break it down into words.
MySQL is an RDBMS, and a quite robust and fast one. It does support full-text searching, but its full-text search is not very good or efficient.
Lucene is a full-text search engine. Full-text search engines are built to search inside documents, texts, etc., so they are able to search loads of tweets efficiently.
MySQL is good when it comes to querying columns, especially with discrete search values in those columns. LIKE queries would definitely take a performance hit.
You can find a lot of information about full-text searching on the internet.