Database Design: How should I store 'word difficulty' in MySQL?

I made a vocabulary app for Android that has a list of ~5000 words stored in a local database (SQLite), and I want to find out which words are more difficult than others.
To find out, I'm thinking of adding a very simple feature that puts two random words on the screen and asks the user to choose the more difficult one. Then another pair of random words is shown, and the process can be repeated for as long as the user wants. The more users who participate in this 'more difficult word' game, the better the app should, in theory, be able to distinguish difficult words from easy ones.
Since the difficulty would be based on input from all users, I know I need to keep track of it online, so that every copy of the app can fetch the values from the database on my website (which runs MySQL). I'm not sure what the most efficient way to keep track of the difficulty would be, but I came up with two possible solutions:
1) Add a difficulty column that holds integer values to the words table. Then for every pair of words that a user looks at and ranks, the word that he/she chooses as more difficult would have its difficulty increased by one, and the word not chosen would have its difficulty decreased by one. I could simply order by that integer value to get the most difficult ones.
2) Create a difficulty table with two columns, more and less, that hold words (or the words' IDs, to save space) based on the result of each selection a user makes. I'm still unsure how I would get the most difficult words - some combination of GROUP BY and ORDER BY? (Both options are sketched below.)
The benefit of my second solution is that I can know how many times each word has been seen (the number of rows where the word appears in the more column plus the number where it appears in the less column). That helps with statistics, like finding which word has the highest more/less ratio. But it would also take up much more space than my first solution, and I don't know how it would scale.
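For what it's worth, both options are easy to sketch in plain MySQL. This is only a sketch: the words and comparisons tables, their columns, and the @winner_id/@loser_id variables are all placeholder names, with the app supplying the real IDs of the pair the user just ranked.

-- Option 1: a difficulty counter on the words table.
ALTER TABLE words ADD COLUMN difficulty INT NOT NULL DEFAULT 0;

SET @winner_id = 42, @loser_id = 17;  -- the pair the user just ranked

UPDATE words SET difficulty = difficulty + 1 WHERE id = @winner_id;
UPDATE words SET difficulty = difficulty - 1 WHERE id = @loser_id;

-- Hardest words first:
SELECT id, word FROM words ORDER BY difficulty DESC;

-- Option 2: one row per comparison, storing word IDs.
CREATE TABLE comparisons (
    more_id INT NOT NULL,  -- the word chosen as more difficult
    less_id INT NOT NULL   -- the word it was chosen over
);

INSERT INTO comparisons (more_id, less_id) VALUES (@winner_id, @loser_id);

-- Times seen and more/less counts per word, highest more/less ratio first
-- (the +1 guards against division by zero):
SELECT w.id, w.word,
       SUM(c.more_id = w.id) AS more,
       SUM(c.less_id = w.id) AS less,
       COUNT(*) AS times_seen
FROM words w
JOIN comparisons c ON w.id IN (c.more_id, c.less_id)
GROUP BY w.id, w.word
ORDER BY SUM(c.more_id = w.id) / (SUM(c.less_id = w.id) + 1) DESC;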
Which do you think is the better solution, or what other ones should I consider?

Did you try Sphinx for this? I'd guess a full-text search engine like Sphinx would solve this with great performance.

Related

SQL - Rank or Order by "human" Relevance

Looking to implement a ranking/ORDER BY feature that ranks products the way we as humans regard them as relevant, not the way a computer regards them as relevant. Currently I have this SQL statement:
select MATCH(productName) AGAINST('xyz' IN NATURAL LANGUAGE MODE) AS relevant...
... ORDER BY relevant DESC
This seems to work well with regard to how many times a 'keyword' appears within the recordset, but it's very yay or nay, if you know what I mean.
However, when searching for "computer console" (in the unlikely event), I would like to see "Playstation", "xBox" and "Nintendo", although I never actually typed these keywords into the search field.
Search for "ladder" I personally would expect to see ladders for height access not the board game "snakes and ladders" or clothing with a ladder patten.
Some with "Iron" I wound not expect "Iron man bedding" to appear within the first page.
Is there an industry way of achieving such a thing, or does anyone have ideas how this could be accomplished? E.g. a secondary table with keywords/search terms matching product_id.
Regards
This may not be exactly the same situation as yours but it may help you.
I designed a relevancy-based search results system for a large content management system I developed at my work.
Content comprises a title, the content and a hidden keywords field (words that should be used for search but are not included in the title or content). [There are lots more fields, but these three will do to demonstrate the concept.]
When content is added it gets indexed: some non-alphanumeric characters are removed, each word is stemmed (i.e. educate, education, educator, educates, etc. all get indexed as the same word), some words are converted to others based on some internal rules, and then they all get stored in an index.
When a search is done the system does the same as above to each keyword (removing unwanted characters, stemming, converting based on internal rules).
The system then gets a list of content that has each of the parsed search keywords anywhere in any of those fields.
My code then parses each of the matching results: first it looks for all of the keywords existing consecutively in one of the fields; if it doesn't find the search phrase it then iteratively looks for smaller groups of keywords until found (i.e. if 4 search keywords are entered it tries all 4 first, then 3, then 2, then 1 if they aren't all found together).
Based on how many of the keywords were found consecutively, the system applies a score to the search result. Higher scores are given based on whether the keyword(s) were found in the title, content or keywords field [this took some fine-tuning], and also on how close to the start of the field they were found.
The results are then given to the client based on this score.
The system works very well in our situation, particularly the grouped keywords part makes for good results.
You could use a similar system in your situation. A search for "ladder" would order a product like "Ladder - extra large" before "Snakes and Ladders Game".
For "computer console" you could add terms like these to a hidden keywords field.
Note that parsing the list for relevancy takes a bit of server resources so this type of system would only be suitable where you have sufficient infrastructure available or where the list of content is not large.

MySQL fulltext search with a minimum number of matches per record

I've been puzzling over this for a little while and I'm looking for feedback as to the best way to implement this.
Essentially I want to be able to search a database of documents for a keyword; for any record that matches, I need to be able to specify how many times the keyword must occur.
The solution I'm about to start work on uses regex within the query to accomplish this; however, from trawling Stack Overflow I can see that this is possibly a little slow. The API needs to be able to do this very quickly, since we're talking thousands of requests per minute.
Is there a faster way than using regex, or shall I pull out the VISA card and invest in some high-end hardware?
To be clear, I'm not tied to MySQL; I need to search a LOT of documents and match only those where the keyword occurs X times.
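For what it's worth, one MySQL-only alternative to the regex is to let a FULLTEXT index cut the candidate set down first and then count occurrences with plain string functions. This is only a sketch, assuming a documents table with a FULLTEXT index on body; note the substring count also matches the keyword inside longer words, so it is approximate:

SELECT id
FROM documents
WHERE MATCH(body) AGAINST('+keyword' IN BOOLEAN MODE)
  -- occurrences = (full length - length with keyword removed) / keyword length
  AND (CHAR_LENGTH(body)
       - CHAR_LENGTH(REPLACE(LOWER(body), 'keyword', '')))
      / CHAR_LENGTH('keyword') >= 3;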

Best solution to count occurrences of words in database

I'm going to scrape a forum's new-threads page for each word appearing in the titles of the threads, to make a sort of popularity trends (like Google Trends). I've found a way to scrape it, but I don't know how I should store the words in the database for optimal performance. I thought of two different ways:
1) Store each word that is new in a row; if the word isn't new, add one to its "occurrences" count.
2) Store each word in its own row, no matter what.
Are there any other solutions to this problem?
If you are going through the trouble of scraping, you should be keeping multiple levels of information.
First, keep track of each forum title that you encounter, along with the date of the posting (and of your finding it) as well as other information. You can put a full text index on the forum title, which will give you nice capabilities for finding similar versions of the same word ("database" and "databases").
Second, store each word separately in a table, along with the date and time of the posting (or of your finding it) and a link back to the postings table. The value of Google Trends is not that it keeps a gross count of words for all time; it is that you can break the counts down over time.
Then, do the aggregation in a query. If you have performance issues, you can partition the data by date, so most queries will only read a subset of the data. If the summaries are highly used, then you can consider summarization on a batch basis, say once per night.
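A minimal sketch of that layout; every table and column name here is an assumed placeholder:

CREATE TABLE postings (
    posting_id INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    posted_at  DATETIME NOT NULL,
    FULLTEXT KEY ft_title (title)  -- for finding similar word forms
);

CREATE TABLE title_words (
    posting_id INT NOT NULL,
    word       VARCHAR(64) NOT NULL,
    posted_at  DATETIME NOT NULL,
    KEY idx_word_date (word, posted_at)
);

-- Aggregate in a query, e.g. daily counts for one word:
SELECT DATE(posted_at) AS day, COUNT(*) AS occurrences
FROM title_words
WHERE word = 'database'
GROUP BY DATE(posted_at)
ORDER BY day;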
Finally, how are you going to deal with different versions of the word appearing over time? With misspellings? With multiple appearances of the same word in one title?
Idea #1 is the most compact, and should generally be the fastest. Check out INSERT ... ON DUPLICATE KEY UPDATE, using a unique key on the word and the date.
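A sketch of that approach; the word_counts table and its columns are assumed names:

CREATE TABLE word_counts (
    word        VARCHAR(64) NOT NULL,
    seen_on     DATE NOT NULL,
    occurrences INT NOT NULL DEFAULT 1,
    UNIQUE KEY uq_word_date (word, seen_on)
);

-- Run once per word found in a title; the unique key turns
-- a repeat sighting on the same day into a counter bump:
INSERT INTO word_counts (word, seen_on)
VALUES ('database', CURDATE())
ON DUPLICATE KEY UPDATE occurrences = occurrences + 1;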
Idea #2 becomes important if you're storing other data than just the word, like the id of the forum thread, etc.
Good luck.

MySQL: Best way to do a backwards full text search?

I am trying to do basically a reverse full text search but have no clue of the best way to go about doing it.
Basically I have a table of key phrases laid out like this:
id - phrase
1 - "hello world"
2 - "goodbye world"
3 - "this is my world"
I then have a set string, such as "Welcome to the hello world group". I want to find the IDs of all rows in my table whose phrase appears exactly in that string. Meaning "o the" would not match, because the word is "to the"; likewise "ello" would not match, because the word is "hello".
Using Full Text Search, this can easily be achieved by doing a search of:
AGAINST ('"hello world"' IN BOOLEAN MODE);
Problem is, I don't believe I can use a full text search, since a full text search finds all rows that contain a single phrase. I want all phrases (from a known set of phrases) that match a single string.
I know how to do this using regex with the following, however it is way too slow. On a table with 400,000 key phrases it took over 40 seconds:
WHERE "the data I know I want to search goes here" REGEXP CONCAT('[[:<:]]', phrases, '[[:>:]]')
What I need is a more optimized way to do this. How could I do it as a full text search, even if I have to temporarily add the string to a table, without looping through and individually checking each keyword?
I really appreciate the feedback as this is really causing my site to lag on adding new data.
If you are willing to consider a solution that reads the phrases out of the database and constructs a separate data structure used for optimized phrase detection, there are two main techniques that solve the problem. Which one is best for you depends on a number of factors, in particular:
How frequently the phrase list is updated
Whether and how you tokenise the text before running the phrase detection
How long the target strings are
Option 1: A hash table of the phrases. This means you simply insert each of the phrases as a key into a hash table (aka dictionary or hash map in many programming languages). The phrase ID becomes the value. Updates are fast and easy, but detecting the phrases in a given string can be hard: firstly you need to tokenise the string and be sure that phrases only occur between token boundaries; secondly, you need to make a lookup in the hash not only for every token, but also for every pair, triple, quadruple, etc. of consecutive tokens. This still works well if the target strings are generally short. You can also maintain a copy of the hash table on disk, e.g. using Berkeley DB. There are ready-to-use modules in the standard library of most programming languages for this.
Option 2: A search trie (or, slightly more advanced, a minimised search trie or a finite-state machine). This can be implemented in very space-efficient ways but is generally larger than a hash table (although 400k entries will not be a problem at all). The big advantage during phrase detection is that you need not cut out tokens (or candidate phrases between token boundaries) before making look-ups. Instead you perform a longest-match look-up at each candidate start position in the text. Storing on disk is possible, although in most programming languages there won't be a standard-library module for this. Updates are quite easy in a trie, but can get difficult (and potentially time-consuming) in a minimised trie or FST.
Both options allow the data structure to be maintained on disk (or a copy of it to be stored on disk, while the actual look-ups happen in memory). But you won't get transaction safety or fault tolerance (which I understand you are not looking for).
You can use a search engine, for example Solr. You can set specific search filters against text, search for words only, and it will be blindingly fast.
Or, as a second idea, you can create your own table that stores every word together with the ID of its phrase, and search that table matching words only. It will be faster because you can index single words better than whole phrases.
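A sketch of that second idea, with assumed names. The application tokenises the input string once; the query keeps only the phrases all of whose words occur in the input, and that handful of candidates can then be verified for word order (e.g. with the original REGEXP test) instead of scanning all 400,000 rows:

CREATE TABLE phrase_words (
    phrase_id INT NOT NULL,
    word      VARCHAR(64) NOT NULL,
    KEY idx_word (word)
);

-- Input "Welcome to the hello world group", tokenised by the app:
SELECT phrase_id
FROM phrase_words
GROUP BY phrase_id
HAVING SUM(word IN ('welcome','to','the','hello','world','group')) = COUNT(*);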

Best usability practice for accepting long-ish account numbers

A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always assumed that it was done to preserve data quality and possibly also to provide a better user experience.
Other more generic examples include Phone with Area Code (10 consecutive digits versus [3,3,4]) and of course SSN (9 digits versus [3,2,4]).
It got me wondering whether there are any known standards out there on the topic. When do you split up your ID#? Specifically with regard to user experience and minimizing data-entry errors.
I know there has been some research into this; the most I can find at the moment is the Wikipedia article on short-term memory, specifically chunking. There's also The Magical Number Seven, Plus or Minus Two.
When I'm providing IDs to end users I personally like to break them up into blocks of 5, which appears to be the same convention the original designer of your system used. I've got no logical reason I can give you for having picked this number other than it "feels right". Short of being able to spend a lot of money on carrying out a study, "gut instinct" and following conventions from other systems is probably the way to go.
That said, if you can make the UI more usable to the user by:
Automatically moving from the end of one field to the start of another when it's complete
Automatically moving from the start of one field to the prior field and deleting the last character when the user presses delete in an empty field that isn't the first one
OR
Replacing it with one long field that has some form of "input mask" on it (not sure if this is doable in plain HTML, but it may be feasible using one of the UI frameworks) so it appears like "_____ - _____ - _____ - ____" and ends up looking like "12345 - 54321 - 12345 - 1234"
It would almost certainly make them happier!
Don't know about standards, but from a personal point of view:
If there are multiple fields, make sure the cursor moves to the next field once a field is full.
If there's only one field, allow spaces/dashes/whatever to be used in that field because you can filter them out. It's really annoying when sites/programs force you to enter dates in "dd/mm/yyyy" format, for example, meaning the day/month must be padded with zeroes. "23/8/2010" should be acceptable.
You need to consider the wider context of your particular application. There are always pros and cons of any design decision, but their impact changes depending on the situation, so you have to think every time.
Splitting the long number into several fields makes it easier to read, especially if you choose to divide the number the same way as most of your users. You can also often validate the input as soon as the user goes to the next field, so you indicate errors earlier.
On the other hand, users rarely type long numbers like that nowadays: most of the time they just copy-paste them from whatever note-keeping solution they have chosen, in whatever format they have there. That means that a single field, without any limit on length or allowed characters, suddenly makes a lot of sense -- you can filter the characters out anyway (just make sure you display the final form of the number to the user at some point). There are also issues with moving the focus between fields, and with browsers remembering previous values (with a single field you just have to select one remembered number, not 4 parts of the same number), etc.
In general, I would say that as browsers slowly become more and more usable, you should take advantage of the mechanisms they provide by using stock solutions, rather than inventing complex solutions of your own. You may be a step ahead of them today, but in two years the browsers will catch up and your site will suck.