Rails String Unique ID - mysql

I want to create a model that has an attribute holding a string-based unique identifier.
I only want the unique string to be 3 characters long, consisting of lowercase letters of the alphabet and numbers.
How do I implement something like the above? How do I avoid collisions? I have looked into MD5, and that seems along the lines of what I want to accomplish - but shorter. I am willing to also seed it with a time if that makes the approach deterministic.
I would love any feedback or pointers on this topic. Thanks!
EDIT:
One solution that has been on my mind is creating a table full of every single permutation, then randomly selecting codes from it as needed and deleting each one once used. Is this a bad approach?

Check out this SO thread; it's got plenty of good suggestions, especially the last answer by Simone Carletti, which points to this post.
There are quite a few options in that post. The one I liked, and which might be useful for you, is the rufus-mnemo gem.

So the solution I decided to roll with after reading some of the questions & answers is quite different from what anyone had suggested.
I created a table to store codes. I wrote a Ruby script to populate this table with every 3-character combo based on the characters I wanted to use. Then on my model a before_save callback assigns a code to the instance if one has not yet been assigned.
This approach ensures that I will never have a collision when assigning a code in the before_save. The slowest part is the generation of the table, but since I only have to do this once I can deal with it.
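For illustration, here is a minimal sketch of the generation step in Python (the asker's script was Ruby; the table and column names here are hypothetical). It emits one INSERT per possible code:

    import itertools
    import string

    ALPHABET = string.ascii_lowercase + string.digits  # 36 characters

    def all_codes(length=3):
        # 36^3 = 46,656 combinations for length 3
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)

    for code in all_codes():
        print(f"INSERT INTO codes (code, used) VALUES ('{code}', 0);")

When assigning, claiming a row atomically (e.g. a single UPDATE ... LIMIT 1 inside the callback's transaction) keeps two concurrent saves from grabbing the same code.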

This gem called alphadecimal might be able to help you.

Related

Is it better to use varchar for a single sub category that has multiple values?

I'm quite new at this, so I wanted some feedback on how I should structure these categorical databases.
So let's say we have an LED computer fan. The LED fan comes in colors, and I want users to be able to filter by color. Evidently there are quite a number of colors, so I have come up with 3 ways to implement this:
1. Use ENUM. However, I have read a few threads about it being problematic and have almost ruled this one out.
2. Create a column for each color and use boolean values to check off which colors the LED fan has. I assume this method would cost more in performance, but I'm no expert and stand to be corrected.
3. Use varchar and enter the color values manually, potentially risking wrong input values being entered but saving performance? (Question mark because I have no clue about performance in databases.)
If I could get some opinions, I would appreciate it! Thank you in advance!

How can I create an efficient MySQL database that handles auto-complete requests like Google's

I'd like to get some ideas on how to create an efficient MySQL database that can handle high-traffic auto-complete requests, like Google's new auto-SERP-update feature.
The key to this is: I am taking the content of my book, and I want to index the text in a way such that the database returns the relevant text with the quickest/least overhead possible.
For Example:
If I were to type the text "as", I would essentially scour the database (the entire book) and see a result set of sentences from the book such as:
"...that is as I was saying..."
"as I intended..."
"This is as good as it gets"
...
But as soon as I type a "k" and it spells "ask", the result set changes to, e.g.:
"Ask your father..."
"...I will ask you to do this."
...
In addition, I was considering adding helper words, so if you are in the middle of typing "askew" but currently have only "ask" spelled, the database would grab all words containing "ask", and you would see helper words like "asking", "askew", "asked", "ask".
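A minimal sketch of the prefix flavor of that helper-word lookup, using a sorted in-memory word list and binary search (the word list here is made up; matching arbitrary infixes would need a different structure, such as the n-gram tables brainstormed below):

    import bisect

    WORDS = sorted(["as", "ask", "asked", "askew", "asking", "aspen"])

    def words_with_prefix(prefix):
        # All words in the sorted list sharing `prefix`, in O(log n + matches).
        lo = bisect.bisect_left(WORDS, prefix)
        hi = bisect.bisect_right(WORDS, prefix + "\uffff")  # end of the prefix range
        return WORDS[lo:hi]

    print(words_with_prefix("ask"))  # ['ask', 'asked', 'askew', 'asking']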
Importantly, this book is a rule book, so it has its own unique key for each rule. Thus, when a person types in a word or two, all rules containing that combination of words will show up in the result set.
I am willing to do my own research beyond whatever help anyone chooses to give. I am at a loss for the kinds of keywords I should be looking up on such a subject, so in addition to specific solutions, keywords for this type of database structure would also be appreciated and helpful.
I have read something about full-text search? Can this be a solution, or is that not efficient enough for my purposes?
I know how to do AJAX calls and auto-completion already... that is not what I am asking for solutions to. What I need is an understanding of how to structure and index the database such that when I write a script to parse the content of my book in text format and insert the tokens into the database, it can later be pulled in the most efficient way. I expect a high level of traffic on the site eventually, so minimizing request overhead is of paramount importance.
As an initial idea, I was thinking of something like tables for each character length greater than 1... thus I'd have tables called "two_letters", "three_letters", etc.
One example record in the "two_letters" table could be "as", and it would have a many-to-many relationship with every rule in the book that contains "as" in it... thus:
"as", "125"
"as", "2024"
"as", "4"
Of course, the smaller the letter set, the larger the table will be. This book is very big, so we're talking millions of records here! One for each combination of 2 letters and the rule it is associated with. Then do it all over again with 3-letter combinations, until there are no more words. This is an initial brainstorming attempt only and may be a terrible idea, but it's my first thought on it.
Once the script is run, the database will create the tables and insert the records as it goes. It will likely read over the content many times for each length of characters.
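A rough sketch of that brainstormed index build, in Python (all names hypothetical; rule ids borrowed from the example rows above):

    from collections import defaultdict

    def build_index(rules, n=2):
        # rules: dict of rule_id -> rule text. Returns chunk -> set of rule ids.
        index = defaultdict(set)
        for rule_id, text in rules.items():
            text = text.lower()
            for i in range(len(text) - n + 1):
                chunk = text[i:i + n]
                if chunk.isalpha():  # skip chunks spanning spaces/punctuation
                    index[chunk].add(rule_id)
        return index

    rules = {4: "as good as it gets", 125: "ask your father", 2024: "that is as I was saying"}
    print(sorted(build_index(rules, n=2)["as"]))  # [4, 125, 2024]

Each index entry would then become one row per rule ("as", 4) in the corresponding n-letter table.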
I want it to recognize multi-word combinations as well, just as a keyphrase in Google would be auto-updated in the SERP. Thus, as the user types "they are go", you may find:
"they are gone already..."
"they are going to the movies later."
"how they are gonna get there is..."
I am essentially asking for that exact auto-complete feature in Google, but the content is a book, not indexed websites on the internet.
I look forward to hearing from some of the geniuses out there that get what I'm asking for here and feel like impressing some people! :)
Thanks in advance to everyone.
I have to recommend Sphinx. It's an amazing search engine for data stored in MySQL (or other databases).
I second Sphinx -- I think Craigslist uses it.

How to correct the user input (Kind of google "did you mean?")

I have the following requirements:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to build a kind of Google "Did you mean". It should list all the possible matching values from my datastore. There is a similar but not identical question here; it did not answer my question.
My questions:
1) I think it is not advisable to store this data in an RDBMS, because then I can't filter in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this one. But, just for the completeness of my question: how do I search through the large data set?
Suppose there is a name Franky in the dataset.
If a user types Phranky, how do I match Franky? Do I have to loop through all the names?
I came across Levenshtein distance, which will be a good technique to find the possible strings. But again, my question is: do I have to operate on all 1 million values from my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say, distance algorithms. Because the former method will require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2 - too high to be considered a typo.
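For reference, the Levenshtein distance mentioned in point 2 is a short dynamic program. A minimal sketch, kept to one row of state at a time:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("phranky", "franky"))  # 2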
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the Soundex values for each name, store them in the database, and then index that column to avoid having to scan the table.
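A hedged sketch of that pre-generation step in Python (this is the classic American Soundex; how you store the resulting codes is up to your schema):

    CODES = {c: str(d) for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

    def soundex(name):
        name = name.lower()
        digits, prev = [], CODES.get(name[0])
        for ch in name[1:]:
            code = CODES.get(ch)
            if code and code != prev:
                digits.append(code)
            if ch not in "hw":  # h/w do not break a run of equal codes; vowels do
                prev = code
        return (name[0].upper() + "".join(digits) + "000")[:4]

    print(soundex("Franky"), soundex("Phranky"))  # F652 P652

Note the retained first letter: Franky and Phranky still land in different buckets, which is one of the weaknesses later answers point out.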
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on Levenshtein distance).
(Update: after having read Ben S's answer, I think using an existing solution, possibly Aspell, is the way to go.)
As others have said, Google does auto-correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something", it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything),
the second query did yield useful results, and
the two queries are similar (have a small Levenshtein distance),
then the second query is a possible refinement of the first, which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
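A toy sketch of that heuristic over a query log (the log format is made up, and difflib's similarity ratio stands in for a small Levenshtein distance):

    import difflib

    WINDOW_SECONDS = 30

    def find_refinements(log):
        # log: list of (timestamp, query, result_count) tuples, sorted by time.
        suggestions = {}
        for (t1, q1, hits1), (t2, q2, hits2) in zip(log, log[1:]):
            close = (t2 - t1) <= WINDOW_SECONDS
            similar = difflib.SequenceMatcher(None, q1, q2).ratio() > 0.8
            if close and hits1 == 0 and hits2 > 0 and similar:
                suggestions[q1] = q2  # later: offer "did you mean q2?" for q1
        return suggestions

    print(find_refinements([(0, "someting", 0), (5, "something", 42)]))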
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd populate the Solr index with the list of names and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion in a number of ways, but the thing they have going for them is lots of data. Sure, they can model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word "someting" does not appear in a dictionary, and even though it is a common misspelling, the correct spelling is far more common. When you find similar words, you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus, and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English: for example, substituting a q for an e is more likely than substituting an l, since q sits closer to e on the keyboard.
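A condensed sketch of that corrector, closely following Norvig's published Python (the corpus here is a stub standing in for the Project Gutenberg word counts):

    import re
    from collections import Counter

    CORPUS = "something is something that something does"  # stub corpus
    WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

    def edits1(word):
        # All strings one edit away: deletes, transposes, replaces, inserts.
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        # Prefer the word itself if known, else known words one edit away.
        candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
        return max(candidates, key=lambda w: WORDS[w])

    print(correct("someting"))  # something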
For people who are recommending Soundex: it is very out of date. Metaphone (simpler) or Double Metaphone (more complex) are much better. If it really is name data, either should work fine, provided the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure: pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, two names must overlap in at least one "phoneme", maybe even two. This pre-indexing step (which has a low false positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
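Illustratively (with a deliberately crude phonetic key; a real implementation would use Metaphone), bucketing names by key keeps the expensive pairwise comparison inside small groups:

    from collections import defaultdict
    from itertools import combinations

    def crude_key(name):
        # Throwaway stand-in for a real phoneme key: leading consonant skeleton.
        return "".join(c for c in name.lower() if c.isalpha() and c not in "aeiouhwy")[:3]

    names = ["Sean", "Shawn", "Shaun", "Smith"]
    buckets = defaultdict(list)
    for n in names:
        buckets[crude_key(n)].append(n)

    for group in buckets.values():
        for a, b in combinations(group, 2):  # run edit distance only within buckets
            print(a, b)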
You have two possible issues that you need to address (or not address, if you so choose):
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2 - too high to be considered a typo.
You should pre-index the counts of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered "sith" I might expect to be asked if I meant "smith"; however, if I typed "smith" it would not make sense to suggest "sith". Determine an algorithm that measures the relative likelihood of a word, and only suggest words that are more likely.
My experience with loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance; then cull everything that doesn't end with the correct letter; and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(n log n), and an O(n^2) algorithm, perform all three, in that order, to ensure you are only putting your 'good prospects' through to the heavy algorithm.
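A minimal sketch of that layered sieve (names and thresholds invented for illustration):

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def sieve(query, names, max_distance=2):
        pool = (n for n in names if n and n[0] == query[0])                   # cheap cull
        pool = (n for n in pool if abs(len(n) - len(query)) <= max_distance)  # cheap cull
        return [n for n in pool if levenshtein(query, n) <= max_distance]     # expensive

    print(sieve("sith", ["smith", "sixth", "simon", "zenith"]))  # ['smith', 'sixth']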

mysql random generated value

I need to generate a random alphanumeric value to give to users when they come to the site to enter. I don't know much about random numbers and such; I know there are seeding issues, but I'm not sure what they are.
So, I used this:
SELECT SUBSTRING(MD5(CONCAT_WS('-', MD5(username_usr),
       MD5(zip_usr), MD5(id_usr),
       MD5(created_usr))), -12) FROM users_usr
Is this safe? I used CONCAT_WS because zip is sometimes NULL, but the others never are.
And yes, I know this is kinda short, but: 1. they have to enter the last 4 digits of their social; 2. it's one-time use; 3. there's no private data displayed back in the application; and 4. I may use a captcha, but since there's no private data, that's probably overkill.
Thanks
Maybe using a Universally Unique Identifier would suffice? Just to keep it simple?
If you need a random alphanumeric value, why are you using so many inputs? Something like the following should be perfectly adequate:
MD5(RAND())
(Flavor: MySQL)
It'd help to know the purpose of the "random" string. This isn't random - it's repeatable, and fairly easily repeatable at that. You're not exposing any sensitive information in a way that's easily reversible, but I'm guessing you're really looking for a way to generate a UUID (universally unique ID). Not coincidentally, recent MySQL versions have a function called UUID():
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
That might better solve the problem you're trying to address. If you really want a random number (which can definitely have collisions, by the way) for some reason, don't worry about seeding. If you don't specify a seed, it'll self-seed in a way that's probably better than a fixed seed anyway. You'd then map that random number (or a series of random numbers) to characters (possibly by casting the integer to a char), and repeat that until you have a string of chars long enough. But it bears repeating that a random number is not a guaranteed unique number...
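If generating the value in application code is an option, a hedged sketch of both routes in Python (uuid4 for uniqueness, secrets for a short random token; a UNIQUE constraint on the column would still be the real collision guard):

    import secrets
    import string
    import uuid

    print(uuid.uuid4())  # globally unique, but 36 characters long

    ALPHABET = string.ascii_lowercase + string.digits

    def random_token(length=12):
        # Random, NOT guaranteed unique - pair with a UNIQUE index on the column.
        return "".join(secrets.choice(ALPHABET) for _ in range(length))

    print(random_token())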
Someone in the deleted duplicate of this question suggested using UUID(), which I think is a good idea. I don't think there's anything greatly wrong with using MD5(RAND()) either.
You'd have to store those, of course, which you don't have to do with your example.
SELECT MD5(RAND() + CURRENT_TIMESTAMP())

Do you use particular conventions for naming complementary variables?

I often find myself trying to come up with good names for complementary pairs of variables, where two variables denote opposing concepts, two participants in some sort of duologue, and so on.
This might be better explained by a counter-example: I maintain an app that prints two graphics as part of a print advertisement. They're stored in the database as TopLogo and LowerLogo, which I have to stop and double-check every time I use them, because I expect top to complement bottom, and lower to complement upper.
There are some obvious examples that I think work well:
client / server
source / target for copying/moving data or files from one variable to another
minimum / maximum
but there are some concepts that just don't lend themselves to such neat naming schemes. For example, when paging through records, does 'last' mean 'final' or 'previous'? I recently saw some code that used firstPage, previousPage, nextPage and finalPage to avoid the ambiguous lastPage completely, which I thought was very neat; hence this question.
Do you have any particularly neat variable name pairs you'd care to share with us? (Bonus points if they're the same length, which makes the code so much neater in monospaced fonts.)
Like with all kinds of code style conventions, consistency is what you should strive for.
I would have the development team agree on "standard" pairs of prefixes for common scenarios like "source/destination" or "from/to" and then stick with them for the whole project. As long as every developer is aware of what is meant with a particular prefix in the codebase, it is easier to avoid misunderstandings.
Exceptions to the rule should be clarified in the documentation if the variable is part of a public API, or in comments within the code if its visibility is restricted to a single class or method.
In my databases you'll find many valid-state temporal ("history") tables containing a pair of columns named start_date and end_date. No bonus points for me, then, because I'd rather use the commonly used 'end' than try to come up with an intuitive alternative with the same number of characters as the word 'start'.
I tend to prefer these generic terms even when more context-specific terms may be viable, e.g. preferring employee_start_date over employee_hire_date (what if their employment started for a reason other than being formally hired, e.g. their company was the subject of an acquisition). That said, I'd prefer person_birth_date over person_start_date :)
While one does try to be semantically coherent in obvious cases -- e.g., maximum goes with minimum, and not "lowest" -- in well-structured OO code (which isn't all code, I know) the problem disappears with a good IDE. Classes are short, methods are short, and variables are few in each method. So it doesn't matter what you call the variable pairs so long as they're clear. Your code might not look professional, but real quality is in the code, not in the look of your code.
The problem further disappears if there is good JavaDoc (or whatever the documentation system is) and if you have good class names to go with them. For instance, if you have an instance of a Connection class and it has a method called setDestination, that's okay; but if you know that setDestination takes one parameter called destination, and it's of the Server class, you're cool... even though you might prefer to call it target, aimHere, placeToSendTheData, or whatever (and the corresponding names source, comingFromHere, and placeToGetTheDataFrom). Plus the doc system says what the thing is for, and that is priceless.
This next thing might sound stupid and I'm sure I'll get voted down here on StackOverflow, but unique non-professional sounding variable names have a great advantage: I know that my variables have names like placeWeWantTheDataToGo (and the IDE takes care of typing it), but the "serious" guys who do the JDK would never use such silly names. So I know immediately that the variable is one of mine. Incidentally, when I worked with developers in Spain and Italy, they write code with Spanish variable names (not always, but usually). This causes the same effect: we can quickly see that the Conexion class is ours, but the Connection class is not.
[Also, instead of typing your variable names, assign them a constant String somewhere in your code and use that, so if they called it lower or downer instead of low, you're still okay.]
Yes, I do try to name complementary sets of variables systematically so that the symmetry is clear. It is not always easy; sometimes, not even possible. Well, not possible using the rules I lay down for myself - which means I usually try to have the names the same length. The 'top' and 'lower' example would drive me batty (assuming I'm not batty already, which is far from certain); I'd probably use 'upper' and 'lower' because those are the same length; 'top' and 'bottom' would frustrate me too because of the difference in length.