Best usability practice for accepting long-ish account numbers

A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always assumed it was done to preserve data quality and perhaps also to provide a better user experience.
Other, more generic examples include a phone number with area code (10 consecutive digits versus [3,3,4]) and of course SSN (9 digits versus [3,2,4]).
It got me wondering whether there are any known standards on the topic. When do you split up an ID number, specifically with regard to user experience and minimizing data-entry errors?

I know there has been some research into this; the most I can find at the moment is the Wikipedia article on short-term memory, specifically the section on chunking. There's also The Magical Number Seven, Plus or Minus Two.
When I'm providing IDs to end users, I personally like to break them up into blocks of 5, which appears to be the same convention the original designer of your system used. I have no logical reason for picking this number other than that it "feels right". Short of spending a lot of money on carrying out a study, gut instinct and following conventions from other systems is probably the way to go.
That said, if you can make the UI more usable to the user by:
Automatically moving from the end of one field to the start of another when it's complete
Automatically moving from the start of one field to the prior field and deleting that field's last character when the user presses backspace in an empty field that isn't the first one
OR
Replacing it with one long field that has some form of "input mask" on it (not sure if this is doable in plain HTML, but it may be feasible using one of the UI frameworks) so it appears like "_____ - _____ - _____ - ____" and ends up looking like "12345 - 54321 - 12345 - 1234"
It would almost certainly make them happier!
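If you do keep multiple fields, a minimal sketch of the first two behaviours in plain DOM scripting (TypeScript; the input.acct-part selector, the field order, and the maxlength attributes are assumptions for the example, not anything from the original site):

const parts = Array.from(
  document.querySelectorAll<HTMLInputElement>("input.acct-part")
);

parts.forEach((field, i) => {
  // Jump to the next field once this one is full (requires maxlength to be set).
  field.addEventListener("input", () => {
    if (field.value.length >= field.maxLength && i < parts.length - 1) {
      parts[i + 1].focus();
    }
  });

  // Backspace in an empty field moves to the previous field and deletes
  // its last character.
  field.addEventListener("keydown", (e: KeyboardEvent) => {
    if (e.key === "Backspace" && field.value === "" && i > 0) {
      const prev = parts[i - 1];
      prev.value = prev.value.slice(0, -1);
      prev.focus();
      e.preventDefault();
    }
  });
});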

Don't know about standards, but from a personal point of view:
If there are multiple fields, make sure the cursor moves to the next field once a field is full.
If there's only one field, allow spaces/dashes/whatever to be used in that field because you can filter them out. It's really annoying when sites/programs force you to enter dates in "dd/mm/yyyy" format, for example, meaning the day/month must be padded with zeroes. "23/8/2010" should be acceptable.
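To illustrate the "accept anything, filter it out" idea, a small TypeScript sketch (the function names are made up for the example):

// Strip spaces, dashes and dots so "12345-54321 12345 1234" and
// "1234554321123451234" are treated identically.
function normalizeAccountNumber(raw: string): string {
  return raw.replace(/[\s.\-]/g, "");
}

// Accept d/m/yyyy without zero-padding, e.g. "23/8/2010".
function parseLooseDate(raw: string): Date | null {
  const m = raw.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (!m) return null;
  const [, day, month, year] = m.map(Number);
  const date = new Date(year, month - 1, day);
  // Reject impossible dates such as 31/2/2010.
  return date.getMonth() === month - 1 && date.getDate() === day ? date : null;
}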

You need to consider the wider context of your particular application. There are always pros and cons of any design decision, but their impact changes depending on the situation, so you have to think every time.
Splitting the long number into several fields makes it easier to read, especially if you choose to divide the number the same way as most of your users. You can also often validate the input as soon as the user goes to the next field, so you indicate errors earlier.
On the other hand, users rarely type long numbers like that nowadays: most of the time they just copy-paste them from whatever note-keeping solution they have chosen, in whatever format they have them there. That means that a single field, without any limit on length or allowed characters, suddenly makes a lot of sense -- you can filter the characters out anyway (just make sure you display the final form of the number to the user at some point). There are also issues with moving the focus between fields, and with browsers remembering previous values (you only have to select one number, not 4 parts of the same number), etc.
In general, I would say that as browsers slowly become more and more usable, you should take advantage of the mechanisms they provide by using the stock solutions, not by inventing complex solutions on your own. You may be a step ahead of them today, but in two years the browsers will catch up and your site will suck.

Related

Database Design: How should I store 'word difficulty' in MySQL?

I made a vocabulary app for Android that has a list of ~5000 words stored in a local database (SQLite), and I want to find out which words are more difficult than others.
To find out, I'm thinking of adding a very simple feature that puts two random words on the screen, asking the user to choose the more difficult one. Then another pair of random words will show, and this process can be repeated for as long as the user wants. The more users who participate in this 'which word is more difficult' exercise, the better the app should, in theory, be able to distinguish difficult words from easy ones.
Since the difficulty would be based on input from all users, I know I need to keep track of it online so that every app could then fetch them from the database on my website (which is MySQL). I'm not sure what would be the most efficient way to keep track of the difficulty, but I came up with two possible solutions:
1) Add a difficulty column that holds integer values to the words table. Then, for every pair of words that a user looks at and ranks, the word that he/she chooses as more difficult would have its difficulty increased by one, and the word not chosen would have its difficulty decreased by one. I could simply order by that integer value to get the most difficult ones.
2) Create a difficulty table with two columns, more and less, that hold words (or IDs of the words, to save space) based on the results of each selection a user makes. I'm still unsure how I would get the most difficult words -- some combination of GROUP BY and ORDER BY?
The benefit of my second solution is that I can know how many times each word has been seen (# of rows in the more column that contain the word + # of rows in the less column that contain the word). That helps with statistics, like finding out which word has the highest more/less ratio. But it would also take up much more space than my first suggested solution, and I don't know how well it would scale.
Which do you think is the better solution, or what other ones should I consider?
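(For concreteness, option 1 would boil down to something like the sketch below -- MySQL-flavoured SQL wrapped in TypeScript; the word_id column name is an assumption, since I haven't named the primary key above.)

// Record one comparison: the chosen word gets +1, the other -1.
// The IDs are numbers, so simple interpolation is safe here.
function recordChoice(chosenId: number, otherId: number): string[] {
  return [
    `UPDATE words SET difficulty = difficulty + 1 WHERE word_id = ${chosenId}`,
    `UPDATE words SET difficulty = difficulty - 1 WHERE word_id = ${otherId}`,
  ];
}

// Hardest words first:
const hardestWordsSql =
  "SELECT word, difficulty FROM words ORDER BY difficulty DESC LIMIT 50";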
Did you try Sphinx for this? I'd guess a full-text search engine like Sphinx would solve this with great performance.

Database design: Using hundreds of fields for small values

I'm planning to develop a PHP web app; it will mainly be used by registered users (sessions).
While thinking about the DB design, I was contemplating that in order to give the best user experience possible there would be lots of options for the user to activate, deactivate, specify, etc.
For example:
- Options for each layout elements, dialog boxes, dashboard, grid, etc.
- color, size, stay visible, invisible, don't ask again, show every time, advanced mode, simple mode, etc.
This would add up to hundreds of fields, ranging from simple Yes/No values to 1-to-N values, for each user.
So, is having a field for each of these options the way to go?
Or how do CRMs, CMSes, and other web apps store lots of 1-2 character values?
Do they group them on Text fields separated by a special char and then "explode" them as an array for runtime usage?
thank you
How about something like this:
CREATE TABLE settings (
user_id INT,
setting_name VARCHAR(255),
setting_value CHAR(2)
)
That way, to store a configuration setting for a user, you can do:
INSERT INTO settings (user_id, setting_name, setting_value)
VALUES (1, "timezone", "+8")
And when you need to query a setting for a particular user, you can do:
SELECT setting_value FROM settings
WHERE user_id = 1 AND setting_name = "timezone"
I would absolutely be inclined to have individual fields for each option. My rule of thumb is that each column holds exactly one piece of data whenever possible. No more, no less. As was mentioned earlier, the ease of maintenance and the ability to add/drop options down the road far outweigh the pain in the arse of setting it up.
I would, however, put some thought into how you create the table(s). The idea mentioned earlier was to have a Settings table with 100 columns (one for each option) and one row for each user. That would work, to be sure. If it were me, I would be inclined to break it down a bit further.
You start with a basic User table, of course. That would hold the basics of username, password, userid, etc. That way you can use the numeric userid as the key index for your Settings table(s). But after that I would try to break down the settings into smaller tables based on logical usage. For example, if you have 100 options, and 19 of those pertain to how a user views / is viewed / behaves in one specific part of the site, say something like a forum, then break those out into a separate table, i.e. ForumSettings. Maybe there are 12 more that pertain to email preferences, but would not be used in other areas of the site/app. Now you have an EmailSettings table.
Doing this would not only reduce the number of columns in your generic Settings table, but it would also make writing queries for specific tasks or areas of the app much easier, speed up performance a tick, and make maintenance moving forward far less painful. Some may disagree, as from a strictly data-modeling perspective I'm pretty sure the single Settings table would be indicated. But from a real-world perspective, I have never gone wrong using logical chunks such as this.
From a pure data-model perspective, that would be the clearest design (though awfully wide). Some might try to bitmask them into a single field for assumed space reasons, but the logic to encode/decode makes that not worthwhile, in my opinion. You also lose the ability to index on them.
Another option (which I just saw posted) is to have a separate table with an FK back to the user table. But then you have to iterate over the results to get the value you want to check.

implementing a blacklist of usernames

I have a block list of names/words; it has about 500,000+ entries. The use of the data is to prevent people from entering these words as their username or name. The table structure is simple: word_id, word, create_date.
When the user clicks submit, I want the system to look up whether the entered name is an exact match or a word% match.
Is this the only way to implement a block or is there a better way? I don't like the idea of doing lookups of this many rows on a submit as it slows down the submit process.
Consider a few points:
Keep your blacklist (business logic) checking in your application, and perform the comparison in your application. That's where it most belongs, and you'll likely have richer programming languages to implement that logic.
Load your half million records into your application, and store it in a cache of some kind. On each signup, perform your check against the cache. This will avoid hitting your table on each signup. It'll be all in-memory in your application, and will be much more performant.
Ensure myEnteredUserName doesn't have a blacklisted word at the beginning, end, and anywhere in between. Your question specifically had a begins-with check, but ensure that you don't miss out on 123_BadWord999.
Caching brings its own set of new challenges; consider reloading from the database every n minutes, or at a certain time or event. This will allow new blacklisted words to be loaded and old ones to be thrown out.
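A minimal sketch of the in-memory cache idea (TypeScript; loadAllWords and the refresh policy are placeholders for whatever your application actually uses):

let blacklist: string[] = [];

// Reload every n minutes, or on an admin event, so new entries are picked up.
async function refreshBlacklist(loadAllWords: () => Promise<string[]>) {
  blacklist = (await loadAllWords()).map((w) => w.toLowerCase());
}

function isBlocked(enteredUserName: string): boolean {
  const name = enteredUserName.toLowerCase();
  // Catches the word at the beginning, the end, and anywhere in between,
  // e.g. "123_BadWord999". This is a straight linear scan over the list;
  // measure it with your half-million entries before committing to it.
  return blacklist.some((word) => name.includes(word));
}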
You can't do WHERE loginName = word%. The % wildcard can only appear in a string literal (with LIKE), not be applied to column data.
You would need to say where 'logi' = word or 'login' = word or ... where you compare substrings of the login name with the bad words. You'll need to test each substring whose length is between the shortest and longest bad word, inclusive.
Make sure you have an index on the word column of your table, and see what performance is like.
Other ways to do this would be:
Use Lucene; it's good at quickly searching text, especially if you just need to know whether or not your substring exists. Of course, Lucene might not fit technically in your environment -- it's a Java library.
Take a hash of each bad word, and record them in a bitset in memory -- this will be small and fast to look up, and you'll only need to go to the database to make sure that a positive isn't false.
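A rough sketch of that hash-and-bitset idea (TypeScript; the bitset size and the FNV-style hash are arbitrary choices for illustration):

const BITS = 1 << 24;                    // ~16M buckets, 2 MB of memory
const bitset = new Uint8Array(BITS / 8);

function hash(word: string): number {
  let h = 2166136261;                    // FNV-1a, just as an example hash
  for (let i = 0; i < word.length; i++) {
    h ^= word.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % BITS;
}

function addWord(word: string): void {
  const h = hash(word);
  bitset[h >> 3] |= 1 << (h & 7);
}

function mightBeBlocked(word: string): boolean {
  const h = hash(word);
  // A clear bit means definitely not blacklisted; a set bit means
  // "possibly blacklisted", so confirm against the database.
  return (bitset[h >> 3] & (1 << (h & 7))) !== 0;
}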

MySQL Column Unification, any performance improvements?

I'm designing a MySQL table for an authentication system for a high-traffic personal website. Every time a user comment, article, etc. is displayed, the following fields will be needed:
login
User Display
User Bio ( A little signature )
Website Account
YouTube Account
Twitter Account
Facebook Account
Lastfm Account
So everything is in one table to prevent the need to call sub-tables. So my question is:
Would there be any improvement if I combined the Website, YouTube, Twitter, Facebook, and Lastfm columns into one?
For example:
[website::something.com][youtube::youtube.com/something]
No, combining these columns would not result in any improvement. Indeed, it seems you would extend the overall length (by adding prefixes and separators), hence potentially worsening performance.
A few other tricks however, may help:
reduce the size of the values stored in the "xxxAccount" columns by removing altogether, or replacing with short-hand codes, the most common parts of these values (the examples shown indicate some kind of URL whose beginning will likely be repeated).
depending on the average length of the bio, and typical text found therein, it may also be useful to find ways of shrinking its [storage] size, with simple replacement of common words, or possibly with actual compression (ZIP and such), although doing so may result in having to store the column in a BLOB column which may then become separated from the table, depending on the server implementation/configuration.
And, of course, independently from any improvements at the level of the database, the usage model indicated seems to call for caching this kind of data aggressively, to avoid the trip to SQL altogether.
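To illustrate the first point, a small sketch of the short-hand-code idea (TypeScript; the prefix list is invented for the example):

const prefixes: [string, string][] = [
  ["http://www.youtube.com/", "yt:"],
  ["http://twitter.com/", "tw:"],
  ["http://www.facebook.com/", "fb:"],
  ["http://www.last.fm/user/", "lf:"],
];

// Replace a known prefix with its short code before storing.
function shorten(url: string): string {
  for (const [prefix, code] of prefixes) {
    if (url.startsWith(prefix)) return code + url.slice(prefix.length);
  }
  return url; // unknown prefix: store as-is
}

// Restore the full URL when displaying.
function expand(stored: string): string {
  for (const [prefix, code] of prefixes) {
    if (stored.startsWith(code)) return prefix + stored.slice(code.length);
  }
  return stored;
}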
Well, I don't think so. Think of it this way: you would need some way to split them, and that would require additional processing; by that logic, why not just have one field in the whole table and put everything in it? :) Don't worry about the performance; it will be better with separate columns.

How to correct the user input (Kind of google "did you mean?")

I have the following requirement:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make something like Google's "Did you mean". It will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question.
My question:
1) I think it is not advisable to store this data in an RDBMS, because then I won't have a filter on the SQL queries and will have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types Phranky, how do I match Franky? Do I have to loop through all the names?
I came across Levenshtein Distance, which will be a good technique to find the possible strings. But again, my question is do I have to operate on all 1 million values from my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say, distance algorithms, because the former method would require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on the Levenshtein distance).
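To make the pre-generation idea concrete, here is a simplified Soundex in TypeScript (real implementations handle a few more edge cases, such as H/W separators):

const soundexCodes: Record<string, string> = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6",
};

function soundex(name: string): string {
  const s = name.toLowerCase().replace(/[^a-z]/g, "");
  if (!s) return "";
  let result = s[0].toUpperCase();
  let prev = soundexCodes[s[0]] ?? "";
  for (const ch of s.slice(1)) {
    const code = soundexCodes[ch] ?? "";
    if (code && code !== prev) result += code;
    prev = code;
  }
  return (result + "000").slice(0, 4);   // pad/truncate to 4 characters
}

// soundex("Franky") === "F652" and soundex("Phranky") === "P652":
// close, but the differing first letter shows a weakness that
// Metaphone-style algorithms (mentioned below) try to address.

You could store this value in an indexed column alongside each name and query on it instead of scanning the whole table.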
(Update: after having read Ben S's answer, using an existing solution, possibly aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, transposing an e and a q is more likely than transposing an e and an l.
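A compact sketch of that approach in TypeScript (Norvig's original essay, in Python, has the full treatment; this version skips smoothing, keyboard distance, and context):

const letters = "abcdefghijklmnopqrstuvwxyz";

// Count word frequencies in a corpus: this doubles as the dictionary
// and the probability estimate.
function train(corpus: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of corpus.toLowerCase().match(/[a-z]+/g) ?? []) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  return counts;
}

// All strings one edit away: deletions, transpositions, substitutions, insertions.
function edits1(word: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i <= word.length; i++) {
    const left = word.slice(0, i);
    const right = word.slice(i);
    if (right) out.add(left + right.slice(1));
    if (right.length > 1) out.add(left + right[1] + right[0] + right.slice(2));
    for (const c of letters) {
      if (right) out.add(left + c + right.slice(1));
      out.add(left + c + right);
    }
  }
  return out;
}

function correct(word: string, counts: Map<string, number>): string {
  const known = (ws: Iterable<string>) => [...ws].filter((w) => counts.has(w));
  // Prefer the word itself, then known words 1 edit away, then 2 edits away.
  const candidates = known([word]).length
    ? known([word])
    : known(edits1(word)).length
      ? known(edits1(word))
      : known([...edits1(word)].flatMap((w) => [...edits1(w)]));
  if (!candidates.length) return word;
  return candidates.reduce((best, w) =>
    (counts.get(w) ?? 0) > (counts.get(best) ?? 0) ? w : best
  );
}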
For people who are recommending Soundex: it is very out of date. Metaphone (simpler) or Double Metaphone (more complex) are much better. If it really is name data, they should work fine, provided the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, they have to have a "phoneme" overlap, or maybe even two. This pre-indexing step (which has a low false-positive rate) can take the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, but if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word, and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
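As a sketch of that layered approach (TypeScript; the particular sieves are just the examples above, and maxDistance = 2 mirrors the "edit distance of 2 or less" heuristic mentioned earlier):

// Plain rolling-array Levenshtein: the expensive step we want to run last.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,                              // deletion
        dp[i - 1] + 1,                          // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

function suggest(query: string, names: string[], maxDistance = 2): string[] {
  const q = query.toLowerCase();
  return names
    .map((n) => n.toLowerCase())
    // Cheap O(n) sieves first:
    .filter((n) => Math.abs(n.length - q.length) <= maxDistance)
    .filter((n) => n[0] === q[0])
    .filter((n) => n[n.length - 1] === q[q.length - 1])
    // Only the survivors reach the expensive comparison.
    .filter((n) => levenshtein(q, n) <= maxDistance);
}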