Where to split directory groupings? A-F | G-K | L-P (language-agnostic)

I'm looking to build a "quick link" directory access widget.
e.g. (option 1)
0-9 | A-F | G-K | L-P | Q-U | V-Z
Where each would be a link into sub-chunks of a directory starting with that character. The widget itself would be used in multiple places for looking up contacts, companies, projects, etc.
Now, for the programming part... I want to know if I should split as above...
0-9 | A-F | G-K | L-P | Q-U | V-Z
(10+ | 6 | 5 | 5 | 5 | 5 characters per group)
This split is fairly even and logically grouped, but what I'd like to know is whether there is a more optimal split based on the quantity of typical results starting with each letter. (option 2)
e.g. very few items will start with "Q".
(Note: this is currently for a "North American/English" deployment.)
Does anyone have any stats that would back up reasons to split differently?
Likewise, for usability, how do users like/dislike this type of thing? I know that if I am looking for, say, "S", it takes me a second to recall that it falls in the Q-U section.
Would it be better to do a big list like this? (option 3)
#|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z

I would suggest one link per letter and hiding the letters that don't have any results (if that doesn't ask for too much processing power).

As a user I would most definitely prefer one link per letter.
But better (for me as a user) would be a search box.

I think you're splitting the wrong thing. You shouldn't evenly split the letters; you should evenly split the results (as best you can).
If you want roughly 20 results per page, and A has 28 while B and C together have 15, you'll want to have
A
B-C
and so on.
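Here's a rough sketch of that greedy grouping in Python, assuming you already have per-letter result counts (the letter_counts values below are made up for illustration):

    # Greedy grouping: walk the alphabet in order and start a new group
    # whenever adding the next letter would push the group past the target size.
    def group_letters(letter_counts, target_per_group=20):
        groups, current, current_total = [], [], 0
        for letter in sorted(letter_counts):
            count = letter_counts[letter]
            if current and current_total + count > target_per_group:
                groups.append(current)
                current, current_total = [], 0
            current.append(letter)
            current_total += count
        if current:
            groups.append(current)
        # Render each group as an "A" or "B-C" style label
        return ["{}-{}".format(g[0], g[-1]) if len(g) > 1 else g[0] for g in groups]

    # Made-up counts:
    print(group_letters({"A": 28, "B": 7, "C": 8, "D": 21, "E": 5}))
    # -> ['A', 'B-C', 'D', 'E']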
Additionally, you might have to consider why you are using alphabet chunking instead of something a bit more contextual. The problem with alphabet chunking is that users have to know the name of what they are looking for, and that name has to be the same as yours.
EDIT: We've tested this in lab conditions, and users locate information equally quickly whether you chunk by result count or by number of letters.
EDIT 2: Chunking by letters almost always tests poorly. Consider whether there are better ways to do this.

Well, one of the primary usability considerations is evenly-distributed groups, so either your current idea (0-9, A-F, etc.) would work well, or the list with each individual letter. Having inconsistently-sized groups is a definite no-no for a user interface.

You almost certainly don't want to split across a number - that is, something like
0-4 | 5-B | ...
Besides that, I'd say just see where your data lies. Write a program to do groupings of two, three, four, five, etc... and see what the most even split for each grouping is. Pick the one that seems nicest. If you have sparse data, then having one link per letter might be annoying if there are only 1 or 2 directories with that name.
Then again, it depends what a typical user will be looking for. I can't tell what that might be just from your description - are they just navigating a directory tree?

I almost always use the last option since it is by far the easiest for a user to navigate. Use that if you have enough room for it, and one of the others if you have a limited amount of screen real estate.


Two sets of sequences: how do I reset them when a record is deleted?

The other similar questions do not solve my problem. This has nothing to do with the PK.
My app, built with Web2Py, is for salespeople to make quotes. I have products with a monthly cost, products with a purchase cost, and some with both.
The output is two separate tables (monthly and purchase). The salespeople want to be able to change the order in which products appear on the quote, and the products also need to be numbered sequentially in the output.
However, a product in the offer may have only a monthly cost, only a purchase cost, or both, so the order columns look like this:
monthly_order   purchase_order
1               0
2               1
3               0
0               2
0               3
0               4
Which is all fine until a product needs to be removed from an offer.
If, for example, the second item is deleted, I need to update both sequences.
The sequences are not very long - 20 items max.
Is there a better way to store the ordering? If not, is there a neater solution than retrieving and updating every record in an offer?
The short answer is no. Thanks CL for the link.
This solves the problem, but as the article states, it is a method that will eventually break. However, that is very unlikely in this situation: a user would have to spend a very long time changing the order for it to happen, and each sequence is only available to one user and his management.
I have decided to use the divide-by-2 method, as it reduces the database load, and to follow up with maintenance that re-does the sequences, as suggested in the article.
Considering the scope of the project, I feel the solution fits.
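For anyone finding this later, a minimal sketch of the divide-by-2 idea in plain Python (the item structure and field names are made up, not the actual Web2Py schema): each row keeps a float sort key, moving a row assigns it the midpoint of its new neighbours' keys, deleting a row touches nothing else, and the visible 1, 2, 3... numbering is produced at render time.

    # Midpoint ("divide by 2") ordering sketch.
    items = [
        {"name": "Product A", "sort_key": 1.0},
        {"name": "Product B", "sort_key": 2.0},
        {"name": "Product C", "sort_key": 3.0},
    ]

    def move_between(item, prev_key, next_key):
        # The new key is halfway between the neighbours; repeated moves shrink the
        # gaps, which is why the article recommends periodic renumbering.
        item["sort_key"] = (prev_key + next_key) / 2.0

    def renumber(items):
        # Maintenance pass: restore whole-number keys once the gaps get tiny.
        for i, item in enumerate(sorted(items, key=lambda x: x["sort_key"]), start=1):
            item["sort_key"] = float(i)

    # Move Product C between A and B, then render with sequential numbers.
    move_between(items[2], 1.0, 2.0)
    for position, item in enumerate(sorted(items, key=lambda x: x["sort_key"]), start=1):
        print(position, item["name"])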

How to store a multiple choice usefully in a database?

This is all about how to store content in the most efficient way in a database.
The most important thing here is not to save as much space as possible - the focus lies on the fastest way to use this data.
So in general it's a simple setup:
We have 10 choices shown as checkboxes - we can select all of them, none of them, or just one/some of them.
So in general I see two general options to save the result in my database :
A) Just make 10 fields in my table, each TINYINT(1), and set them to 0 or 1
B) I could use ONE INT field that encodes the result like a bitmask - e.g. if you choose options 3, 5 & 8 it's stored like 00101001.
So the question is: which makes more sense?
B will take only 4 bytes, while A will take 10 bytes (one per TINYINT column) - besides, B will need a short PHP function to decode the bitmask.
The question now is: which option do you think will hold up better once the database gets a hell of a lot of queries?
It would probably be better to have 10 different fields, unless you know that each query will always need ALL 10 fields.
It'll be easier for the programmer (not always having to calculate bitvectors). It's easier to add new columns or remove them. You can put small, fast indexes on some columns, but not all.
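For what it's worth, here is a sketch of the encode/decode that option B needs (shown in Python rather than PHP; mapping option 1 to the lowest bit is my assumption, not something stated in the question):

    def encode_choices(selected):
        """selected is an iterable of 1-based option numbers, e.g. [3, 5, 8]."""
        value = 0
        for option in selected:
            value |= 1 << (option - 1)
        return value

    def decode_choices(value, num_options=10):
        """Return the list of 1-based option numbers whose bit is set."""
        return [option for option in range(1, num_options + 1)
                if value & (1 << (option - 1))]

    packed = encode_choices([3, 5, 8])
    print(packed)                  # 148 (binary 10010100)
    print(decode_choices(packed))  # [3, 5, 8]

The trade-off from the answer above still applies: with the bitmask, the database can't index individual options, so a query like "everyone who picked option 3" has to examine every row.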

Best way to store a user's "favorites" in MySQL

I have a photo gallery. I want to add "Add to favorites" button - so user can add other user to his/her favorites. And then I want each user to be able to watch his list of favorite users, as well as to be able to watch who (list of users) added this user to favorites.
I found two ways, and the first is:
faver_id faved_id
1 10
1 31
1 24
10 1
10 24
I don't like this method because of:
1) lots of repetition; 2) a very large table in the future (if I have at least 1001 users, and each likes the other 1000 users, that's 1,001,000 records), which I suppose will slow down my database.
The second way is:
user_id favs
1 1 23 34 56 87 23
10 45 32 67 54 34 88 101
I can take these favs and explode() them in PHP, or check whether a user likes some other user with a MySQL query like SELECT COUNT(user_id) FROM users WHERE favs LIKE '% 23 %' AND user_id = 10;
But I feel the second way is not very "correct" in MySQL terms.
Can you advice me something?
Think about this. Your argument against using the first approach is that your tables might get too big, but you then go on to say that if you use the second approach you could run a wildcard query to find fields which contain something.
The second approach forces a full table search, and is unindexable. With the first approach, you just slap indexes on each of your columns and you're good to go. The first approach scales much, much, much better than the second one. Since scaling seems to be your only concern with the first, I think the answer is obvious.
Go with the first approach. Many-to-Many tables are used everywhere, and for good reason.
Edit:
Another problem is that the second approach hands a lot of the work of maintaining the database off to the application. This is fine in some cases, but the cases you're talking about are things that the database excels at. You would only be reinventing the wheel, and badly.
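To make the first approach concrete, here is an illustrative schema plus the two lookups you want (SQLite through Python only so the snippet runs standalone; the same schema and indexes apply in MySQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE favorites (
            faver_id INTEGER NOT NULL,
            faved_id INTEGER NOT NULL,
            PRIMARY KEY (faver_id, faved_id)   -- also blocks duplicate favorites
        );
        CREATE INDEX idx_faved ON favorites (faved_id);  -- "who favorited me" lookups
    """)
    conn.executemany("INSERT INTO favorites VALUES (?, ?)",
                     [(1, 10), (1, 31), (1, 24), (10, 1), (10, 24)])

    # User 1's favorites:
    print(conn.execute("SELECT faved_id FROM favorites WHERE faver_id = 1").fetchall())
    # Users who favorited user 24:
    print(conn.execute("SELECT faver_id FROM favorites WHERE faved_id = 24").fetchall())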
Definitely go with the first way.
Well, the second way is not that easy when you want to remove items or make changes, but it's all right in terms of MySQL.
Though, Joomla even stores different kinds of information in the same field, called params.

Best usability practice for accepting long-ish account numbers

A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always assumed it was done to preserve data quality and possibly also to provide a better user experience.
Other more generic examples include Phone with Area Code (10 consecutive digits versus [3,3,4]) and of course SSN (9 digits versus [3,2,4])
It got me wondering whether there are any known standards out there on the topic. When do you split up an ID number, specifically with regard to user experience and minimizing data entry errors?
I know there was some research into this, the most I can find at the moment is the Wikipedia article on Short-term memory, specifically chunking. There's also The Magical Number Seven, Plus or Minus Two.
When I'm providing IDs to end users, I personally like to break them up into blocks of 5, which appears to be the same convention the original designer of your system used. I've got no logical reason that I can give you for having picked this number other than it "feels right". Short of being able to spend a lot of money on carrying out a study, "gut instinct" and following conventions from other systems is probably the way to go.
That said, if you can make the UI more usable to the user by:
Automatically moving from the end of one field to the start of another when it's complete
Automatically moving from the start of one field to the prior field and deleting the last character when the user presses delete in an empty field that isn't the first one
OR
Replacing it with one long field that has some form of "input mask" on it (not sure if this is doable in plain HTML, but it may be feasible using one of the UI frameworks) so it appears like "_____ - _____ - _____ - ____" and ends up looking like "12345 - 54321 - 12345 - 1234"
It would almost certainly make them happier!
Don't know about standards, but from a personal point of view:
If there are multiple fields, make sure the cursor moves to the next field once a field is full.
If there's only one field, allow spaces/dashes/whatever to be used in that field because you can filter them out. It's really annoying when sites/programs force you to enter dates in "dd/mm/yyyy" format, for example, meaning the day/month must be padded with zeroes. "23/8/2010" should be acceptable.
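A tiny server-side sketch of that "accept anything and filter it out" idea, applied to the 19-digit account number from the question (the exact validation rules here are my assumptions, not a standard):

    import re

    def normalize_account_number(raw, expected_length=19):
        """Strip spaces, dashes, and dots, then validate what's left."""
        digits = re.sub(r"[\s.\-]", "", raw)
        if not digits.isdigit():
            raise ValueError("account number may only contain digits and separators")
        if len(digits) != expected_length:
            raise ValueError("expected %d digits, got %d" % (expected_length, len(digits)))
        return digits

    # All of these normalize to the same value:
    for raw in ["1234567890123456789", "12345-67890-12345-6789", "12345 67890 12345 6789"]:
        print(normalize_account_number(raw))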
You need to consider the wider context of your particular application. There are always pros and cons of any design decision, but their impact changes depending on the situation, so you have to think every time.
Splitting the long number into several fields makes it easier to read, especially if you choose to divide the number the same way as most of your users. You can also often validate the input as soon as the user goes to the next field, so you indicate errors earlier.
On the other hand, users rarely type long numbers like that nowadays: most of the time they just copy-paste them from whatever note-keeping solution they have chosen, in whatever format they have them there. That means a single field, without any limit on length or allowed characters, suddenly makes a lot of sense - you can filter the characters out anyway (just make sure you display the final form of the number to the user at some point). There are also issues with moving the focus between fields, and with browsers remembering previous values (you only have to select one number, not 4 parts of the same number), etc.
In general, I would say that as browsers slowly become more and more usable, you should take advantage of the mechanisms they provide by using stock solutions, rather than inventing complex solutions on your own. You may be a step ahead of them today, but in two years the browsers will catch up and your site will suck.

How to correct user input (kind of like Google's "Did you mean?")

I have the following requirements:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to build a kind of Google "Did you mean", which will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question.
My questions:
1) I think it is not advisable to store this data in an RDBMS, because then I can't apply useful filters in the SQL queries and I would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types "Phranky", how do I match "Franky"? Do I have to loop through all the names?
I came across the Levenshtein distance, which looks like a good technique for finding the possible strings. But again, my question is: do I have to run it against all 1 million values in my data store?
3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say, distance algorithms - because the former method would require a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
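If it helps, here is a simplified Soundex sketch (American Soundex minus the h/w separator rule) showing what you'd precompute and store per name:

    def soundex(name):
        """Simplified American Soundex (ignores the h/w separator rule)."""
        codes = {"b": "1", "f": "1", "p": "1", "v": "1",
                 "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
                 "s": "2", "x": "2", "z": "2",
                 "d": "3", "t": "3", "l": "4",
                 "m": "5", "n": "5", "r": "6"}
        name = name.lower()
        encoded = [codes.get(ch, "") for ch in name]
        result, prev = [], encoded[0]
        for code in encoded[1:]:
            if code and code != prev:   # drop vowels/h/w/y and adjacent duplicates
                result.append(code)
            prev = code
        return (name[0].upper() + "".join(result) + "000")[:4]

    # Compute this once per name, store it in an indexed column, and query
    # WHERE soundex_code = soundex(search_term) instead of scanning the table.
    print(soundex("Sean"), soundex("Shawn"))      # S500 S500 -- these match
    print(soundex("Franky"), soundex("Phranky"))  # F652 P652 -- these don't,
    # because Soundex keeps the first letter verbatim (one reason the Metaphone
    # family mentioned further down is often preferred for name matching).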
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on the Levenshtein distance).
(Update: having read Ben S's answer, using an existing solution, possibly Aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
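A rough sketch of that heuristic (the time window and the distance threshold are made-up numbers; distance() is a plain Levenshtein implementation):

    def distance(a, b):
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    refinements = {}  # misspelled query -> suggested correction

    def observe(user_queries, max_gap_seconds=60, max_distance=2):
        """user_queries: chronological (timestamp, query, result_count) tuples for one user."""
        for (t1, q1, hits1), (t2, q2, hits2) in zip(user_queries, user_queries[1:]):
            if (t2 - t1 <= max_gap_seconds and hits1 == 0 and hits2 > 0
                    and distance(q1, q2) <= max_distance):
                refinements[q1] = q2

    observe([(0, "someting", 0), (5, "something", 42)])
    print(refinements)  # {'someting': 'something'}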
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, transposing an e and a q is more likely than transposing an e and an l.
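Here is a condensed sketch of that approach, closely following Norvig's published corrector (big.txt stands in for whatever corpus you train on; with name data you would count your million names instead of book text):

    import re
    from collections import Counter

    WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

    def edits1(word):
        """All strings one delete, transpose, replace, or insert away from word."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts    = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    def correction(word):
        candidates = (known([word]) or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=lambda w: WORDS[w])  # most frequent candidate wins

    print(correction("someting"))  # "something", given a typical English corpus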
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, two names need to share a "phoneme", or maybe even two. This pre-indexing step (which has a low false positive rate) can bring the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
You have two possible issues that you need to address (or not address, if you so choose):
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, however if I typed smith it would not make sense to suggest sith. Determine an algorithm that measures the relative likelihood of a word, and only suggest words that are more likely.
My experience with loose matching reinforced a simple but important lesson - perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
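A sketch of that layering (the sieves and the name list are made up; difflib's ratio is used only as a stand-in for the final edit-distance score, which you compute on whatever survives the cheap passes):

    import difflib

    def candidates(query, names, max_results=10):
        # Cheap sieves first: same first letter, same last letter, similar length.
        pool = [n for n in names if n[0].lower() == query[0].lower()]
        pool = [n for n in pool if n[-1].lower() == query[-1].lower()]
        pool = [n for n in pool if abs(len(n) - len(query)) <= 2]
        # Expensive pass last, on the smallest possible set.
        scored = sorted(pool, reverse=True,
                        key=lambda n: difflib.SequenceMatcher(None, query.lower(), n.lower()).ratio())
        return scored[:max_results]

    print(candidates("sith", ["smith", "smyth", "stitch", "jones", "swift"]))
    # 'jones' and 'swift' are culled by the cheap sieves and never reach the
    # scoring step; 'smith' ranks first among the survivors.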