Keyboard key numbering - mysql

I am creating a database of keyboards using MySQL and PHP.
Is there a standard way to number the individual keys on a keyboard based on their physical position?
I want to use these numbers in the index column of my database. I have been looking at this, but it is missing some of the newer keys:
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.keyboardtechref/doc/kybdtech/Key.htm#dbffbe3450dagi
Using scancodes might be an alternative, but converting back and forth between decimal and hexadecimal would be an annoyance. Also, some keys sometimes have two sets of scancodes, which would be problematic for my database. Thanks.

Short answer: No.
Long answer: There are more keyboards out there than you could catalog in a lifetime. Every major vendor has produced localized keyboards for hundreds of countries in dozens of versions and iterations over the last thirty-plus years. Some countries have more than one keyboard layout, like Canada, which has both US English and Canadian French.
Even focusing on a single vendor with a relatively standardized layout, like Apple, would require hundreds of hours of work to identify, precisely, the nature of each key and its physical location and shape on the keyboard.
The scancodes are fairly consistent, but I'm not sure that's much help when the key itself may be remapped in software to do different things, or may have different meanings in different languages.
You're going to need to experiment, explore, and probably buy an embarrassing number of keyboards on eBay if you want to create a comprehensive database.

Related

What is a good way to create random word sequences to make easy-to-dictate identifiers?

I am presently using GUIDs to identify certain data entities. When a customer or colleague needs to relate one of these they can, of course, copy and paste the IDs into instant messages or emails. However, they aren't very pretty and are a pain to describe over the phone or in person.
I came across the PGP cryptography suite in my youth, which includes a word list that been designed to be phonetically distinct, so you won't confuse words when heard over the phone.
https://en.wikipedia.org/wiki/PGP_word_list
I wrote code that turns integer IDs into 'horse battery staple'-style identifiers using this list.
Unfortunately, the PGP word list contains many negative-sounding and distracting words. For example, I don't want a customer to ring up and say "Hey, I'm having trouble with 'drunken-Eskimo-revenge'" (in fact nearly all the random ones I tried were outrageous).
Are there any good alternatives to the PGP word list? Ideally the words should be 'phonically distinct' and inoffensive.

I know a GUID is nearly unique. But is it acceptable practice to assume it is unique?

So I completely understand the mathematical unlikeliness of creating two GUID values with the same number. But is it acceptable practice to assume they are unique?
For example I am working with a system for dealing with medical files. When I began to layout the database structure the manager (Not very technically knowledgeable, but likes to think he is and delegates things that would be better left for the more technically minded to decide) says he wants to use GUID's to separate different medical records instead of INT because it is "More unique". I explained how an INT is always going to be unique because it is sequential. I suggested we use BigINT if it will make him feel more comfortable since there are more numbers in that then if the population of the planet increased to the point people would only fit standing next to one another across the planet, but he is insisting on using GUIDs.
My feeling is although it is NEARLY IMPOSSIBLE for there to be a mix up, when dealing with medical records, why take the chance? What is the advantage of using a GUID vs an INT in this scenario?
But is it acceptable practice to assume it is unique?
Yes. That is the entire purpose of UUID, to be used as a reliable unique identifier without centralized coordination. (A GUID is Microsoft’s variation of a UUID.)
Only you (or your appropriate management) can make the final judgement for your particular project.
But if you truly begin to appreciate the enormity of the numerical range of 12x bits (which is actually incomprehensible to the human mind), then you know you can remove the usage of a properly generated UUID from your list of worries.
By “properly generated” I mean things like using the date-time Versions, or for lower number of values use the random (Version 4) if backed by a cryptographically-strong random number generator. Nearly every modern operating system today includes a UUID generation library. Or you can use the OSSP UUID project. Improperly-generated would include roll-your-own implementations you may see bandied about the inter webs.
As for the suggestion to use a database’s auto-incrementing serial/sequence number, every database person I know with years of real-world experience has been burned by those. I’ve never heard of or read of anyone ever having a collision with properly-generated UUIDs. I'm not saying sequences are necessarily bad or don't have their place, I'm just saying that all I can do is laugh when I hear people turn away from a UUID because of some beyond-astrononomically incomprehensibly minute possibility of a UUID collision and choose a sequence instead.
when dealing with medical records, why take the chance?
Your medical system is far far more likely to fail because of faulty data-entry or other human error with handling records. But do you post 3 clerks on duty to independently triple-enter the same data to reduce that chance of error? No. And that risk is incomprehensibly mathematically more likely to happen than a UUID problem. Yet every medical facility I know of accepts that enormous risk without even thinking about it.
What is the advantage of using a GUID vs an INT
The advantages include:
No need to manage your sequences.Examples include: Resetting for development, test, and production environments. Or when restoring a backup. Or fixing the sequence after faults in the system’s serial generation library (my own experience).
Avoid users’ intuited assumptions being confused about missing numbers in the sequence. I've had that conversation far too often.
Federating data between distributed systems.This is the biggest advantage, each system can act independently yet easily share data back and forth with other systems. Without UUIDs, the administrative overhead and the risk of error are bothersome at first and only grow over time.
Downsides include:
Larger memory and storage usage.Serial numbers are usually 32-bit integers, sometimes 64-bit. A good database with native support for UUID as a data type will use 128 bits.
Less readable by humans.One workaround is to just read several of the first or last digits for casual work.
Possibly less efficient indexing, with very large number of entries.
using an incrementing integer ID ensures only uniqueness within its own domain/type, an advantage of UUIDs/GUIDs is that they uniquely identify the owning thing in the entire universe.
So if you have multiple objects, say MedicalRecord, ID = 5, VaccinationForm, ID = 5 then you need to specify both the type ("medicalRecord" or "vaccinationForm" with the ID value of 5) whereas with a GUID you only need to store a single quanta of information to uniquely identify it.
It can be argued that using GUIDs is a waste of space as they are 16 bytes long (a 128-bit value).
If your system is self-contained and not interfacing with others you might want to use SQL Server's "sequence" concept, where instead of each table storing its own identity sequence, the sequence is maintained for all tables, making it a Locally-Unique ID value. You can use any size integer too.
See here: https://msdn.microsoft.com/en-us/library/ff878091.aspx

'Many to two' relationship

I am wondering about a 'many to two' relationship. The child can be linked to either of two parents, but not both. Is there any way to reinforce this? Also I would like to prevent duplicate entries in the child.
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This question shows that you don't fully understand entity relationships (no rudeness intended). Of which there are four (technically only 3) types below:
One to One
One to Many
Many to One
Many to Many
One to One (1:1):
In this case a table has been broken up into two parts for purposes of complying with normalisation, or more usually the open closed principle.
Normalisation compliance: You might have a business rule that each customer has only one account. Technically, you could in this case say customer and account could all be in the same table, but this breaks the rules of normalisation, so you split them and make a 1:1.
Open-Close principle compliance: A customer table, might have id, first & last names, and address. Later someone decides to add a date of birth and with it the ability to calculate age along with a bunch of other much needed fields. This is an over simplified example of one to one, but you get the main use for it is to extend your database without breaking existing code. Much code written (sadly) is tightly coupled to the database so changes in the structure of a table will break the code. Adding a 1:1 like this will extend the table to meet new requirements without modifying the origional, thereby allowing old code to continue functioning normally and new code to make use of the new db features.
The downside of normalisation and extending tables using 1:1 relationships in this way is performance. Often times on heavly used systems, the first target to increase database performance is de-normalising and combining such tables into a single table, and optimising the indexes thus removing the need to use joins and read from multiple tables. Normalisation / De-Normalisation is neither a good or bad thing, as it depends on the needs of the system. Most systems usually start off normalised changing back when needed, but this change needs to be done very carefully as mentioned, if code is tightly coupled to the DB structure, it will almost definitely cause the system to fail. i.e. When you combine 2 tables, one ceases to exist, all the code that includes that now nonexistant table fails until it is modified (in db terms, imagine connecting relationships to any of the tables in the 1:1, when you remove those tables, this breaks the relationships, and so the structure has to be greatly modified to compensate. Unfortunately, such bad designs are much easier to spot in the DB world than in the software world in most cases and you don't usually notice something went wrong in code until it all falls apart) unless the system is properly designed with separation of concerns in mind.
It the closest thing you can get to inheritance in object oriented programming. But its not quite the same.
One to Many (1:M) / Many to One (M:1):
These two relationships (hense why 4 become 3), are the most popular relationship types. They are both the same type of relationship, the only thing that changes is your point of view. An example A customer has many phone numbers, or alternately, many phone numbers can belong to a customer.
In object oriented programming this would be considered composition. Its not inheritance, but you are saying one item is composed of many parts. This is usually represented with arrays / lists / collections etc. inside of classes as opposed to an inheritance structure.
Many to Many (M:M):
This type of relationship with current technology is impossible. For this reason we need to break it down into two one to many relationships with an "association" table joining them. The many side of the two one to many relationships is always on the association / link table.
For your example, the person who said you need a many to many is correct. Because a two to many is effectively a many (meaning more than one) to many relationship. This is the only way you would get your system to work. Unless you are intending to research the field of relational calculus to find some new type of relationship that would allow this.
Also for such relationships (m2m) you have two choices, either create a compound key in the linker table so the combination of fields become a unique entry (if you are interested in db optimisation this is the slower choice, but takes less space). Alternately, you create a third field with an auto generated id column and make that the primary key (for db optimisation, this is the faster choice, but takes more space).
In your example specifically above...
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This would be a many to many relationship with the phone number table as the linker table between companies and users. As explained, to ensure no phone number is repeated, you simply set it as the primary key or use another primary key and set the phone number field to unique.
For those kind of questions, it is really down to how you phrase them. What is causing you to get confused about this, and how you overcome this confusion to see the solution is simple. Rephrase the problem as follows. Start by asking is it a one to one, if the answer is no, move on. Next ask is it a one to many, if the answer is no move on. The only other option remaining is many to many. Be careful though, ensure you have considered the first 2 questions carefully before moving on. Many inexperienced database people often over complicate issues by defining one to many as many to many. Once again, the most popular type of relationship by far is one to many (I would say 90%) with the many to many and one to one spliting the remaining 10% 7/3 respectevely. But those figures are just my personal perspective, so dont go quoting them as industry standard statistics. My point is to make extra extra sure it is definitely not a one to many before choosing many to many. It is worth the extra effort.
So now to find the linker table between the two, decide which two are your main tables, and what fields need to be shared between them. In this case, company and user tables both need to share the phone. Hense you need to make a new phone table as the linker.
The warning alarm of misunderstanding should show as soon as you decide none of the 3 are working for you. This should be enough to tell you that you simply are not phrasing the relationship question correctly. You will get better at it as time passes, but it is an essential skill and really should be mastered as soon as possible for your own sanaty.
Of course you could also go to an object oriented database which will allow a range of other relationships called "Hierarchacal" relationships. Thats great if you are thinking of becomming a programmer too. But I wouldnt recommend this as it going to make your head hurt when you start finding ways to combine the various types of relationships. Especially given there is not much need since nearly all databases in the world consist of just those 3 types of relationships unless they are something super duper special.
Hope this was a reasonable answer. Thanks for taking the time to read it.
Just make phone number a key in your contact numbers table.
For your phone number example, you would put the phone number in a table by itself, with an ID.
Then you link to that phone_id from each of users and companies.
For your parents example, you don't link the child to parent - instead you link the parent to the child. OR, you put both parents in the same table, and the child just links to one of them.

How to correct the user input (Kind of google "did you mean?")

I have the following requirement: -
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to make kind of Google "Did you mean". This will list all the possible values from my datastore. There is a similar but not same question here. This did not answer my question.
My question: -
1) I think it is not advisable to store those data in RDBMS. Because then I won't have filter on the SQL queries. And I have to do full table scan. So, in this situation how the data should be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose, there is a name Franky in the dataset.
If a user types as Phranky, how do I match the Franky? Do I have to loop through all the names?
I came across Levenshtein Distance, which will be a good technique to find the possible strings. But again, my question is do I have to operate on all 1 million values from my data store?
3) I know, Google does it by watching users behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say distance algorithms. Because the former method will require large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios: -
Users mistyping a word (an edit
distance algorithm)
Users not knowing a word and guessing
(a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
the Bitap Algorithm is designed to find an approximate match in a body of text. Maybe you could use that to calculate probable matches. (it's based on the Levenshtein Distance)
(Update: after having read Ben S answer (use an existing solution, possibly aspell) is the way to go)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window, and
the first query did not yield any results (or the user did not click on anything)
the second query did yield useful results
the two queries are similar (have a small Levenshtein distance)
then the second query is a possible refinement of the first query which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retieval," Proc. 19th Annual Inter. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as a documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names, and select an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using a QWERTY keyboard in English. For example, transposing an e and a q is more likely than transposing an e and an l.
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own, rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2), in the naive case, but we know in order to be matching at all, they have to have a "phoneme" overlap, or may even two. This pre-indexing step (which has a low false positive rate) can take down the complexity a lot (to in the practical case, something like O(30^2 * k^2), where k is << n).
You have two possible issues that you need to address (or not address if you so choose)
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, however if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood a word and only suggest words that are more likely.
My experience in loose matching reinforced a simple but important learning - perform as many indexing/sieve layers as you need and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform edit distance calculation on the smallest possible dataset as it is a very intensive operation.
So if you have an O(n), an O(nlogn), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

My users will import through cut and paste a large string that will contain company names.
I have an existing and growing MYSQL database of companies names, each with a unique company_id.
I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.
Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **
For example, someone writes:
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:
How to find best fuzzy match for a string in a large string database
Matching inexact company names in Java
You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
The drawbacks of SOUNDEX() are:
its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
for MySQL, at least according to the docs, SOUNDEX is broken for unicode input
Example:
SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')
/* all of these return 'M262' */
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.
They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.
Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex
As a response to deficiencies in the
Soundex algorithm, Lawrence Philips
developed the Metaphone algorithm for
the same purpose. Philips later
developed an improvement to Metaphone,
which he called Double-Metaphone.
Double-Metaphone includes a much
larger encoding rule set than its
predecessor, handles a subset of
non-Latin characters, and returns a
primary and a secondary encoding to
account for different pronunciations
of a single word in English.
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.
A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.
There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matching in my blog (also an updated article), but in summary the key issues are:
Looking at the whole string is unhelpful as the most important part
of a Company Name is not necessarily at the beginning of the Company
Name. i.e. ‘The Proctor and Gamble Company’ or ‘United States Federal
Reserve ‘
Abbreviations are common place in Company Names i.e. HP, GM, GE, P&G,
D&B etc..
Some companies deliberately spell their names incorrectly as part of
their branding and to differentiate themselves from other companies.
Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.
Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful Visualisation tool enabling us to review non-exact matches, this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.
Best of Luck!!
Here's a link to the php discussion of the soundex functions in mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.
Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.
EDIT: In response to comment--
Can you at least get the users to put the company names into multiple text boxes; 2. or use an unambigous name delimiter (say backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. Squoosh the spaces out and match for that also (so Micro Soft => microsoft, Bare Essentials => bareessentials); 5. Filter out punctuation; 6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
Test like mad and use the feedback loop from users.
the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/
the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.
We then create several indices on the word table, as follows...
A normal, lowercase index on word + key
An index on the 2nd through 5th character + key
An index on the 3rd through 6th character + key
Alternately, create a SOUNDEX() index on the word column.
Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:
ElasticSearch (Open source, has a ton of features, and so is also complex to operate)
Algolia (Proprietary, but has great docs and super easy to get up and running)
Typesense (Open source, provides the same fuzzy search-as-you-type feature as Algolia)
Check if it's spelled wrong before querying using a trusted and well tested spell checking library on the server side, then do a simple query for the original text AND the first suggested correct spelling (if spell check determined it was misspelled).
You can create custom dictionaries for any spell check library worth using, which you may need to do for matching more obscure company names.
It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.
I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.
Probably been suggested before but why not dump the data out to Excel and use the Fuzzy Match Excel plugin. This will give a score from 0 to 1 (1 being 100%).
I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.
For ROW data its more complex as we had to do a more manual process.