What's the most efficient method or tool to randomize a list of database table columns to obscure sensitive information?
I have a Django application used by several clients, and I need to onboard some development contractors to do work on the application. When they work on bugs (e.g. page /admin/model/123 has an error), ideally they'd need a snapshot of the client database in order to reproduce and fix the bug. However, because they're off-site contractors, I'd like to mitigate risk in the event they expose the client database (unintentionally or otherwise). I don't want to have to explain to a client why all their data's been published online because a foreign contractor left his laptop in an unlocked car.
To do this, I'd like to find or write a tool to "randomize" sensitive fields in the database, like usernames, email addresses, account numbers, company names, phone numbers, and so on, so that the structure of the data is maintained but all personally identifiable information is removed.
Presumably, this is a task that many other people have had to do, but I'm not sure what the technical term is, so I'm not finding much through Google. Are there any existing tools to do this with a Django application running a MySQL or PostgreSQL backend?
Anonymize and sanitize are good words for this chore.
It's relatively easy to do. Use queries like
UPDATE person
SET name = CONCAT('Person', person_id),
    email = CONCAT('person', person_id, '@example.com')
and so forth, to stomp actual names and emails and all that. It's helpful to preserve the uniqueness of entries, and the autoincrementing IDs of various tables can help you do that.
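In a Django app, the same idea can be wrapped in a management command so it is repeatable for every snapshot. A minimal sketch, assuming a hypothetical Person model with name and email fields (adjust the app, model, and field names to your schema):

# myapp/management/commands/anonymize.py (hypothetical path)
from django.core.management.base import BaseCommand
from myapp.models import Person  # assumed model

class Command(BaseCommand):
    help = "Overwrite PII with placeholders derived from each row's primary key"

    def handle(self, *args, **options):
        for person in Person.objects.all().iterator():
            # Reusing the primary key keeps the values unique per row
            person.name = f"Person{person.pk}"
            person.email = f"person{person.pk}@example.com"
            person.save(update_fields=["name", "email"])

Run it against a copy of the database, never the live one.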
(Adding this as an answer, as I am not allowed to comment yet.)
As Cerin said, O. Jones's approach for anonymizing/sanitizing works for simple fields, but not for more complicated ones like addresses, phone numbers, or account numbers that need to match a specific format. However, the method can be modified to allow this too.
Let's take a phone number with the format aaa-bbbb-ccc as an example and use the autoincrementing person_id as the source of unique numbers. For the ccc part of the phone number, use MOD(person_id, 1000), which gives the remainder of person_id divided by 1000. For bbbb, take MOD((person_id - MOD(person_id, 1000)) / 1000, 10000). It looks complicated, but all it does is take person_id, strip the last three digits (which were used for ccc), divide by 1000, and keep the last four digits of the result for bbbb. I think you'll be able to figure out how to calculate aaa.
The three parts of the phone number can then be concatenated to give the complete phone number: CONCAT(aaa,"-",bbbb,"-",ccc)
(You might have to explicitly convert the numbers to string, I'm not sure)
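A quick way to sanity-check that arithmetic is to reproduce it outside SQL. Here is a small Python sketch of the same digit extraction (purely illustrative; the aaa-bbbb-ccc format is the one invented above):

def fake_phone(person_id):
    # Derive a deterministic, unique phone number from the primary key
    ccc = person_id % 1000                  # last three digits
    bbbb = (person_id // 1000) % 10000      # next four digits
    aaa = (person_id // 10_000_000) % 1000  # remaining digits
    return f"{aaa:03d}-{bbbb:04d}-{ccc:03d}"

print(fake_phone(1234567890))  # -> 123-4567-890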
I have a database that has a table email_eml that stores 3 attributes: name_eml, host_eml and domain_eml, which store the email name, the name of the website, and the domain name (like .com, .net etc). It doesn't store the @ or the . in any of the columns.
This allows me some flexibility (for example, checking average name length (before the @ symbol) will be faster). I can collect some statistics on email names, and I can also create usernames from the name_eml attribute.
It is, however, also a burden to handle when people are submitting their email or when I have to compare a whole email address.
This will make me store the additional @ and . symbols and make me separate the name through a script when I want to collect statistics.
I wonder if it's better to store the email in a single column instead of the 3 columns.
Is one of the ways more proper, or more normalized?
I would like the answer to include pros and cons of both approaches to storing email addresses (even if storing the emails in 3 columns doesn't have many pros).
It doesn't store the @ or a . in any of the variables.
Well, it should; cat.call@somedomain.com is a legal email address.
I wonder if it's better to store the email in a single column instead of the 3 columns. Is one of the ways more proper or more normalized?
This doesn't have anything to do with normalization. It has to do with complex data types.
The relational model allows arbitrarily complex data types. A commonly used complex data type is a timestamp, which typically includes year, month, day, hour, minute, second, and microsecond.
Given a timestamp, sometimes you might need to know only the date, and sometimes you might need to know only the year or only the hour. The relational model imposes a specific burden on the dbms when dealing with complex data types. For a complex data type, the dbms is required either to return it in its entirety, or to provide functions that return its various parts. The point is that, if a user wants only the hour out of a timestamp, the user doesn't write code to get it.
SQL dbms have good support for timestamps; every dbms that I'm familiar with provides functions that return various parts of timestamps. None of them have native support for email addresses.
On a SQL platform, you have at least two alternatives to keep your database close to the relational model. You can write functions that can be incorporated into the database server (if your dbms and your programming skill allows that), or you can split up the data type into pieces so each can be addressed in its entirety like any other value.
While there are probably some data types that make sense to split like that (street addresses might be one of them), I don't really see any compelling reason to split an email address.
This allows me some flexibility (for example, checking average name length (before the @ symbol) will be faster). I can collect some statistics on email names, I can also create usernames from the name_eml attribute.
While that's true, right now I can't imagine anything at all interesting about the average length of a username. I don't find any of your reasons compelling, but you know more about your application than I do.
If you really need to do a lot of operations on the pieces, it makes more sense to keep the pieces separate. More "normal" client code should access the email addresses through a view that concatenates the pieces. (Concatenation is a lot easier than parsing an email address at run time.)
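If you do keep the pieces separate, the concatenating view is a one-liner. A runnable sketch using Python's sqlite3 to keep it self-contained (in MySQL you would CREATE VIEW with CONCAT() instead of the || operator; the table uses the question's column names):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email_eml (name_eml TEXT, host_eml TEXT, domain_eml TEXT);
    INSERT INTO email_eml VALUES ('cat.call', 'somedomain', 'com');
    -- "Normal" client code reads full addresses through the view
    CREATE VIEW email_full AS
        SELECT name_eml || '@' || host_eml || '.' || domain_eml AS email
        FROM email_eml;
""")
print(conn.execute("SELECT email FROM email_full").fetchone()[0])
# -> cat.call@somedomain.com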
It's extremely rare to store email addresses in three columns. If you want to do something like search on the part of the email before the @ symbol, you could just use a LIKE query...
SELECT email FROM people WHERE email LIKE 'john.smith@%';
I'd be interested to hear of any real-life examples that aren't possible to do with an SQL query.
In terms of normalization, once you break apart common aspects (such as host and especially top-level domain), they should be modeled as foreign relationships. So you end up with three tables:
emailNames
emailHosts
emailTLDs
emailNames then has three columns:
emailName
hostID
tldID
Note that I used "TLD", as this is likely the only part with significant overlap between hostnames, and hostnames can themselves contain "." characters before the TLD begins.
Today I have a table containing:
Table a
--------
name
description
street1
street2
zipcode
city
fk_countryID
I'm having a discussion what is the best way to normalize this in terms of quickest search. E.g. find all rows filtered by city or zipcode. Suggested new structure is this:
Table A
--------
name
description
fk_streetID
streetNumber
zipcode
fk_countryID
Table Street
--------
id
street1
street2
fk_cityID
Table City
----------
id
name
Table Country
-------------
id
name
The discussion is about having only one field for street name instead of two.
My argument is that having two fields is considered normal for supporting international addresses.
The counter-argument is that two fields come at the cost of search performance and possible duplication.
I'm wondering what is the best way to go here.
UPDATE
I'm aiming at having 15,000 brands associated with 50,000 stores, where 1,000 users will do multiple searches each day by web and iPhone. In addition, I will have third parties fetching data from the DB for their sites.
The site is not launched yet, so we have no idea of the workload, and we'll only have around 1,000 brands associated with around 4,000 stores when we start.
My standard advice (from years of data warehouse /BI experience) here is:
always store the lowest level of broken out detail, i.e. the multiple fields option.
In addition to that, depending on your needs you can add indexes or even a compound field that is the other two fields concatenated; make sure to maintain it with a trigger rather than manually, or you will have data synchronization and quality problems.
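To illustrate the trigger idea, here is a minimal sketch using Python's sqlite3 (the table and column names are invented; MySQL trigger syntax differs slightly, and a real setup would need an UPDATE trigger as well):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (street1 TEXT, street2 TEXT, full_street TEXT);
    -- Keep the compound column in sync automatically, never by hand
    CREATE TRIGGER address_insert AFTER INSERT ON address
    BEGIN
        UPDATE address
        SET full_street = NEW.street1 || ' ' || COALESCE(NEW.street2, '')
        WHERE rowid = NEW.rowid;
    END;
""")
conn.execute("INSERT INTO address (street1, street2) VALUES ('12 Main St', 'Suite 4')")
print(conn.execute("SELECT full_street FROM address").fetchone()[0])
# -> 12 Main St Suite 4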
Part of the correct answer for you will always depend on your actual use. Can you ever anticipate needing the address in a standard (2-line) format for mailing, or for exchange with other entities? Or is this purely a read-only database, set up for inquiries and not used for more standard address needs such as mailings?
At the end of the day, if you have issues with query performance you can add additional structures such as compound fields, indexes, and even other tables with the same data in a different form. There are also options for caching at the server level if performance is slow. If you are building a complex or traffic-intensive site, chances are you will end up with a product to help anyway; for example, in the Ruby world people use Thinking Sphinx. If query performance is still an issue and your data is growing, you may ultimately need to consider non-SQL solutions like MongoDB.
One final principle that I also adhere to: think about the people updating data, if that will occur in this system. When people input data initially and then subsequently go to edit that information, they expect the information to be "the same", so any manipulation done internally that actually changes the form or content of the user's input will become a major headache when you try to allow them to do a simple edit. I have seen insanely complicated algorithms for encoding and decoding data in this fashion, and they frequently have issues.
I think the topmost example is the way to go, maybe with a third free-form field:
name
description
street1
street2
street3
zipcode
city
fk_countryID
The only things you can normalize halfway sanely for international addresses are the zip code (which needs to be a free-form field, though) and the city. Street addresses vary far too much.
Note that high normalization means more joins, so it won't yield faster searches in every case.
As others have mentioned, address normalization (or "standardization") is most effective when the data is together in one table but the individual pieces are in separate columns (like your first example). I work in the address verification field (at SmartyStreets), and you'll find that standardizing addresses is a really complex task. There's more documentation on this task here: https://www.smartystreets.com/Features/Standardization/
With the volume of requests you'll be handling, I highly suggest you make sure the addresses are correct before you deploy. Process your list of addresses and remove duplicates, standardize formats, etc. A CASS-Certified vendor (such as SmartyStreets, though there are others) will provide such a service.
I'm working on an app I'd like to sell some day -- sooner rather than later! I'd like to develop a reasonably simple serial number scheme to protect it.
A simple number/letter combination not more than 25-30 alphanumeric characters long (think Microsoft product keys)
Does not require the user to enter any personal information (like an email address) as part of the verification
I've been thinking about this a (very little) bit, and I think public key cryptography is a good place to start. I could generate a string that identifies the license (like SKU + plain ole' integral serial number), hash it, encrypt it, and encode the serial number + identifier into a 25 digit (or so) alphanumeric key. The app would then decode the key into a serial number and "signature", generate an identifier hash, decrypt the "signature" using a corresponding public key and compare it against the generated identifier hash.
Essentially, the product key carries two pieces of data: the serial number the user claims to own plus a signature of sorts the program can use to verify that claim. I don't know if 25 alphanumeric characters (which encode 5 bits each for a realistic total of 120 bits) is enough for all this. But, it doesn't have to be cryptographically secure, just enough that the codes aren't easily guessable. I'm OK with short key lengths and short hashes.
As far as implementation goes, the app is written in Objective-C for Mac OS X, but given how easy it is to inject code into Cocoa apps, I'll probably write the verification code in straight C.
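For what it's worth, the sign-and-verify flow sketched above looks roughly like this with the Python cryptography package's Ed25519 API, shown in Python for brevity (the SKU string is invented, and Ed25519's 64-byte signatures are far too long for a 25-character key, which is exactly the size tension described; a shipping scheme would need a much shorter signature or a truncated MAC):

import base64
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Vendor side: sign the serial with the private key
private_key = Ed25519PrivateKey.generate()
serial = b"SKU123-000042"  # hypothetical SKU + serial number, 13 bytes
signature = private_key.sign(serial)
product_key = base64.b32encode(serial + signature).decode()

# App side: split the key apart and verify with the corresponding public
# key (which the app would embed)
decoded = base64.b32decode(product_key)
claimed_serial, claimed_sig = decoded[:13], decoded[13:]
private_key.public_key().verify(claimed_sig, claimed_serial)  # raises if forged
print("valid key for", claimed_serial.decode())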
I would not use any strong cryptography, since you have to decrypt it in the program anyway, making keygens or at least cracks easy to produce.
I would do the following: take, say, a 25-digit number. Now add some rules, such as:
- the number must be divisible by 31
- it must start and end with the same letter
...
Always generate keys using these rules. Use 20 rules or more (the more the better). When deploying the app, use a smaller number of rules, e.g. 10, to check whether a key is valid.
The rules that ship in the binary will eventually be disassembled and used to create a keygen.
On every update, enable one of the rules you didn't use before. If the rules are selected correctly, you will invalidate most of the keys generated by keygens.
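A minimal sketch of the scheme in Python (the two rules are the ones invented above; a real generator would hold 20 or more rules and the shipped binary would check only a subset):

# Full rule set, known only to the vendor's key generator
RULES = [
    lambda key: int(key[1:-1]) % 31 == 0,  # middle digits divisible by 31
    lambda key: key[0] == key[-1],         # first and last characters match
]

# The deployed app checks only the rules listed in active_rules;
# later updates enable indices that keygens haven't reverse-engineered yet
def is_valid(key, active_rules):
    return all(RULES[i](key) for i in active_rules)

print(is_valid("A0000000000000000000000062A", [0, 1]))  # True: 62 % 31 == 0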
I like @bh213's method; however, it isn't going to prevent the keygens from being fixed as you update your serial number rules.
On a more personal preference note, I prefer the key generator based on a rule set method because if hackers have to patch a binary, you stand to get a bad review on your software because of a bad hacker patch and the ensuing battle between you and hackers.
My preference is based on a universal software truth: Software can and will be hacked if it is popular enough, there is no scheme, no ingenious method that will prevent this from happening. This battle is between the developer who has limited resources and limited time and a hacker group with unlimited time on their hands.
Your key generation scheme is really only to keep the honest customers honest - it's easier to get a check cut from Accounts Payable than it is to get Security to sign off on a key generator.
Redbeard 0x0A is correct: CD keys are to keep honest customers honest. It just needs to be slightly less difficult to buy your product than to find a keygen.
If you are selling your product online, then the best way to do it is to give them a file containing the serial. This way the serial can be however long you want and your paying customers don't have to waste time entering a serial.
A serial scheme can be very simple:
Have a large serial space (25 alphanumeric characters is about 10^44, but just use a serial file and do 80 chars x 16 rows for a key space of 10^2311)
Select a few rules to reduce valid serial number space to 100 times what you think you will sell in your wildest dreams
If your product has an online component (like how games have multiplayer online), you could further reduce valid serial numbers using a cryptographically strong random number generator (to select a subset of the rule based keys, product uses rule to check serial, server uses final true serial list). When your product requests service from your server, the server can check the serial.
All rules, whether custom-developed or encryption-based, can be found out and cracked. Think about who you're building your application for. Many of my products are for businesses, so they're either going to buy it or they won't. They're typically not going to run a hacked version on their networks. People looking for a keygen aren't likely to purchase your product regardless. You just want to make sure you don't annoy your customers to the point where they no longer want to buy your app.
That being said, I've written a library for this sort of thing to use in my applications based on AES encryption. I'm selling it for $25, and it uses a passphrase and a salt to make your serial number unique. If you're interested you can find it here: http://simpleserials.com
There is a blog post on sigpipe.macromates.com explaining how to use private/public key crypto for checking a serial number. It can verify that the user and serial number match (signing/verifying). I would probably add some salt, just to be sure.
As that post is from 2004, you should consider the currently recommended key lengths at keylength.com.
What exactly is a GUID? Why and where should I use it?
I've seen references to GUIDs in a lot of places, and on Wikipedia, but it is not very clear about where to use one.
If someone could answer this, it would be nice.
Thanks
GUID technically stands for "globally unique identifier". In practice, it is a 128-bit structure that is unlikely to ever repeat or create a collision. If you do the math, the domain of values is in the undecillions.
Use GUIDs when you have multiple independent systems or clients generating IDs that need to be unique.
For example, if I have 5 client apps creating and inserting transactional data into a table that has a unique constraint on the ID, then I use GUIDs. This avoids having to force a client to request an issued ID from the server first.
This is also great for object factories and systems that have numerous object types stored in different tables where you don't want any 2 objects to have the same ID. It makes caching and scavenging schemes much easier to implement.
A GUID is a "Globally Unique IDentifier". You use it anywhere that you need an identifier that is guaranteed to be different from every other one.
Usually, you only need a value to be "locally unique" -- the primary key identity in a database table, for example, needs only be different from the other rows in that table, but can be the same as the ID in other tables. (There is no need for a GUID here.)
GUIDs are generally used when you will be defining an ID that must be different from an ID that someone else (outside of your control) will be defining. One such place is the interface identifier on ActiveX controls. Anyone can create an ActiveX control, without knowing which other controls it will be used alongside --- and there's nothing to stop everyone from giving their controls the same name. GUIDs keep them distinct.
GUIDs are a combination of the time (in very small fractions of a second, so a GUID is assured to be different from any GUID defined before or after it) and a number defining your location (sometimes taken from the MAC address of your network card, so it's assured to be different from any other GUID defined right now by someone else).
They are also sometimes known as UUIDs (universally unique IDs).
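Both flavors map directly onto Python's standard uuid module, shown here as a quick illustration:

import uuid

# Version 1: built from the current timestamp and a node ID (often the MAC)
print(uuid.uuid1())

# Version 4: 122 random bits; the usual choice when traceability is unwanted
print(uuid.uuid4())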
As an addition to all the other answers, here is an online GUID generator:
http://www.guidgenerator.com/
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128 bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year, the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs, there would only be a 50% probability of a duplicate.
How are GUIDs used?
GUIDs are used in software development as database keys, component identifiers, or just about anywhere else a truly unique identifier is required. GUIDs are also used to identify all interfaces and objects in COM programming.
A GUID is a "Globally Unique ID". Also called a UUID (Universally Unique ID).
It's basically a 128-bit number that is generated in a way (see RFC 4122, http://www.ietf.org/rfc/rfc4122.txt) that makes it nearly impossible for duplicates to be generated. This way, I can generate GUIDs without some third-party organization having to give them to me to ensure they are unique.
One widespread use of GUIDs is as identifiers for COM entities on Windows (classes, typelibs, interfaces, etc.). Using GUIDs, developers could build their COM components without going to Microsoft to get a unique identifier. Even though identifying COM entities is a major use of GUIDs, they are used for many things that need unique identifiers. Some developers will generate GUIDs for database records to provide them an ID that can be used even when they must be unique across many different databases.
Generally, you can think of a GUID as a serial number that can be generated by anyone at anytime and they'll know that the serial number will be unique.
Other ways to get unique identifiers include getting a domain name. To ensure the uniqueness of domain names, you have to get it from some organization (ultimately administered by ICANN).
Because GUIDs can be unwieldy (from a human readable point of view they are a string of hexadecimal numbers, usually grouped like so: aaaaaaaa-bbbb-cccc-dddd-ffffffffffff), some namespaces that need unique names across different organization use another scheme (often based on Internet domain names).
So, the namespace for Java packages by convention starts with the organization's domain name (reversed), followed by names that are determined in some organization-specific way. For example, a Java package might be named:
com.example.jpackage
This means that dealing with name collisions becomes the responsibility of each organization.
XML namespaces are also made unique in a similar way - by convention, someone creating an XML namespace is supposed to make it 'underneath' a registered domain name under their control. For example:
xmlns="http://www.w3.org/1999/xhtml"
Another way that unique IDs have been managed is for Ethernet MAC addresses. A company that makes Ethernet cards has to get a block of addresses assigned to them by the IEEE (I think it's the IEEE). In this case the scheme has worked pretty well, and even if a manufacturer screws up and issues cards with duplicate MAC addresses, things will still work OK as long as those cards are not on the same subnet, since beyond a subnet, only the IP address is used to route packets. Although there are some other uses of MAC addresses that might be affected - one of the algorithms for generating GUIDs uses the MAC address as one parameter. This GUID generation method is not as widely used anymore because it is considered a privacy threat.
One example of a scheme to come up with unique identifiers that didn't work very well was the Microsoft-provided IDs for 'VxD' drivers in Windows 9x. Developers of third-party VxD drivers were supposed to ask Microsoft for a set of IDs to use for any drivers the third party wrote. This way, Microsoft could ensure there were no duplicate IDs. Unfortunately, many driver writers never bothered, and simply used whatever ID was in the example VxD they started from. I'm not sure how much trouble this caused - I don't think VxD ID uniqueness was absolutely necessary, but it probably affected some functionality in some APIs.
GUID or UUID (globally vs. universally unique identifier) is, well, a unique ID. :) When you need something really unique and machine-generated, there are libraries to get you one.
See GUID on wikipedia for details.
As for when you don't need a GUID: whenever a counter that you control (one way or another, like a SERIAL SQL type or a sequence) gets incremented. Indexing a "text" value (a GUID in textual form) or a 128-bit binary value (which is what a GUID is) is far more expensive than indexing an integer.
Someone said they are conceptually 128-bit random values, and that is substantially true, but having done a little reading on UUID (GUID usually refers to Microsoft's implementation of UUID), I see that there are several different UUID versions, and most of them are not actually random. So it is possible to generate a UUID for a machine (or something else) and be able to reliably repeat that process to obtain the same UUID down the road, which is important for some applications.
For me it's easier to think of them as simply "128-bit random values". Which is essentially what they are. There are some algorithms for including a bit of information in a few digits of your GUID (thus the random part gets a bit smaller), but still they are pretty large almost-random values.
Since they are so large, it is extremely unlikely that two GUIDs will ever be generated that are the same. For all practical purposes, every GUID ever generated is unique in the world.
I'll leave it to you to figure out where to use them, but other answers already have some examples. Let your imagination run wild. :)
GUIDs can be a hard thing to understand because of all the math that goes on behind generating them. Think of one as a unique ID. You can get Visual Studio to generate one for you, or .NET if you happen to be using C#, or one of the many other applications or websites. They are considered unique because there is such a vanishingly small chance of seeing the same one twice that it isn't worth considering.
A GUID is a 128-bit globally unique ID. You can generate GUIDs from now until sunset and you will never generate the same GUID twice, and neither will anyone else. They are used a lot with COM.
As an example of something you would use them for: we use them in one of our products. Our users can generate categories and cards on various devices. We want to make sure that we don't confuse a category made on one device with a category created on a different one, so it's important that IDs are unique no matter who generates them, where they generate them, and when they generate them. So we use GUIDs (actually we use our own scheme using 64-bit numbers, but they are similar to GUIDs).
I worked on an ACD call center system a few years back where we wanted to gather call detail records from multiple call processors into a single database. I set up a column in MS SQL to generate a GUID for the database key rather than using a system-generated sequential ID (identity column). Back then, this required setting the default value to NEWID() (or generating it in the code, but the NEWID() function was safer). Of course, having a large value for a key may raise a few eyebrows, but I would rather give up the space than risk a collision.
I didn't see anyone address using a GUID as a database key so I thought it might help to know you could do that too.
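If you generate the key in application code rather than with a database default like NEWID(), the idea looks like this in Python (the table is invented, and sqlite3 is used only to keep the sketch self-contained):

import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call_detail (id TEXT PRIMARY KEY, detail TEXT)")

# Each call processor can mint its own key with no central coordination
conn.execute(
    "INSERT INTO call_detail (id, detail) VALUES (?, ?)",
    (str(uuid.uuid4()), "record from processor 3"),
)
print(conn.execute("SELECT id FROM call_detail").fetchone()[0])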
GUID stands for "Globally Unique Identifier" and you use it when you want to have, erm, a Globally Unique Identifier.
In RSS feeds, for example, you should have a GUID for each item in the feed. That way, the feed reader software can keep track of whether you have read that item or not. Without a GUID, it would be impossible to tell.
A GUID differs from something like a database ID in that no matter who creates an object -- you, me, the guy down the street -- our GUIDs will always be different. There should be no collisions using a GUID.
You'll also see the term UUID, which stands for "Universally Unique Identifier." There is essentially no difference between the two. UUID is the more appropriate term. GUID is the term used by Microsoft.
If you need to generate an identifier that needs to be unique during the whole lifetime of your application, you use a GUID.
Imagine you have a server with sessions, if you give each session a GUID, you are certain that it will be unique for every session ever created by your server. This is useful for tracing bugs.
One particularly useful application of GUIDs that I've found is using them to track unique visitors in webapps where the visitors are anonymous (i.e. not logged in or registered).
GUID = Globally Unique IDentifier.
Use it when you want to uniquely identify something in a global context.
This generator can be handy.
The Wikipedia article on GUIDs is pretty clear on what they are used for - maybe rephrasing your question would help - what do you need a GUID for?
To actually see what one looks like on a Windows computer, open cmd or PowerShell:
Powershell => [guid]::NewGuid()
CMD => powershell [guid]::NewGuid()
My users will import, through cut and paste, a large string that will contain company names.
I have an existing and growing MySQL database of company names, each with a unique company_id.
I want to be able to parse through the string and assign each of the user-entered company names a fuzzy match.
Right now, just doing a straight-up string match is also slow. Will Soundex indexing be faster? How can I give the user some options as they are typing?
For example, someone writes:
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the posters have not accepted an answer and I'm not sure if their use case is applicable:
How to find best fuzzy match for a string in a large string database
Matching inexact company names in Java
You can start with SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
The drawbacks of SOUNDEX() are:
its inability to differentiate longer strings. Only the first few characters are taken into account; longer strings that diverge at the end generate the same SOUNDEX value
the fact that the first letter must be the same or you won't find a match easily. SQL Server has the DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in
for MySQL, at least according to the docs, SOUNDEX is broken for Unicode input
Example:
SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')
/* all of these return 'M262' */
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
The main drawback is that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX value in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
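For prototyping a threshold before committing to a stored function, here is a short pure-Python version of the same distance (the standard dynamic-programming formulation, not tied to any particular library):

def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance DP, computed one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("Microsoft", "Nzcrosoft"))  # -> 2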
SOUNDEX is an OK algorithm for this, but there have been more recent advances on this topic. Another algorithm called Metaphone was created, and it was later revised into the Double Metaphone algorithm. I have personally used the Java Apache Commons implementation of Double Metaphone, and it is customizable and accurate.
They have implementations in lots of other languages on the Wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with SOUNDEX appearing in your application, it's nice to know there are options. Sometimes SOUNDEX generates the same code for two really different words; Double Metaphone was created to help take care of that problem.
Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex
As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
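Usage from Python looks something like this, assuming the Metaphone package from PyPI (pip install Metaphone); the exact function name and return shape may differ between implementations:

from metaphone import doublemetaphone  # assumed package

# Returns a (primary, secondary) encoding tuple; store the encodings next to
# each company name and compare those instead of the raw strings
print(doublemetaphone("Bare Essentials"))
print(doublemetaphone("Bare Escentuals"))  # similar names yield close encodings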
Firstly, I would like to add that you should be very careful when using any form of phonetic/fuzzy matching algorithm, as this kind of logic is exactly that: fuzzy, or to put it more simply, potentially inaccurate. This is especially true when matching company names.
A good approach is to seek corroboration from other data, such as address information, postal codes, telephone numbers, geo-coordinates, etc. This will help confirm the probability of your data being accurately matched.
There are a whole range of issues related to B2B data matching, too many to address here; I have written more about company name matching on my blog (also an updated article), but in summary the key issues are:
Looking at the whole string is unhelpful, as the most important part of a company name is not necessarily at the beginning of it, e.g. 'The Proctor and Gamble Company' or 'United States Federal Reserve'.
Abbreviations are commonplace in company names, e.g. HP, GM, GE, P&G, D&B, etc.
Some companies deliberately spell their names incorrectly as part of their branding and to differentiate themselves from other companies.
Matching exact data is easy, but matching non-exact data can be much more time-consuming, and I would suggest that you consider how you will validate the non-exact matches to ensure they are of acceptable quality.
Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful visualisation tool enabling us to review non-exact matches; this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.
Best of Luck!!
Here's a link to the PHP discussion of the Soundex functions in MySQL and PHP. I'd start from there, then expand into your other not-so-well-defined requirements.
Your reference mentions the Levenshtein methodology for matching. Two problems: 1. it's more appropriate for measuring the difference between two known words, not for searching, and 2. it discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") than spelling errors (where the user doesn't know how to spell, say, "Levenshtein" and types in "Levinstein"). I usually associate it with looking for a phrase in a book rather than a key value in a database.
EDIT: In response to a comment --
1. Can you at least get the users to put the company names into multiple text boxes?
2. Or use an unambiguous name delimiter (say a backslash)?
3. Leave out articles ("The") and generic abbreviations (or you can filter for these).
4. Squoosh the spaces out and match on that as well (so Micro Soft => microsoft, Bare Essentials => bareessentials).
5. Filter out punctuation.
6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
Test like mad and use the feedback loop from users.
The best function for fuzzy matching is levenshtein(). It's traditionally used by spell checkers, so that might be the way to go. There's a UDF for it available here: http://joshdrew.com/
The downside to using Levenshtein is that it won't scale very well. A better idea might be to dump the whole table into a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
This approach gives you indexed lookups of almost any entity using input of 2 or 3 characters or more.
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy-searched. This process will extract every individual word from the original column and write these words to the word table, along with the original key. During this process, commonly occurring words like 'the', 'and', etc. should be discarded.
We then create several indices on the word table, as follows...
A normal, lowercase index on word + key
An index on the 2nd through 5th character + key
An index on the 3rd through 6th character + key
Alternately, create a SOUNDEX() index on the word column.
Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do LIKE %input, as we are always looking for a match starting somewhere in the first 3 characters, which are all indexed.
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
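The lookup side of this scheme might look as follows in Python (the word_index table and its contents are invented names following the description above; sqlite3 keeps the sketch self-contained):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_index (word TEXT, key INTEGER)")
conn.executemany("INSERT INTO word_index VALUES (?, ?)",
                 [("polycom", 1), ("microsoft", 2), ("bare", 3), ("escentuals", 3)])

def candidate_keys(user_input):
    # Prefix matching only (word LIKE 'frag%'), never LIKE '%frag', so an
    # index on word stays usable; the offset-substring indexes described
    # above extend this to matches starting at the 2nd or 3rd character
    frag = user_input.lower()
    rows = conn.execute(
        "SELECT DISTINCT key FROM word_index WHERE word = ? OR word LIKE ?",
        (frag, frag + "%"),
    )
    return [r[0] for r in rows]

print(candidate_keys("micro"))  # -> [2]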
Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:
ElasticSearch (Open source, has a ton of features, and so is also complex to operate)
Algolia (Proprietary, but has great docs and super easy to get up and running)
Typesense (Open source, provides the same fuzzy search-as-you-type feature as Algolia)
Check whether it's spelled wrong before querying, using a trusted and well-tested spell checking library on the server side, then do a simple query for the original text AND the first suggested correct spelling (if the spell check determined it was misspelled).
You can create custom dictionaries for any spell check library worth using, which you may need to do for matching more obscure company names.
It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.
I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.
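A sketch of that check-then-query flow, assuming the pyspellchecker package (pip install pyspellchecker) loaded with a custom dictionary of known company names; the package choice and names are assumptions, not part of the original answer:

from spellchecker import SpellChecker  # assumed: the pyspellchecker package

# Start from an empty dictionary and load known company names into it
spell = SpellChecker(language=None)
spell.word_frequency.load_words(["microsoft", "polycom", "logitech"])

def terms_to_query(user_input):
    # Query for the original text AND the first suggested correction, if any
    terms = [user_input]
    suggestion = spell.correction(user_input.lower())
    if suggestion and suggestion != user_input.lower():
        terms.append(suggestion)
    return terms

print(terms_to_query("Microsift"))  # e.g. ['Microsift', 'microsoft']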
This has probably been suggested before, but why not dump the data out to Excel and use the Fuzzy Match Excel plugin? This will give a score from 0 to 1 (1 being 100%).
I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.
For ROW (rest-of-world) data it's more complex, as we had to follow a more manual process.