Deal with semicolon separated data in MySQL table - mysql

I've got MySQL DB with multiple data in one column separated by semicolon. I need to use the first of them. What is the best recommended way how to deal with this kind of stored data? (for this specific problem and also generally how to use semicolon separated data).

Q: "What is the best recommended way how to deal with this kind of stored data?"
A: The best recommendation is to avoid storing data as comma separated lists. (And no, this does not mean we should use semicolons in place of commas as delimiters in the list.)
For an introductory discussion of this topic, I recommend a review of Chapter 2 in Bill Karwin's book: "SQL AntiPatterns: Avoiding the Pitfalls of Database Programming"
Which is conveniently available here
https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
and from other fine booksellers. Setting that recommendation aside for a moment...
To retrieve the first element from a semicolon delimited list, we can use the SUBSTRING_INDEX function.
As a demonstration:
SELECT SUBSTRING_INDEX('abc;def;ghi',';',1)
returns
'abc'
The MySQL SUBSTRING_INDEX function is documented here: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_substring-index
I recognize that this might be considered a "link only answer". A good answer to this question is going to be much longer, giving examples to demonstrate the pitfall of storing comma separated lists.
If the database will only ever view the comma separated list as a blob of data, in its entirety, without a need to examine the contents of the list, then I would consider storing it, similar to the way we would store a .jpg image in the database.
I would store and retrieve a .jpg image as a BLOB, just a block of bytes, in its entirety. Save the whole thing, and retrieve the whole thing. I'm not ever going to have the database manipulate the contents of the image. I'm not going to ever ask the database to examine the image to discern information about what is "in" the jpg image. I'm not going to ask the database to derive any meaningful information out of it... How many people are in a photo, what are the names of people in a photo, add a person to the photo, and so on.
I will only condone storing a comma separated (or semicolon separated) separated list if we are intending it to be an object, an opaque block of bytes, like we handle a jpg image.

Use
SELECT SUBSTRING_INDEX(column_name, ';', 1) from your_table

Related

SQL: What are all the delimiters?

I am running some php on my website pulling content from an SQL database. Queries are working and all is good there. Only problem I am having is that when I upload my dataset (.csv format) using phpmyadmin, an error occurs dropping all contents after a certain row. Supposedly this is caused by SQL recognizing more columns in that specific row than intended. And unfortunately, this is not just a single occurrence. I cannot seem to find out exactly what the problem is but most likely it is caused by some values in the column 'description' containing delimiters that split it into multiple columns. Hopefully, by deleting/replacing all these delimiters the problem can be solved. I am rather new to SQL though and I cannot seem to find a source that simply lays out what all the potential delimiters are that I should consider. Is there somebody that can help me out?
Thank you in advance and take care!
Regards
From personal experience, once can delimit by many different things. I've seen pipes | and commas , as well as tab, fixed space, tildas ~ and colons :.
Taken directly from https://en.wikipedia.org/wiki/Delimiter-separated_values:
"Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used.[5][6] Despite that each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.[citation needed]
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision, others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or only quoting those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; therefore, there are sometimes problems with the files being corrupted when people try to edit them by hand. Another set of problems occur due to errors in the file structure, usually during import of file into a database (in the example above, such error may be a pupil's first name missing).
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With rising prevalence of web sites and other applications that store snippets of code in databases, simply using a " which occurs in every hyperlink and image source tag simply isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere."

Obtaining the average of one field with comma delimited values (InfoPath)

I have a field where the user enters multiple values each separated by a comma eg "1.8, 2, 3".
I want to find the average of those values. Is there a way to utilise avg() to accommodate for stripping the comma and producing the mean?
Unfortunately you can't do that with the built in InfoPath functions (there is no traditional split method for strings).
If you are willing to tackle it - using managed code behind the form will very easily solve your problem (only about 4 lines of code). Basic math and string manipulation should not impose any security restrictions on the form. However you will have to setup for code behind which is easy but can seem like somewhat of a hassle the first time you try it. There are good MSDN articles on how to go about that.
Alternatively, if you can change your data entry from comma separated to a repeating table you can use the built in avg() function.

Store Miscellaneous Data in DB Table Row

Let's assume I need to store some data of unknown amount within a database table. I don't want to create extra tables, because this will take more time to get the data. The amount of data can be different.
My initial thought was to store data in a key1=value1;key2=value2;key3=value3 format, but the problem here is that some value can contain ; in its body. What is the best separator in this case? What other methods can I use to be able to store various data in a single row?
The example content of the row is like data=2012-05-14 20:07:45;text=This is a comment, but what if I contain a semicolon?;last_id=123456 from which I can then get through PHP an array with corresponding keys and values after correctly exploding row text with a seperator.
First of all: You never ever store more than one information in only one field, if you need to access them separately or search by one of them. This has been discussed here quite a few times.
Assuming you allwas want to access the complete collection of information at once, I recommend to use the native serialization format of your development environment: e.g. if it is PHP, use serialze().
If it is cross-plattform, JSON might be a way to go: Good JSON encoding/decoding libraries exist for something like all environments out there. The same is true for XML, but int his context the textual overhead of XML is going to bite a bit.
On a sidenote: Are you sure, that storing the data in additional tables is slower? You might want to benchmark that before finally deciding.
Edit:
After reading, that you use PHP: If you don't want to put it in a table, stick with serialize() / unserialize() and a MEDIUMTEXT field, this works perfectly, I do it all the time.
EAV (cringe) is probably the best way to store arbitrary values like you want, but it sounds like you're firmly against additional tables for whatever reason. In light of that, you could just save the result of json_encode in the table. When you read it back, just json_decode to get it back into an array.
Keep in mind that if you ever need to search for anything in this field, you're going to have to use a SQL LIKE. If you never need to search this field or join it to anything, I suppose it's OK, but if you do, you've totally thrown performance out the window.
it can be the quotes who separate them .
key1='value1';key2='value2';key3='value3'
if not like that , give your sql example and we can see how to do it.

Is it a bad idea to escape HTML before inserting into a database instead of upon output?

I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
Other similar questions here I've seen look like for cases where HTML can still be used for formatting, so I'm asking for a case where HTML wouldn't be used at all.
you will also restrict yourself when performing the escaping before inserting into your db. let's say you decide to not use HTML as output, but JSON, plaintext, etc.
if you have stored escaped html in your db, you would first have to 'unescape' the value stored in the db, just to re-escape it again into a different format.
also see this perfect owasp article on xss prevention
Yes, because at some stage you'll want access to the original input entered. This is because...
You never know how you want to display it - in JSON, in HTML, as an SMS?
You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it a regex, then look out for confused users who might type something like this...
3<4 :->
They'll only get the 3 if it is a regex.
Suppose you have the text R&B, and store it as R&B. If someone searches for R&B, it won't match with a search SQL:
SELECT * FROM table WHERE title LIKE ?
The same for equality, sorting, etc.
Or if someone searches for life span, it could return extraneous matches with the escaped <span>'s. Though this is a bit orthogonal, and can be solved by using an external service like Elasticsearch, or by storing a raw text version in another field; similar to what #limscoder suggested.
If you expose the data via an API, the consumers may not expect the data to be escaped. Adding documentation may help.
A few months later, a new team member joins. As a well-trained developer, he always uses HTML escaping, now only to see everything is double-escaped (e.g. titles are showing up like He said "nuff" instead of He said "nuff").
Some escaping functions have additional options. Forgetting to use the same functions/options while un-escaping could result in a different value than the original.
It's more likely to happen with multiple developers/consumers working on the same data.
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

My users will import through cut and paste a large string that will contain company names.
I have an existing and growing MYSQL database of companies names, each with a unique company_id.
I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.
Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **
For example, someone writes:
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:
How to find best fuzzy match for a string in a large string database
Matching inexact company names in Java
You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
The drawbacks of SOUNDEX() are:
its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
for MySQL, at least according to the docs, SOUNDEX is broken for unicode input
Example:
SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')
/* all of these return 'M262' */
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.
They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.
Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex
As a response to deficiencies in the
Soundex algorithm, Lawrence Philips
developed the Metaphone algorithm for
the same purpose. Philips later
developed an improvement to Metaphone,
which he called Double-Metaphone.
Double-Metaphone includes a much
larger encoding rule set than its
predecessor, handles a subset of
non-Latin characters, and returns a
primary and a secondary encoding to
account for different pronunciations
of a single word in English.
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.
A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.
There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matching in my blog (also an updated article), but in summary the key issues are:
Looking at the whole string is unhelpful as the most important part
of a Company Name is not necessarily at the beginning of the Company
Name. i.e. ‘The Proctor and Gamble Company’ or ‘United States Federal
Reserve ‘
Abbreviations are common place in Company Names i.e. HP, GM, GE, P&G,
D&B etc..
Some companies deliberately spell their names incorrectly as part of
their branding and to differentiate themselves from other companies.
Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.
Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful Visualisation tool enabling us to review non-exact matches, this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.
Best of Luck!!
Here's a link to the php discussion of the soundex functions in mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.
Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.
EDIT: In response to comment--
Can you at least get the users to put the company names into multiple text boxes; 2. or use an unambigous name delimiter (say backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. Squoosh the spaces out and match for that also (so Micro Soft => microsoft, Bare Essentials => bareessentials); 5. Filter out punctuation; 6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
Test like mad and use the feedback loop from users.
the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/
the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.
We then create several indices on the word table, as follows...
A normal, lowercase index on word + key
An index on the 2nd through 5th character + key
An index on the 3rd through 6th character + key
Alternately, create a SOUNDEX() index on the word column.
Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:
ElasticSearch (Open source, has a ton of features, and so is also complex to operate)
Algolia (Proprietary, but has great docs and super easy to get up and running)
Typesense (Open source, provides the same fuzzy search-as-you-type feature as Algolia)
Check if it's spelled wrong before querying using a trusted and well tested spell checking library on the server side, then do a simple query for the original text AND the first suggested correct spelling (if spell check determined it was misspelled).
You can create custom dictionaries for any spell check library worth using, which you may need to do for matching more obscure company names.
It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.
I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.
Probably been suggested before but why not dump the data out to Excel and use the Fuzzy Match Excel plugin. This will give a score from 0 to 1 (1 being 100%).
I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.
For ROW data its more complex as we had to do a more manual process.