I changed from latin1 to utf8. Although all sorts of text was displaying fine, I noticed non-English characters were stored in the database as weird symbols. I spent a day trying to fix that, and now non-English characters display as non-English characters in the database and render the same in the browser. However, I noticed that apostrophes are stored as &#39; and exclamation marks as &#33;. Is this normal, or should they appear as ' and ! in the database instead? If so, what would I need to do to fix that?
It really depends on what you intend to do with the contents of the database. If your invariant is that "contents of the database are sanitized and may be placed directly in a web page without further validation/sanitization", then having &amp; and other HTML entities in your database makes perfect sense. If, on the other hand, your database is meant to store only the raw original data, and you intend to process/sanitize it before displaying it in HTML, then you should replace these entities with the original characters, encoded in UTF-8. So it really comes down to how you interpret your database content.
The &#XX; forms are HTML character entities, implying you passed the values stored in the database through a function such as PHP's htmlspecialchars or htmlentities. If the values are processed within an HTML document (or perhaps by any HTML processor, regardless of what they're a part of), they should display fine. Outside of that, they won't.
This means you probably don't want to keep them encoded as HTML entities. You can convert the values back using the counterpart to the function you used to encode them (e.g. html_entity_decode), which should take an argument as to which encoding to convert to. Once you've done that, check some of the previously problematic entries, making sure you're using the correct encoding to view them.
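In PHP that counterpart is html_entity_decode. As a language-neutral illustration, here is a minimal Node.js sketch of the same idea; the function name and the deliberately tiny named-entity map are my own, and real code should use a full entity decoder:

```javascript
// Minimal sketch: decode numeric HTML entities (&#39; / &#x27;) and a few
// common named ones back to their literal characters before re-storing
// the raw UTF-8 text.
const named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeEntities(s) {
  return s.replace(/&(#x?[0-9a-fA-F]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const code = body[1].toLowerCase() === 'x'
        ? parseInt(body.slice(2), 16)   // hex form, e.g. &#x27;
        : parseInt(body.slice(1), 10);  // decimal form, e.g. &#39;
      return String.fromCodePoint(code);
    }
    return named[body] ?? match; // leave unknown named entities untouched
  });
}

console.log(decodeEntities('It&#39;s here&#33;')); // "It's here!"
```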
If you're still having problems, there's a mismatch between the encoding the stored values are supposed to use and the one they actually use. You'll have to figure out the actual encoding, then either pull the values from the DB, convert them to the target encoding, and re-insert them, or re-insert them while declaring the encoding they actually use. A variant of the latter, done entirely in SQL, is to convert the columns to BLOB, change the column character set, change the column type back to a text type, and finally convert the column to the desired character set. The reason this unwieldy sequence works is that text types have their bytes converted when the character set changes, but binary types don't.
Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more on character encodings in general, and § 9.1.4. of the MySQL manual, "Connection Character Sets and Collations", for how encodings are used in MySQL.
I am running some PHP on my website pulling content from a SQL database. Queries are working and all is good there. The only problem I am having is that when I upload my dataset (.csv format) using phpMyAdmin, an error occurs that drops all contents after a certain row. Apparently this is caused by SQL recognizing more columns in that specific row than intended, and unfortunately it is not just a single occurrence. I cannot pin down exactly what the problem is, but most likely some values in the 'description' column contain delimiters that split it into multiple columns. Hopefully, by deleting/replacing all these delimiters the problem can be solved. I am rather new to SQL, though, and I cannot find a source that simply lays out all the potential delimiters I should consider. Can somebody help me out?
From personal experience, one can delimit by many different things. I've seen pipes (|) and commas (,), as well as tabs, fixed-width spaces, tildes (~) and colons (:).
Taken directly from https://en.wikipedia.org/wiki/Delimiter-separated_values:
"Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used. Despite that each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision, others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or only quoting those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; therefore, there are sometimes problems with the files being corrupted when people try to edit them by hand. Another set of problems occur due to errors in the file structure, usually during import of file into a database (in the example above, such error may be a pupil's first name missing).
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With rising prevalence of web sites and other applications that store snippets of code in databases, simply using a " which occurs in every hyperlink and image source tag simply isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere."
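The quoting convention described above (surrounding fields in double quotes and doubling embedded quotes) is easy to apply before export. A minimal Node.js sketch in the style of RFC 4180 — the function name is just for illustration:

```javascript
// Sketch: escape a single field for CSV output. Wrap the field in double
// quotes when it contains the delimiter, a quote, or a newline, and double
// any embedded quotes so they are not read as field boundaries.
function csvField(value, delimiter = ',') {
  const s = String(value);
  if (s.includes(delimiter) || s.includes('"') || /[\r\n]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

console.log(csvField('Hello, world')); // "Hello, world" (quoted)
console.log(csvField('plain'));        // plain (unchanged)
```

Applying this to the 'description' column before building the CSV avoids the delimiter-collision problem without having to strip characters out of the data.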
The Expedia Hotel Database provides some of its data in the ISO-8859 encoding:
Files with ONLY English content are ISO-8859.
However:
ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.
So it is a series of different encodings with notable differences, rather than a single one. My problem:
How can I convert their "ONLY English content" data using NodeJS into a safer form to store in my database, that I can reliably deliver to user's browser, without worrying that the data get corrupted at the user end?
I am thinking of converting all data from ISO/IEC 8859-X (for each X = 1,...,16) into HTML entities first, then checking for the presence of non-ASCII characters, which would mean the encoding was not correct and I have to try the next X. If no X works, that data entry is presumably corrupted and should be discarded, as it is unlikely to display correctly anyway. The whole task feels somewhat cumbersome, so I am wondering if there is a simpler way.
Note that even though the content is declared "ONLY English content", many data entries do actually contain accented characters that might get corrupted in a wrong encoding.
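The conversion step itself is straightforward in Node.js 18+, where TextDecoder is a global. One caveat worth labelling: the WHATWG encoding registry used by TextDecoder aliases 'iso-8859-1' to windows-1252, which only differs from true Latin-1 in the 0x80–0x9F range; the function name below is just for illustration:

```javascript
// Sketch: decode an ISO-8859-X buffer and re-encode it as UTF-8 for storage.
// Caveat: WHATWG decoders alias 'iso-8859-1' to windows-1252, which differs
// from strict Latin-1 only for bytes 0x80-0x9F (rare in real text data).
function toUtf8(buf, encoding = 'iso-8859-1') {
  const text = new TextDecoder(encoding).decode(buf);
  return Buffer.from(text, 'utf8'); // bytes safe to store in a utf8mb4 column
}

// 0xE9 is 'é' in ISO-8859-1; it becomes the two-byte UTF-8 sequence C3 A9.
console.log(toUtf8(Buffer.from([0x48, 0xE9])).toString('hex')); // '48c3a9'
```

Once the data is UTF-8 in a utf8mb4 column, it can be delivered to browsers without further guessing, provided the HTTP response declares charset=utf-8.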
We develop an Android app. The app accepts text from users and uploads it to a server (MySQL); the text is then read by other users.
While testing I found that Hindi text gets inserted into the column as '????? '. After searching SO, I changed the collation to utf8_general_ci.
I am new to collations. I want to let users input text in any language in the world and let others access it. What should I do? Accuracy is a must.
But I saw a comment where one says, "You should never, ever use utf8_general_ci. It simply doesn’t work. It’s a throwback to the bad old days of ASCII stooopeeedity from fifty years ago. Unicode case-insensitive matching cannot be done without the foldcase map from the UCD. For example, “Σίσυφος” has three different sigmas in it; or how the lowercase of “TSCHüẞ” is “tschüß”, but the uppercase of “tschüß” is “TSCHÜSS”. You can be right, or you can be fast. Therefore you must use utf8_unicode_ci, because if you don’t care about correctness, then it’s trivial to make it infinitely fast."
Your question title is asking about collations, but in the body you say:
I want to let user input text in any language in the world and others get the access.
So I'm assuming that is what you're specifically after. To clarify: collations affect how MySQL compares strings with each other, but they're not what ultimately opens up the possibility of storing Unicode characters.
For storage you need to ensure that the character set is defined correctly. MySQL allows you to specify character set and collation values at the column level, but it also allows defaults at the table and database level. In general I'd advise setting defaults at the database and table level and letting MySQL handle the rest when defining columns. Note that if columns already exist with a different character set, you'll need to investigate changing it. Depending on what you're using to communicate with MySQL, you may also need to specify a character encoding for the connection.
Note that utf8mb4 is an absolute must for the character set; do not use plain utf8, as you won't be able to store Unicode characters that occupy 4 bytes in UTF-8, such as emoji.
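A quick way to see why the 3-byte ceiling of MySQL's legacy utf8 (utf8mb3) is a problem:

```javascript
// '😀' (U+1F600) encodes to 4 bytes in UTF-8 — beyond the 3-byte-per-character
// limit of MySQL's legacy 'utf8' (utf8mb3) character set, so inserting it
// into a utf8mb3 column fails or truncates; utf8mb4 stores it fine.
console.log(Buffer.byteLength('😀', 'utf8')); // 4
console.log(Buffer.byteLength('ü', 'utf8'));  // 2 (fits in utf8mb3 too)
```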
As for the collation to use, I don't have a recommendation really, as it sort of depends what you're aiming for, speed or accuracy. There is a fair amount of information around which covers the topic in other answers.
We have a MySQL InnoDB table holding ~10 columns of small JavaScript files and PNG images (<2 KB each), all Base64-encoded.
There are few inserts and comparatively many reads; however, the output is cached in a Memcached instance for a few minutes, which avoids most subsequent reads.
As it is right now we are using BLOB for those columns, but I am wondering if there is an advantage in switching to the TEXT datatype, in terms of performance or snapshot backups.
My digging indicates that BLOB and TEXT are close to identical for my case, and since I don't know beforehand what type of data will actually be stored, I went with BLOB.
Do you have any pointers on the TEXT vs BLOB debate for this specific case?
One shouldn't store Base64-encoded data in one's database...
Base64 is a coding in which arbitrary binary data is represented using only printable text characters: it was designed for situations where such binary data needs to be transferred across a protocol or medium that can handle only printable-text (e.g. SMTP/email). It increases the data size (by 33%) and adds the computational cost of encoding/decoding, so it should be avoided unless absolutely necessary.
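The 33% figure follows directly from how Base64 works: every 3 input bytes become 4 output characters. In Node.js:

```javascript
// Base64 maps each 3-byte group of input to 4 printable characters,
// so the encoded form is one third larger than the raw data.
const raw = Buffer.alloc(3000);          // 3000 bytes of (zeroed) binary data
const encoded = raw.toString('base64');
console.log(encoded.length);             // 4000
```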
By contrast, the whole point of BLOB columns is that they store opaque binary strings. So just go ahead and store your stuff directly into your BLOB columns without first Base64-encoding them. (That said, if MySQL has a more suitable type for the particular data being stored, you may wish to use that instead: for example, text files like JavaScript sources could benefit from being stored in TEXT columns for which MySQL natively tracks text-specific metadata—more on this below).
The (erroneous) idea that SQL databases require printable-text encodings like Base64 for handling arbitrary binary data has been perpetuated by a large number of ill-informed tutorials. This idea appears to be seated in the mistaken belief that, because SQL comprises only printable-text in other contexts, it must surely require it for binary data too (at least for data transfer, if not for data storage). This is simply not true: SQL can convey binary data in a number of ways, including plain string literals (provided that they are properly quoted and escaped like any other string); of course, the preferred way to pass data (of any type) to your database is through parameterised queries, and the data types of your parameters can just as easily be raw binary strings as anything else.
...unless it's cached for performance reasons...
The only situation in which there might be some benefit from storing Base64-encoded data is where it's usually transmitted across a protocol requiring such encoding (e.g. by email attachment) immediately after being retrieved from the database—in which case, storing the Base64-encoded representation would save from having to perform the encoding operation on the otherwise raw data upon every fetch.
However, note in this sense that the Base64-encoded storage is merely acting as a cache, much like one might store denormalised data for performance reasons.
...in which case it should be TEXT not BLOB
As alluded above: the only difference between TEXT and BLOB columns is that, for TEXT columns, MySQL additionally tracks text-specific metadata (such as character encoding and collation) for you. This additional metadata enables MySQL to convert values between storage and connection character sets (where appropriate) and perform fancy string comparison/sorting operations.
Generally speaking: if two clients working in different character sets should see the same bytes, then you want a BLOB column; if they should see the same characters then you want a TEXT column.
With Base64, those two clients must ultimately find that the data decodes to the same bytes; but they should see that the stored/encoded data has the same characters. For example, suppose one wishes to insert the Base64-encoding of 'Hello world!' (which is 'SGVsbG8gd29ybGQh'). If the inserting application is working in the UTF-8 character set, then it will send the byte sequence 0x53475673624738676432397962475168 to the database.
if that byte sequence is stored in a BLOB column and later retrieved by an application that is working in UTF-16*, the same bytes will be returned—which represent '升噳扇㡧搲㥹扇全' and not the desired Base64-encoded value; whereas
if that byte sequence is stored in a TEXT column and later retrieved by an application that is working in UTF-16, MySQL will transcode on-the-fly to return the byte sequence 0x0053004700560073006200470038006700640032003900790062004700510068—which represents the original Base64-encoded value 'SGVsbG8gd29ybGQh' as desired.
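The byte values used in this example can be checked in Node.js:

```javascript
// Reproduce the example: Base64-encode 'Hello world!' and inspect the raw
// bytes that a UTF-8 client would send to the database for that string.
const b64 = Buffer.from('Hello world!', 'utf8').toString('base64');
console.log(b64); // 'SGVsbG8gd29ybGQh'

console.log(Buffer.from(b64, 'utf8').toString('hex'));
// '53475673624738676432397962475168'
```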
Of course, you could nevertheless use BLOB columns and track the character encoding in some other way—but that would just needlessly reinvent the wheel, with added maintenance complexity and risk of introducing unintentional errors.
* Actually MySQL doesn't support using client character sets that are not byte-compatible with ASCII (and therefore Base64 encodings will always be consistent across any combination of them), but this example nevertheless serves to illustrate the difference between BLOB and TEXT column types and thus explains why TEXT is technically correct for this purpose even though BLOB will actually work without error (at least until MySQL adds support for non-ASCII compatible client character sets).
Does anyone know of a quick and easy way to locate special characters that didn't get correctly converted when data was imported into MySQL?
I think this is an issue with data encoding (e.g. Latin-1 vs. UTF-8). Regardless of where the issue first occurred, I'm stuck with junk in my data that I need to remove.
There's unlikely to be an easy function for this, because for example, a broken UTF-8 special character will consist of two valid ISO-8859-1 characters. So while there are patterns of what those broken characters look like, there is no sure-fire way of identifying them.
You could build a search+replace function to replace the most common occurrences in your language (e.g. Ãœ for Ü if imported from UTF-8 into ISO-8859-1).
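Often the replacement can be done wholesale rather than pattern by pattern, by reversing the mis-decode itself. A Node.js sketch, assuming every affected value took exactly one "UTF-8 bytes read as Latin-1" round trip (values that were mangled twice, or that mix clean and broken text, need more care):

```javascript
// Sketch: repair text whose UTF-8 bytes were mis-read as ISO-8859-1 by
// reinterpreting each character as its Latin-1 byte and decoding the
// resulting byte sequence as UTF-8.
function fixMojibake(s) {
  return Buffer.from(s, 'latin1').toString('utf8');
}

console.log(fixMojibake('Ã¼')); // 'ü'
```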
That said, it would be best to restart the import with the correct settings, if at all possible.