Finding special characters in a MySQL database

Does anyone know of a quick and easy way to locate special characters that didn't get correctly converted when data was imported into MySQL?
I think this is an issue due to data encoding (e.g. Latin-1 vs. UTF-8). Regardless of where the issue first occurred, I'm stuck with junk in my data that I need to remove.

There's unlikely to be an easy function for this because, for example, a broken UTF-8 special character will consist of two valid ISO-8859-1 characters. So while there are patterns of what those broken characters look like, there is no sure-fire way of identifying them.
You could build a search+replace function to replace the most common occurrences in your language (e.g. Ãœ for Ü if imported from UTF-8 into ISO-8859-1).
That said, it would be best to restart the import with the correct settings, if at all possible.
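If a manual cleanup is the only option, the search-and-replace can be done directly in SQL. A minimal sketch, assuming a hypothetical articles table with a title column; each REPLACE fixes exactly one known broken sequence (here the common 'Ã©' that appears where 'é' was meant):

    -- Rows containing 'Ã' are a strong hint of UTF-8 bytes read as Latin-1/Windows-1252.
    SELECT id, title
    FROM articles
    WHERE title LIKE '%Ã%';

    -- Repair one known broken sequence at a time.
    UPDATE articles
    SET title = REPLACE(title, 'Ã©', 'é')
    WHERE title LIKE '%Ã©%';

Run such updates against a backup (or inside a transaction) first, since a wrong replacement is hard to undo.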

Related

How should one reasonably handle combining characters in UTF-8

I am writing a website with a user chat function.
At some point a user decided to use diacritics to draw all over everyone's screens.
In response I removed all text that was not in the ASCII character range. I'd like to re-enable UTF-8, but I don't know what to do about combining marks (Unicode characters that modify the character they follow).
As you can see from the example below, Stack Overflow doesn't handle this problem.
Malicious input t̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀è̀̀̀̀̀̀̀x̀̀̀̀̀̀̀̀̀̀t̀̀̀̀̀̀̀̀̀̀̀̀̀
I feel like only one combining mark should be allowed, but that seems like a really excessive thing for me to need to write, and I don't know if there are any languages that take 2 or 3 combining characters. I imagine Korean uses them extensively.
This seems like it should be a solved problem, but I can't find any useful information on the topic.

SQL: What are all the delimiters?

I am running some PHP on my website pulling content from an SQL database. Queries are working and all is good there. The only problem I am having is that when I upload my dataset (.csv format) using phpMyAdmin, an error occurs that drops all contents after a certain row. Supposedly this is caused by SQL recognizing more columns in that specific row than intended. And unfortunately, this is not just a single occurrence. I cannot seem to find out exactly what the problem is, but most likely it is caused by some values in the 'description' column containing delimiters that split it into multiple columns. Hopefully, by deleting/replacing all these delimiters, the problem can be solved. I am rather new to SQL, though, and I cannot seem to find a source that simply lays out all the potential delimiters I should consider. Is there somebody who can help me out?
Thank you in advance and take care!
Regards
From personal experience, one can delimit by many different things. I've seen pipes | and commas , as well as tabs, fixed-width spacing, tildes ~ and colons :.
Taken directly from https://en.wikipedia.org/wiki/Delimiter-separated_values:
"Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used.[5][6] Despite that each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.[citation needed]
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision, others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or only quoting those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; therefore, there are sometimes problems with the files being corrupted when people try to edit them by hand. Another set of problems occur due to errors in the file structure, usually during import of file into a database (in the example above, such error may be a pupil's first name missing).
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With rising prevalence of web sites and other applications that store snippets of code in databases, simply using a " which occurs in every hyperlink and image source tag simply isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere."
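For the concrete problem above (a description field containing commas), the usual fix is not to strip the delimiters from the data but to quote the fields and tell the import which quote character is used. A sketch, assuming a hypothetical products table and a CSV file with a header row:

    -- Fields wrapped in double quotes may safely contain commas.
    LOAD DATA INFILE '/path/to/dataset.csv'
    INTO TABLE products
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row

phpMyAdmin's CSV import exposes equivalent options for the column separator and the enclosure character, so the quoting only has to be consistent between the file and the import settings.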

Which character encoding to use

If I go through my Twitter feed I can see tweets which are in different languages (some are English, some are Chinese, some have European characters, some even have emojis). I would also like to support multiple languages and emojis inside of my app. If I have a MySQL database, for example, and have a column called 'message_content' which stores message content, how can I ensure the data in this column can support all languages + emojis?
I am not sure if it is as simple as choosing a character encoding and that's it, or if it is more complicated than that.
utf8mb4 is a good choice for this. Unlike MySQL's older utf8 (utf8mb3) character set, which stores at most three bytes per character, utf8mb4 covers the full Unicode range, including the four-byte code points that emoji use.
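A minimal sketch of what that looks like; the table and column names come from the question, and the collation is just one reasonable choice:

    CREATE TABLE messages (
        id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        message_content TEXT
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- Or, for an existing column:
    ALTER TABLE messages
        MODIFY message_content TEXT
        CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Note that the connection character set has to be utf8mb4 as well, otherwise emoji will still be mangled on the way in or out.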

Storing apostrophes, exclamation marks, etc. in mysql database

I changed from latin1 to utf8. Although all sorts of text was displaying fine, I noticed non-English characters were stored in the database as weird symbols. I spent a day trying to fix that, and now non-English characters display as non-English characters in the database and display the same in the browser. However, I noticed that apostrophes are stored as &#39; and exclamation marks as &#33;. Is this normal, or should they be appearing as ' and ! in the database instead? If so, what would I need to do in order to fix that?
It really depends on what you intend to do with the contents of the database. If your invariant is that "contents of the database are sanitized and may be placed directly in a web page without further validation/sanitization", then having these HTML entities in your database makes perfect sense. If, on the other hand, your database is to store only the raw original data, and you intend to process/sanitize it before displaying it in HTML, then you should probably replace these entities with the original characters, encoded using UTF-8. So it really depends on how you interpret your database content.
The &#XX; forms are HTML character entities, implying you passed the values stored in the database through a function such as PHP's htmlspecialchars or htmlentities. If the values are processed within an HTML document (or perhaps by any HTML processor, regardless of what they're a part of), they should display fine. Outside of that, they won't.
This means you probably don't want to keep them encoded as HTML entities. You can convert the values back using the counterpart to the function you used to encode them (e.g. html_entity_decode), which should take an argument as to which encoding to convert to. Once you've done that, check some of the previously problematic entries, making sure you're using the correct encoding to view them.
If you're still having problems, there's a mismatch between the encoding the stored values are supposed to use and the one they actually use. You'll have to figure out what they're actually using, then convert them, either by pulling them from the DB and converting them to the target encoding before re-inserting them, or by re-inserting them declared with the encoding they actually use. A similar option is to convert the columns to BLOBs, then change them back to a text type declared with the character set the values actually use, and finally convert the column directly to the desired character encoding. The reason this unwieldy sequence works is that text types are converted when you change the character encoding, but binary types aren't, so the round trip through BLOB relabels the data without altering its bytes.
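As a sketch of that BLOB round trip, assuming a hypothetical comments table whose body column is declared latin1 but actually holds UTF-8 bytes:

    -- Step 1: switch to a binary type; the declared character set is dropped
    -- but the stored bytes are untouched.
    ALTER TABLE comments MODIFY body BLOB;

    -- Step 2: switch back to a text type declared with the character set
    -- the bytes really use; again, no conversion takes place.
    ALTER TABLE comments MODIFY body TEXT CHARACTER SET utf8mb4;

From here, a normal ALTER TABLE ... CONVERT TO CHARACTER SET would convert the data properly if yet another encoding is wanted.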
Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more on character encodings in general, and § 9.1.4. of the MySQL manual, "Connection Character Sets and Collations", for how encodings are used in MySQL.

What character encoding should I use for a web page containing mostly Arabic text? Is utf-8 okay?

What character encoding should I use for a web page containing mostly Arabic text?
Is utf-8 okay?
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
However, if you were wondering what encoding would be most efficient:
All Arabic characters can be encoded using a single UTF-16 code unit (2 bytes), but they may take either 2 or 3 UTF-8 code units (1 byte each), so if you were just encoding Arabic, UTF-16 would be a more space efficient option.
However, you're not just encoding Arabic: you're also encoding a significant number of characters that can be stored in a single byte in UTF-8 but take two bytes in UTF-16, namely all the HTML markup characters <, &, >, = and all the HTML element names.
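If you want to see the sizes concretely, MySQL can report both byte and character counts; a sketch (results assume a utf8mb4 connection, and the exact byte counts depend on which Arabic characters are used):

    -- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
    SELECT LENGTH(CONVERT('مرحبا' USING utf8mb4)) AS utf8_bytes,   -- 10: these basic letters take 2 bytes each (presentation forms take 3)
           LENGTH(CONVERT('مرحبا' USING utf16))   AS utf16_bytes,  -- 10: 2 bytes each
           CHAR_LENGTH('مرحبا')                   AS characters;   -- 5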
It's a trade-off and, unless you're dealing with huge documents, it doesn't matter.
I develop mostly Arabic websites and these are the two encodings I use:
1. Windows-1256
This is the most common encoding Arabic websites use. It works in most cases (90%) for Arabic users.
Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.
The problem with this encoding is that if you are developing a website for international use, this encoding won't work with every user and they will see gibberish instead of the content.
2. UTF-8
This encoding solves the previous problem and also works in URLs. I mean, if you want to have Arabic words in your URL, they need to be in UTF-8 or it won't work.
The downside of this encoding is that if you are going to save Arabic content to a database (e.g. MySQL) using this encoding (so the database will also be encoded with UTF-8), its size is going to be double what it would have been if it were encoded with Windows-1256 (with the database encoded as latin1).
I suggest going with utf-8 if you can afford the size increase.
UTF-8 is fine, yes. It can encode any code point in the Unicode standard.
Edited to add
To make the answer more complete, your realistic choices are:
UTF-8
UTF-16
UTF-32
Each comes with tradeoffs and advantages.
UTF-8
As Joe Gauterin points out, UTF-8 is very efficient for European texts but can get increasingly inefficient the "farther" from the Latin alphabet you get. If your text is all Arabic, it will actually be larger than the equivalent text in UTF-16. In practice, however, this is rarely a problem in these days of cheap and plentiful RAM, unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't easily get the fifth Arabic character in a string, because some characters might be 1 byte long (punctuation, say) while others are two or three. This makes actual processing of strings slow and error-prone.
On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.
UTF-16
UTF-16 will give you better space efficiency than UTF-8 if you're using predominantly Arabic text. I don't know the Arabic code points well enough, however, to say whether you risk having variable-length encodings here. (My guess is that this is not an issue.) If you do, in fact, have variable-length encodings, all the string-processing problems of UTF-8 apply here as well. If not, no problems.
On the other hand, if you have mixed European and Arabic texts, UTF-16 will be less space-efficient. Also, if you find yourself expanding your text forms to other texts like, say, Chinese, you definitely go back to variable length forms and the associated problems.
UTF-32
UTF-32 will basically double your space requirements. On the other hand, it's constant-sized for all known (and, likely, unknown) script forms. For raw string processing it's your fastest, best option, without the problems that variable-length encoding will cause you. (This presupposes you have a string library that knows about 32-bit characters, naturally.)
Recommendation
My own recommendation is that you use UTF-8 as your external format (because everybody supports it) for storage, transmission, etc., unless you really see a size benefit with UTF-16. So any time you read a string from the outside world it would be UTF-8, and any time you write one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!), I'd recommend using UTF-16 or UTF-32 instead (depending on whether there are any variable-length encoding issues in your UTF-16 data) for the speed and simplicity of the code.
UTF-8 is the simplest way to go since it will work with almost everything:
UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance, Chinese and Arabic can be in the same text without special codes inserted to switch the encoding.
(via Wikipedia)
Of course keep in mind that:
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.
... but in most cases it's not a big issue. It would become one if you start handling huge documents.