Removing strange characters from MySQL data

Somewhere along the way, between all the imports and exports I have done, a lot of the text on a blog I run is full of weird accented A characters.
When I export the data using mysqldump and load it into a text editor with the intention of using search-and-replace to clear out the bad characters, searching just matches every "a" character.
Does anyone know any way I can successfully hunt down these characters and get rid of them, either directly in MySQL or by using mysqldump and then reimporting the content?

This is an encoding problem; the Â is a non-breaking space (HTML entity &nbsp;) in Unicode being displayed in Latin1.
You might try something like this... first we check to make sure the matching is working:
SELECT * FROM some_table WHERE some_field LIKE BINARY '%Â%'
This should return any rows in some_table where some_field has a bad character. Assuming that works properly and you find the rows you're looking for, try this:
UPDATE some_table SET some_field = REPLACE( some_field, BINARY 'Â', '' )
And that should remove those characters. (Based on the page you linked, you don't really want an &nbsp; there anyway, as you would end up with three spaces in a row between sentences; you should only have one.)
If it doesn't work then you'll need to look at the encoding and collation being used.
EDIT: Just added BINARY to the strings; this should hopefully make it work regardless of encoding.
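If it still misbehaves, it helps to see exactly which character set and collation are in play before going further. A quick diagnostic sketch (reusing the some_table name from above):
SHOW CREATE TABLE some_table;
SHOW FULL COLUMNS FROM some_table;
-- The connection-level settings matter too, since they decide how bytes are interpreted on the way in and out:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';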

The accepted answer did not work for me.
From here http://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/ I found that the binary code for the Â character is C2A0 (by converting the column to VARBINARY and looking at what it turns into).
Then here http://www.oneminuteinfo.com/2013/11/mysql-replace-non-ascii-characters.html I found the actual solution to remove (replace) it:
update entry set english_translation = unhex(replace(hex(english_translation),'C2A0','20')) where entry_id = 4008;
The query above replaces it with a space; a normal TRIM can then be applied, or you can simply replace with '' instead.
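If the problem isn't limited to a single entry_id, the same HEX/UNHEX trick can be run across every affected row. A sketch against the same table and column, folding in the TRIM mentioned above (run the WHERE clause in a SELECT first to see which rows will be touched):
update entry set english_translation = trim(unhex(replace(hex(english_translation),'C2A0','20'))) where hex(english_translation) like '%C2A0%';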

I have had this problem and it is annoying, but solvable. As well as Â you may find you have a whole load of characters showing up in your data like these:
“
This is connected to encoding changes in the database, but so long as you do not have any of these characters in your database that you want to keep (e.g. if you are actually using a Euro symbol) then you can strip them out with a few MySQL commands as previously suggested.
In my case I had this problem with a WordPress database that I had inherited, and I found a useful set of pre-formed queries that work for WordPress here http://digwp.com/2011/07/clean-up-weird-characters-in-database/
It's also worth noting that one of the causes of the problem in the first place is opening a database in a text editor which might change the encoding in some way. So if you can possibly manipulate the database using MySQL only and not a text editor this will reduce the risk of causing further trouble.
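For reference, the queries in the linked article follow a simple REPLACE pattern against the standard WordPress tables. A sketch of that pattern (wp_posts/post_content are WordPress defaults; the exact mojibake sequences in your data may differ, so check with a SELECT before updating):
update wp_posts set post_content = replace(post_content, 'â€œ', '"');
update wp_posts set post_content = replace(post_content, 'â€™', '''');
update wp_posts set post_content = replace(post_content, 'â€', '"');
update wp_posts set post_content = replace(post_content, 'Â', '');
Note that the longer sequences need to be replaced before the shorter ones they contain, otherwise the shorter replacement will break them.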

Related

Weird character at the end of database entry

I am migrating an Excel sheet (CSV) to MySQL; however, when I do an insert, some fields end up with empty spaces at the end, and I can't get rid of them for some reason. So I assume there is a weird character at the end, since not even this:
UPDATE FOO set FIELD2 = TRIM(Replace(Replace(Replace(FIELD2,'\t',''),'\n',''),'\r',''));
gets rid of it completely; I still have whitespace at the end and I don't know how to remove it. I have over 2000 entries, so doing it manually is not an option. I am using Laravel with the revision package, and it doesn't work because it thinks those trailing spaces are changes, so it creates a bunch of duplicates. Thank you for your help.
If you think there are weird characters in the original CSV, you could open it in a text processor capable of doing regex replaces, and then replace all non-ASCII characters with nothing.
Your regex would look like this:
[^\u0000-\u007F]+
Then, after removing any possible strange characters, re-import the data into the database.
Unfortunately, regex replaces aren't possible in MySQL before version 8.0 (which added REGEXP_REPLACE), so you'll need to re-import.
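If re-importing is awkward, it also helps to see exactly which bytes are surviving the TRIM. A sketch against the FOO/FIELD2 names from the question:
select FIELD2, hex(FIELD2) from FOO limit 10;
A trailing C2A0 in the hex output is a UTF-8 non-breaking space, which TRIM does not remove; it can be stripped with the same HEX/UNHEX replacement shown earlier on this page:
update FOO set FIELD2 = trim(unhex(replace(hex(FIELD2),'C2A0','20')));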

Ideas for Find and Replace character

I need to search address fields and change one character to upper case if there is an apartment number. So '521 Main St. #3b' would change to '521 Main St. #3B'.
The way I know to do this would be to write a program that loops through the recordset, looks at the address field for the last character to see if it's an alpha, then if the character before it is a numeric, change the case of the last char and update the record.
Is this something that would be quicker/simpler with regular expressions (haven't ever used)?
If so, is this best done from within a programming environment or using a text editor such as TextMate or vi? The data is in MySQL and Excel, but I can export it to a text file.
Thanks.
I solved this using TextMate which, once I began to understand a little regex, was simple (details here: Regex Syntax for making the last character Uppercase in TextMate).
Still, I wonder if something like sed or awk (which I started to try out) might be a better tool. The SQL solution that Olexa provided also works; I just don't know how to make it apply to the entire recordset.
If the data is stored in MySQL, then it is better to process it there:
UPDATE addresses
SET address = CONCAT(LEFT(address, CHAR_LENGTH(address) - 1), UPPER(RIGHT(address, 1)))
WHERE address REGEXP BINARY '#[[:digit:]]+[[:lower:]]{1}$'
;
I've added BINARY because otherwise REGEXP is not case-sensitive, but BINARY may need to be omitted to support multi-byte strings. In this case, surplus updates will be made, but the result would be correct anyway.
P. S. An example on SQL Fiddle showing which values are affected, and how they are affected: http://sqlfiddle.com/#!2/b29326/1
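If you want to preview what the statement will do before running it, the same expressions work in a plain SELECT (a sketch using the same addresses table and REGEXP from the answer above):
SELECT address,
       CONCAT(LEFT(address, CHAR_LENGTH(address) - 1), UPPER(RIGHT(address, 1))) AS fixed
FROM addresses
WHERE address REGEXP BINARY '#[[:digit:]]+[[:lower:]]{1}$';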

MySQL Query to Identify bad characters?

We have some tables that were set with the Latin character set instead of UTF-8, which allowed bad characters to be entered into the tables. The usual culprit is people copy/pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertible characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
You're probably noticing something like this 'bug'. The 'bad characters' are most likely UTF-8 control characters (e.g. \x80). You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0
From that linked bug, they recommend using type BLOB to store text from Windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
Take a look at this Q/A (it's all about your client encoding, a.k.a. SET NAMES).
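A more general way to flag rows that contain any non-ASCII character at all (a sketch with hypothetical mytable/myfield names) is to compare the column with an ASCII-converted copy of itself; anything MySQL cannot convert becomes ?, so affected rows show up as a mismatch:
SELECT *
FROM mytable
WHERE myfield <> CONVERT(myfield USING ascii);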

Unicode escape sequence in command line MySQL

Short version:
What kind of escape sequence can one use to search for unicode characters in command line mysql?
Long version:
I'm looking for a way to search a column for records containing a Unicode sequence, U+200B, in MySQL from the command line. I can't figure out which kind of escape to use. I've tried \u200B and x200B, and even &#8203;. I finally found one blog that suggested the _utf8 syntax. This will produce the character on the command line:
select _utf8 x'200B';
Now I'm stuck trying to get that working in a "LIKE" query.
This generates the characters, but the % seem to lose their special meaning when placed in the LIKE part:
select _utf8 x'0025200B0025';
I also tried a concat but it didn't work either:
select concat('%', _utf8 x'200B', '%');
More background:
I have some data that has zero-width space characters (zwsp) in it, Unicode point U+200B. This is typically caused by copy/paste from websites that use the zwsp in their output. With most Unicode characters, I can just paste the character into the terminal (or create it with a keycode), but since this one is invisible it's a bit more challenging. I can create a file that generates a "%%" sequence (with the zero-width space between the two percent signs) and copy/paste it into the terminal, and it will work, but it leaves my command history and terminal output screwy. I would think there is a straightforward way to do this in MySQL, but so far I've come up short.
Thanks in advance,
-Paul Burney
select _utf8 x'0025200B0025';
That's not UTF-8, it's UTF-16/UCS-2. You might be able to say SELECT _ucs2 0x0025200B0025 if you have UCS-2 support in your copy of MySQL.
Otherwise, the byte sequence encoding character U+200B in UTF-8 would be 0xE2, 0x80, 0x8B:
select 0xE2808B;
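To actually search and clean with it, the hex literal can go straight into LIKE and REPLACE, which avoids the problem of % losing its wildcard meaning (a sketch with hypothetical table and column names):
SELECT * FROM mytable WHERE mycolumn LIKE CONCAT('%', 0xE2808B, '%');
UPDATE mytable SET mycolumn = REPLACE(mycolumn, 0xE2808B, '');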
If it is Linux, hold Ctrl + Shift + U, then release the U and type 200B.

Charset Problem while migrating database

I have a custom-made CMS that I must migrate to work on WordPress. Everything worked fine except the charset module.
Since this is Romanian blog content, some special characters are used (ă, î, ș, â, Ț). When I insert this content into the WordPress wp_posts table, WordPress displays them as "?".
I've tried all kinds of things, like changing the charset from utf8 to latin1, latin2, and so on, but with no result.
What's more, when I try to replace those special characters with normal ones (e.g. ă to a, î to i), nothing happens; the content remains the same (some characters actually do get changed, but not all).
What am I doing wrong, and what must I do to get it right?
Thanks!
Character sets are a complete nightmare. What I'd do is use mysqldump to dump your database to an SQL file. Check to see if the special characters still look right.
Then, using find and replace in a text editor, replace all special characters with the correct HTML entity, e.g. Ă becomes &#258;.
http://meta.wikimedia.org/wiki/Help:Romanian_characters
Then delete your database, set all conceivable settings to utf-8, and import your dump.
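For the "set all conceivable settings to utf-8" step, the relevant statements look roughly like this (a sketch assuming a database named blog_db and a MySQL version with utf8mb4, the full UTF-8 character set; adjust the names to your setup):
ALTER DATABASE blog_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE wp_posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SET NAMES utf8mb4; -- make the client connection match before importing the dump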
WordPress also has an extensive article about character encodings.
Good luck!