Unicode escape sequence in command line MySQL - mysql

Short version:
What kind of escape sequence can one use to search for unicode characters in command line mysql?
Long version:
I'm looking for a way to search a column for records containing a unicode sequence, U+200B, in mysql from the command line. I can't figure out which kind of escape to use. I've tried \u200B and x200B and even ​ I finally found one blog that suggested the _utf8 syntax. This will produce the character on the command line:
select _utf8 x'200B';
Now I'm stuck trying to get that working in a "LIKE" query.
This generates the characters, but the % seem to lose their special meaning when placed in the LIKE part:
select _utf8 x'0025200B0025';
I also tried a concat but it didn't work either:
select concat('%', _utf8 x'200B', '%');
More background:
I have some data that has zero width space characters (zwsp) in it, Unicode Point U+200B. This is typically caused by copy/paste from websites that use the zwsp in their output. With most unicode characters, I can just paste the character into the terminal (or create it with a keycode), but since this one is invisible it's a bit more challenging. I can create a file that generates a "%%" sequence and copy/paste it to the terminal and it will work but it leaves my command history and terminal output screwy. I would think there is a straightforward way to do this in MySQL, but so far I've come up short.
Thanks in advance,
-Paul Burney

select _utf8 x'0025200B0025';
That's not UTF-8, it's UTF-16/UCS-2. You might be able to say SELECT _ucs2 0x0025200B0025 if you have UCS-2 support in your copy of MySQL.
Otherwise, the byte sequence encoding character U+200B in UTF-8 would be 0xE2, 0x80, 0x8B:
select 0xE2808B;

If it is Linux then hold Ctrl + Shift + U then release the U and type 200B.

Related

regex issues striping out extended ascii/unicode

I am working on importing a sql dump into a mysql database using workbench. The data set includes some extended ascii/unicode characters like "ç" in "Français" in some of the insert statements. These charecters break the import.
I do not care about those characters so using notepadd++ and this page Notepad++, How to remove all non ascii characters with regex? I am trying to strip out all the extended characters using this regex [^\x00-\x7F]+ which per my poor understanding is basically NOT 00-7f or NUL(0) through DEL(127).
It finds the right characters, but for some reason also finds the CRLF at the end of each line - which is not in this range and I am not sure why as CR and LF are \x0A and \x0D they should not be in that set.
I am sure I am missing something simple - so is there a better regex to use to not lose my newlines, or even a way to tell SQL workbench to ignore the extended characters?
Here is an example of one of the insert lines with an extended value in it:
INSERT INTO as_catalog VALUES('525234','Google Apps Sync™ for
Microsoft Outlook® 3.3.355.950','0');
Thanks!

MySQL query with non-printing characters (left-to-right mark)

I just found myself lost in the interesting situation that I need to query MySQL for fields containing a so called Left-to-right mark.
As the nature of this character is to be non-printing, thus invisible, I'm unable to simply copy/paste it into a query.
As mentioned in the linked Wikipedia article, the Left-to-right mark is Unicode character U+200F, which is a fact that I'm sure is the key to success in my current adventure.
My question is: How do I use raw Unicode in a MySQL query? Something along the lines of:
SELECT * FROM users WHERE username LIKE '%\U+200F%'
or
SELECT * FROM users WHERE username REGEXP '\U+200F'
or whatever the correct syntax for Unicode in MySQL is and depending on whether this is supported with LIKE and/or REGEXP.
To get a unicode char, something like this should work:
SELECT CHAR(<number> USING utf8);
Also, don't use REGEXP, because the regexp lib used by MySQL is very old, and doesn't support multi-byte charsets.

MySQL Query to Identify bad characters?

We have some tables that were set with the Latin character set instead of UTF-8 and it allowed bad characters to be entered into the tables, the usual culprit is people copy / pasting from Word or Outlook which copys those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection chacater set was set to UTF8 when you filled the data in.
MySQL replaces unconvertable characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
You're probably noticing something like this 'bug'. The 'bad characters' are most likely UTF-8 control characters (eg \x80). You might be able to identify them using a query like
SELECT bar FROM foo WHERE bar LIKE LOCATE(UNHEX(80), bar)!=0
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
Take a look at this Q/A (it's all about your client encoding aka SET NAMES )

Removing strange characters from MySQL data

Somewhere along the way, between all the imports and exports I have done, a lot of the text on a blog I run is full of weird accented A characters.
When I export the data using mysqldump and load it into a text editor with the intention of using search-and-replace to clear out the bad characters, searching just matches every "a" character.
Does anyone know any way I can successfully hunt down these characters and get rid of them, either directly in MySQL or by using mysqldump and then reimporting the content?
This is an encoding problem; the  is a non-breaking space (HTML entity ) in Unicode being displayed in Latin1.
You might try something like this... first we check to make sure the matching is working:
SELECT * FROM some_table WHERE some_field LIKE BINARY '%Â%'
This should return any rows in some_table where some_field has a bad character. Assuming that works properly and you find the rows you're looking for, try this:
UPDATE some_table SET some_field = REPLACE( some_field, BINARY 'Â', '' )
And that should remove those characters (based on the page you linked, you don't really want an nbsp there as you would end up with three spaces in a row between sentences etc, you should only have one).
If it doesn't work then you'll need to look at the encoding and collation being used.
EDIT: Just added BINARY to the strings; this should hopefully make it work regardless of encoding.
The accepted answer did not work for me.
From here http://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/ I have found that the binary code for  character is c2a0 (by converting the column to VARBINARY and looking what it turns to).
Then here http://www.oneminuteinfo.com/2013/11/mysql-replace-non-ascii-characters.html found the actual solution to remove (replace) it:
update entry set english_translation = unhex(replace(hex(english_translation),'C2A0','20')) where entry_id = 4008;
The query above replaces it to a space, then a normal trim can be applied or simply replace to '' instead.
I have had this problem and it is annoying, but solvable. As well as  you may find you have a whole load of characters showing up in your data like these:
“
This is connected to encoding changes in the database, but so long as you do not have any of these characters in your database that you want to keep (e.g. if you are actually using a Euro symbol) then you can strip them out with a few MySQL commands as previously suggested.
In my case I had this problem with a Wordpress database that I had inherited, and I found a useful set of pre-formed queries that work for Wordpress here http://digwp.com/2011/07/clean-up-weird-characters-in-database/
It's also worth noting that one of the causes of the problem in the first place is opening a database in a text editor which might change the encoding in some way. So if you can possibly manipulate the database using MySQL only and not a text editor this will reduce the risk of causing further trouble.

Escape characters in MySQL, in Ruby

I have a couple escaped characters in user-entered fields that I can't figure out.
I know they are the "smart" single and double quotes, but I don't know how to search for them in mysql.
The characters in ruby, when output from Ruby look like \222, \223, \224 etc
irb> "\222".length => 1
So - do you know how to search for these in mysql? When I look in mysql, they look like '?'.
I'd like to find all records that have this character in the text field. I tried
mysql> select id from table where field LIKE '%\222%'
but that did not work.
Some more information - after doing a mysqldump, this is how one of the characters is represented - '\\xE2\\x80\\x99'. It's the smart single quote.
Ultimately, I'm building an RTF file and the characters are coming out completely wrong, so I'm trying to replace them with 'dumb' quotes for now. I was able to do a gsub(/\222\, "'").
Thanks.
I don't quite understand your problem but here is some info for you:
First, there are no escaped characters in the database. Because every character being stored as is, with no escaping.
they don't "look ilke ?". I's just wrong terminal settings. SET NAMES query always should be executed first, to match client encoding.
you have to determine character set and use it on every stage - in the database, in the mysql client, in ruby.
you should distinguish ruby strings representation from character itself.
To enter character in the mysql query, you can use char function. But in terminal only. In ruby just use the character itself.
smart quotes looks like 2-byte encoded in the unicode. You have to determine your encoding first.