Regex issues stripping out extended ASCII/Unicode - MySQL

I am working on importing a SQL dump into a MySQL database using Workbench. The data set includes some extended ASCII/Unicode characters, like the "ç" in "Français", in some of the insert statements. These characters break the import.
I do not care about those characters, so using Notepad++ and this page ("Notepad++, How to remove all non ascii characters with regex?") I am trying to strip out all the extended characters using the regex [^\x00-\x7F]+, which per my poor understanding basically means NOT 00-7F, i.e. anything outside NUL (0) through DEL (127).
It finds the right characters, but for some reason it also finds the CRLF at the end of each line. I am not sure why: CR and LF are \x0D and \x0A, which fall inside \x00-\x7F, so the negated set should not match them.
I am sure I am missing something simple - so is there a better regex to use to not lose my newlines, or even a way to tell SQL workbench to ignore the extended characters?
Here is an example of one of the insert lines with an extended value in it:
INSERT INTO as_catalog VALUES('525234','Google Apps Sync™ for
Microsoft Outlook® 3.3.355.950','0');
Thanks!
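For comparison, here is a minimal Python sketch of the same character class applied outside Notepad++. It does not explain the Notepad++ behaviour described above; it only illustrates that, as a plain regex, [^\x00-\x7F]+ leaves CR and LF alone. The function name strip_non_ascii is made up for this example, and the sample line is the INSERT from the question with CRLF line endings assumed.

```python
import re

# Runs of characters outside the 7-bit ASCII range (code points above 0x7F).
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text: str) -> str:
    """Remove non-ASCII characters; CR (0x0D) and LF (0x0A) are ASCII, so they are kept."""
    return NON_ASCII.sub("", text)


# Sample line from the question, with CRLF line endings assumed.
line = (
    "INSERT INTO as_catalog VALUES('525234','Google Apps Sync™ for\r\n"
    "Microsoft Outlook® 3.3.355.950','0');\r\n"
)
cleaned = strip_non_ascii(line)
print(repr(cleaned))
```

If the dump is processed this way, the file should be opened with its real encoding and with newline="" so that Python does not translate the line endings itself.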

Related

MySQL: how to replace literal \r\n with special characters \r\n

I have some faulty PHP code which inserted literal \r\n characters into the database instead of the special characters representing new line and carriage return. Can anyone help me come up with a query that will replace the literals with the special characters?
Here's an SQL Fiddle setup. All I really need is something that will return the row containing "abc\r\ndef" rather than the other row. It's probably a very simple escape that's needed, but I can't work it out.
http://sqlfiddle.com/#!9/1f2acb/1
Once I have that query I guess I will simply use
UPDATE test SET txt = replace(txt, 'UNKNOWN EXPRESSION', '\r\n');
I'm running MySQL 5.5 on Ubuntu.
The answer was in a similar question that juanvan linked to.
UPDATE test set txt = replace(txt,'\\r\\n','\r\n');
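The difference between the four literal characters \r\n and the two control characters can be sketched in Python, where the same doubling of backslashes applies inside string literals. This is only an illustration of the idea behind the REPLACE call above, not a replacement for it; unescape_crlf is a hypothetical helper name.

```python
def unescape_crlf(text: str) -> str:
    """Turn the literal characters backslash-r-backslash-n into a real CR LF pair."""
    # "\\r\\n" is four characters: backslash, r, backslash, n.
    # "\r\n" is two characters: carriage return, line feed.
    return text.replace("\\r\\n", "\r\n")


stored = "abc\\r\\ndef"        # what the faulty code wrote: 10 characters
fixed = unescape_crlf(stored)  # 8 characters, with a real line break
print(len(stored), len(fixed))
print(repr(fixed))
```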

Ideas for Find and Replace character

I need to search address fields and change one character to upper case if there is an apartment number. So '521 Main St. #3b' would change to '521 Main St. #3B'.
The way I know to do this would be to write a program that loops through the recordset, looks at the address field for the last character to see if it's an alpha, then if the character before it is a numeric, change the case of the last char and update the record.
Is this something that would be quicker/simpler with regular expressions (haven't ever used)?
If so, is this best done from within a programming environment or using a text editor such as TextMate or vi? The data is in MySQL and Excel, but I can export it to a text file.
Thanks.
I solved this using TextMate which, once I began to understand a little regex, was simple. (details here Regex Syntax for making the last character Uppercase in TextMate)
Still, I wonder if something like sed or awk, (which I started to try out) might be a better tool. And the SQL solution that Olexa provided works. I just don't know how to have it apply to the entire recordset.
If the data is stored in MySQL, then it is better to process it there:
UPDATE addresses
SET address = CONCAT(LEFT(address, CHAR_LENGTH(address) - 1), UPPER(RIGHT(address, 1)))
WHERE address REGEXP BINARY '#[[:digit:]]+[[:lower:]]{1}$'
;
I've added BINARY because otherwise REGEXP is not case-sensitive, but BINARY may need to be omitted to support multi-byte strings. In this case, surplus updates will be made, but the result would be correct anyway.
P. S. An example on SQL Fiddle showing which values are affected, and how they are affected: http://sqlfiddle.com/#!2/b29326/1
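For data exported to a text file, the same rule (a "#", one or more digits, then a single lowercase letter at the end of the string) can be sketched with a regular expression in Python. The function name is hypothetical, and the pattern only covers the ASCII case described in the question.

```python
import re

# "#", one or more digits, then exactly one lowercase ASCII letter at the end.
APT_SUFFIX = re.compile(r"(#[0-9]+)([a-z])$")


def upcase_apartment_letter(address: str) -> str:
    """Upper-case the trailing letter of an apartment number such as '#3b'."""
    return APT_SUFFIX.sub(lambda m: m.group(1) + m.group(2).upper(), address)


for address in ["521 Main St. #3b", "12 Elm Rd", "9 Oak Ave. #12B"]:
    print(upcase_apartment_letter(address))
```

Addresses without a matching suffix are returned unchanged, so the function can be applied to every row of an export.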

Removing strange characters from MySQL data

Somewhere along the way, between all the imports and exports I have done, a lot of the text on a blog I run is full of weird accented A characters.
When I export the data using mysqldump and load it into a text editor with the intention of using search-and-replace to clear out the bad characters, searching just matches every "a" character.
Does anyone know any way I can successfully hunt down these characters and get rid of them, either directly in MySQL or by using mysqldump and then reimporting the content?
This is an encoding problem; the Â comes from a non-breaking space (HTML entity &nbsp;) stored as UTF-8 but being displayed as Latin1.
You might try something like this... first we check to make sure the matching is working:
SELECT * FROM some_table WHERE some_field LIKE BINARY '%Â%'
This should return any rows in some_table where some_field has a bad character. Assuming that works properly and you find the rows you're looking for, try this:
UPDATE some_table SET some_field = REPLACE( some_field, BINARY 'Â', '' )
And that should remove those characters (based on the page you linked, you don't really want an nbsp there as you would end up with three spaces in a row between sentences etc, you should only have one).
If it doesn't work then you'll need to look at the encoding and collation being used.
EDIT: Just added BINARY to the strings; this should hopefully make it work regardless of encoding.
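The encoding mix-up behind the stray Â can be reproduced in a few lines of Python. This is a sketch of the mechanism only, assuming the text was stored as UTF-8 and read back as Latin1; the variable names are illustrative.

```python
NBSP = "\u00a0"  # non-breaking space

# UTF-8 stores the non-breaking space as two bytes: 0xC2 0xA0.
utf8_bytes = NBSP.encode("utf-8")

# Reading those two bytes as Latin1 yields two characters: "Â" and a non-breaking space.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong decode restores the original single character.
repaired = misread.encode("latin-1").decode("utf-8")

print(utf8_bytes.hex(), [hex(ord(c)) for c in misread], repaired == NBSP)
```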
The accepted answer did not work for me.
From here http://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/ I have found that the binary code for the Â character is c2a0 (by converting the column to VARBINARY and looking at what it turns into).
Then here http://www.oneminuteinfo.com/2013/11/mysql-replace-non-ascii-characters.html found the actual solution to remove (replace) it:
update entry set english_translation = unhex(replace(hex(english_translation),'C2A0','20')) where entry_id = 4008;
The query above replaces it to a space, then a normal trim can be applied or simply replace to '' instead.
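The same replacement can be sketched on raw bytes in Python, which also shows one thing to watch with the hex-string approach: a textual search for 'C2A0' in the hex dump can match across a byte boundary. Both function names are hypothetical; replace_nbsp_hex only mimics the unhex(replace(hex(...))) expression above for comparison.

```python
def replace_nbsp_bytes(raw: bytes) -> bytes:
    """Replace the UTF-8 non-breaking space (0xC2 0xA0) with a plain space, byte-aligned."""
    return raw.replace(b"\xc2\xa0", b" ")


def replace_nbsp_hex(raw: bytes) -> bytes:
    """Mimic unhex(replace(hex(col), 'C2A0', '20')): a text replace on the hex dump."""
    return bytes.fromhex(raw.hex().upper().replace("C2A0", "20"))


sample = "Hello\u00a0world".encode("utf-8")
print(replace_nbsp_bytes(sample))  # both approaches agree on this input
print(replace_nbsp_hex(sample))

# "<*\n" is 3C 2A 0A in hex, which contains "C2A0" starting in the middle of a byte.
tricky = b"<*\n"
print(replace_nbsp_bytes(tricky))  # unchanged
print(replace_nbsp_hex(tricky))    # altered by the misaligned match
```

For a single known row, as in the query above, the hex approach is fine; the byte-aligned version is the safer choice when running over a whole table.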
I have had this problem and it is annoying, but solvable. As well as Â you may find you have a whole load of characters showing up in your data like these:
“
This is connected to encoding changes in the database, but so long as you do not have any of these characters in your database that you want to keep (e.g. if you are actually using a Euro symbol) then you can strip them out with a few MySQL commands as previously suggested.
In my case I had this problem with a Wordpress database that I had inherited, and I found a useful set of pre-formed queries that work for Wordpress here http://digwp.com/2011/07/clean-up-weird-characters-in-database/
It's also worth noting that one of the causes of the problem in the first place is opening a database in a text editor which might change the encoding in some way. So if you can possibly manipulate the database using MySQL only and not a text editor this will reduce the risk of causing further trouble.

Unicode escape sequence in command line MySQL

Short version:
What kind of escape sequence can one use to search for unicode characters in command line mysql?
Long version:
I'm looking for a way to search a column for records containing a unicode sequence, U+200B, in mysql from the command line. I can't figure out which kind of escape to use. I've tried \u200B and x200B and even ​ I finally found one blog that suggested the _utf8 syntax. This will produce the character on the command line:
select _utf8 x'200B';
Now I'm stuck trying to get that working in a "LIKE" query.
This generates the characters, but the % seem to lose their special meaning when placed in the LIKE part:
select _utf8 x'0025200B0025';
I also tried a concat but it didn't work either:
select concat('%', _utf8 x'200B', '%');
More background:
I have some data that has zero width space characters (zwsp) in it, Unicode Point U+200B. This is typically caused by copy/paste from websites that use the zwsp in their output. With most unicode characters, I can just paste the character into the terminal (or create it with a keycode), but since this one is invisible it's a bit more challenging. I can create a file that generates a "%%" sequence and copy/paste it to the terminal and it will work but it leaves my command history and terminal output screwy. I would think there is a straightforward way to do this in MySQL, but so far I've come up short.
Thanks in advance,
-Paul Burney
select _utf8 x'0025200B0025';
That's not UTF-8, it's UTF-16/UCS-2. You might be able to say SELECT _ucs2 0x0025200B0025 if you have UCS-2 support in your copy of MySQL.
Otherwise, the byte sequence encoding character U+200B in UTF-8 would be 0xE2, 0x80, 0x8B:
select 0xE2808B;
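The byte sequences mentioned in this answer can be checked quickly in Python. The sketch below only computes the hex digits for U+200B and for the pattern %U+200B% in each encoding; how they are then used in a query is left as in the answer above.

```python
ZWSP = "\u200b"              # zero width space
pattern = "%" + ZWSP + "%"   # the LIKE pattern from the question

# UTF-8 encodes U+200B as three bytes.
utf8_hex = ZWSP.encode("utf-8").hex().upper()

# The bytes tried in the question are the big-endian UTF-16 (UCS-2) form of the pattern.
ucs2_pattern_hex = pattern.encode("utf-16-be").hex().upper()

# The same pattern as UTF-8 bytes.
utf8_pattern_hex = pattern.encode("utf-8").hex().upper()

print(utf8_hex)
print(ucs2_pattern_hex)
print(utf8_pattern_hex)
```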
If it is Linux then hold Ctrl + Shift + U then release the U and type 200B.

Escape characters in MySQL, in Ruby

I have a couple escaped characters in user-entered fields that I can't figure out.
I know they are the "smart" single and double quotes, but I don't know how to search for them in mysql.
When output from Ruby, the characters look like \222, \223, \224, etc.
irb> "\222".length => 1
So - do you know how to search for these in mysql? When I look in mysql, they look like '?'.
I'd like to find all records that have this character in the text field. I tried
mysql> select id from table where field LIKE '%\222%'
but that did not work.
Some more information - after doing a mysqldump, this is how one of the characters is represented - '\\xE2\\x80\\x99'. It's the smart single quote.
Ultimately, I'm building an RTF file and the characters are coming out completely wrong, so I'm trying to replace them with 'dumb' quotes for now. I was able to do a gsub(/\222/, "'").
Thanks.
I don't quite understand your problem but here is some info for you:
First, there are no escaped characters in the database, because every character is stored as is, with no escaping.
They don't really "look like ?"; that is just wrong terminal settings. A SET NAMES query should always be executed first, to match the client encoding.
You have to determine the character set and use it at every stage: in the database, in the mysql client, and in Ruby.
You should distinguish Ruby's string representation from the character itself.
To enter the character in a mysql query, you can use the CHAR function, but only in the terminal. In Ruby, just use the character itself.
The smart quotes look like multi-byte Unicode encoded characters (the three bytes in your dump match UTF-8). You have to determine your encoding first.
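To make the point about encodings concrete, here is a small Python sketch relating the two representations from the question. It assumes the octal escapes \222 to \224 are Windows-1252 bytes, which is an assumption about the data rather than something the question confirms; dumb_quotes is a hypothetical helper.

```python
# Octal \222 is the single byte 0x92. In Windows-1252 that byte is the right single quote.
cp1252_quote = bytes([0o222]).decode("cp1252")

# The dump showed \xE2\x80\x99, which is the UTF-8 encoding of the same character, U+2019.
utf8_quote = b"\xe2\x80\x99".decode("utf-8")

# Map the four common smart quotes to their plain ASCII equivalents.
SMART_TO_DUMB = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def dumb_quotes(text: str) -> str:
    """Replace smart quotes with plain quotes once the text is correctly decoded."""
    return text.translate(SMART_TO_DUMB)


print(cp1252_quote == utf8_quote)
print(dumb_quotes("\u201cIt\u2019s fine\u201d"))
```

The replacement only works reliably after the bytes have been decoded with the right character set, which is why determining the encoding comes first.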