Escape characters in MySQL, in Ruby - mysql

I have a couple escaped characters in user-entered fields that I can't figure out.
I know they are the "smart" single and double quotes, but I don't know how to search for them in mysql.
When output from Ruby, the characters look like \222, \223, \224, etc.:
irb> "\222".length => 1
So - do you know how to search for these in mysql? When I look in mysql, they look like '?'.
I'd like to find all records that have this character in the text field. I tried
mysql> select id from table where field LIKE '%\222%'
but that did not work.
Some more information - after doing a mysqldump, this is how one of the characters is represented - '\\xE2\\x80\\x99'. It's the smart single quote.
Ultimately, I'm building an RTF file and the characters are coming out completely wrong, so I'm trying to replace them with 'dumb' quotes for now. I was able to do a gsub(/\222/, "'").
Thanks.

I don't quite understand your problem, but here is some information:
First, there are no escaped characters in the database: every character is stored as is, with no escaping.
They don't really "look like ?"; that is just wrong terminal settings. A SET NAMES query should always be executed first, to match the client encoding.
You have to determine the character set and use it at every stage: in the database, in the mysql client, and in Ruby.
You should distinguish Ruby's string representation from the character itself.
To enter the character in a MySQL query, you can use the CHAR() function, but only in the terminal. In Ruby, just use the character itself.
The smart quotes look like multi-byte encoded Unicode (the \xE2\x80\x99 in your dump is three bytes). You have to determine your encoding first.
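To make the byte values concrete, here is a minimal sketch in Python (used instead of Ruby purely for illustration). It assumes the stray \222, \223 and \224 bytes are Windows-1252 smart quotes, which is consistent with the \xE2\x80\x99 seen in the mysqldump output. The helper name dumb_quotes is hypothetical.

```python
# Octal \222 is the single byte 0x92; in Windows-1252 that is the right single quote.
raw = b"\222"
smart = raw.decode("cp1252")        # U+2019 RIGHT SINGLE QUOTATION MARK
utf8_bytes = smart.encode("utf-8")  # the three bytes E2 80 99 from the dump

# Hypothetical helper: map the four common smart quotes to plain ASCII quotes.
SMART_TO_DUMB = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def dumb_quotes(text: str) -> str:
    return text.translate(SMART_TO_DUMB)

print(utf8_bytes)
print(dumb_quotes("It\u2019s \u201cquoted\u201d"))
```

The same character therefore appears as one byte or three depending on the encoding, which is why the encoding has to be settled before searching for it.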

Related

SQL - Changing " or ' in SQL so adding text is easier

So I am building a database of all my text messages to get information about my habits and I'm having trouble importing the contents of the messages. Whenever there are apostrophes (often) or quotation marks (not as rare as you might think), I get syntax issues.
Is there a way to make MySQL use something other than " or ' to enclose strings? (Specifically, the field is a VARCHAR.) If I could use a ~ or some other character rarely used in text messaging, my life would become a whole lot easier.
Preferably you should use parameterised queries, then your database connector takes care of sending the strings to the database in the correct way.
If you need to build the queries by concatenating the values into a query, you need to escape the strings correctly to make them string literals in the SQL code.
Stick to one delimiter for strings; don't use apostrophes around some strings and quotation marks around others, as that only makes it harder to escape them correctly. I suggest that you use apostrophes, as that is what the SQL standard specifies.
To escape a string correctly so that it can be a string literal delimited by apostrophes, you should:
Replace all backslashes by double backslashes, then
Replace all apostrophes by a backslash and an apostrophe
For example, to make the string It's an "example" with a backslash(\). into a string literal, it should end up like this in a query:
insert into Table (txt) values ('It\'s an "example" with a backslash(\\).')
Note: This is a correct way to escape strings for MySQL. Other databases may use different characters for escaping and need other characters to be escaped, so using this for any other database may fail or, even worse, open you up to SQL injection attacks.
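The two replacement steps above can be sketched in Python as follows. This is only an illustration of the ordering and covers nothing beyond the two replacements listed; the function name escape_mysql_literal is hypothetical, and parameterised queries remain the preferable route.

```python
def escape_mysql_literal(value: str) -> str:
    """Wrap a string as an apostrophe-delimited MySQL string literal."""
    # Order matters: double the backslashes first, then escape apostrophes.
    # Reversing the steps would double the backslash added for each apostrophe.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

text = 'It\'s an "example" with a backslash(\\).'
query = "insert into Table (txt) values (" + escape_mysql_literal(text) + ")"
print(query)
```

The printed query matches the insert statement shown in the example above.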

Mysql replace all special unicode characters with their ascii counterpart

I have a field with collation utf8_general_ci in which many values contain non-ASCII characters. I want to:
Search for all fields with any non-ascii characters
Replace all non-ascii characters with their corresponding ascii version.
For example: côte-d'ivoire should be replaced with cote-d'ivoire, são-tomé should be replaced with sao-tome, etc.
How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using
RLIKE '%[^a-z]%'
For example
SELECT columname
FROM tablename
WHERE NOT columname REGEXP '[a-z]';
returns an empty set.
Thanks
An sql fiddle example is at
http://www.sqlfiddle.com/#!2/c1d90/1/0
the query to select is
select * from test where maintext rlike '[^\x00-\x7F]'
Hope this helps
I'm presuming from your previous questions that you're using PHP.
https://github.com/silverstripe-labs/silverstripe-unidecode
You could then use skv's answer to return the objects you wish to use, and then use unidecode to attempt to convert them to their ASCII equivalents.
In Perl, you can use Text::Unidecode.
In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.
There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.
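As a rough illustration of doing the detection and replacement outside the database, here is a minimal Python sketch. Note that it does not use Unidecode: it swaps in the standard library's Unicode decomposition, which only handles accented letters that decompose into a base letter plus combining marks (characters such as "ø" or "ß" pass through unchanged). The function names are hypothetical.

```python
import re
import unicodedata

# Same character class as the RLIKE pattern above: anything outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def has_non_ascii(text: str) -> bool:
    return NON_ASCII.search(text) is not None

def strip_accents(text: str) -> str:
    # NFKD splits an accented letter into a base letter plus combining marks;
    # dropping the combining marks leaves the base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for value in ["c\u00f4te-d'ivoire", "s\u00e3o-tom\u00e9", "plain"]:
    print(has_non_ascii(value), strip_accents(value))
```

For broader coverage, one of the Unidecode ports mentioned above would be the more complete choice.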

Ideas for Find and Replace character

I need to search address fields and change one character to upper case if there is an apartment number. So '521 Main St. #3b' would change to '521 Main St. #3B'.
The way I know to do this would be to write a program that loops through the recordset, looks at the address field for the last character to see if it's an alpha, then if the character before it is a numeric, change the case of the last char and update the record.
Is this something that would be quicker/simpler with regular expressions (haven't ever used)?
If so, is this best done from within a programming environment or using a text editor such as TextMate or vi? The data is in MySQL and Excel, but I can export it to a text file.
Thanks.
I solved this using TextMate which, once I began to understand a little regex, was simple. (details here Regex Syntax for making the last character Uppercase in TextMate)
Still, I wonder if something like sed or awk, (which I started to try out) might be a better tool. And the SQL solution that Olexa provided works. I just don't know how to have it apply to the entire recordset.
If the data is stored in MySQL, then it is better to process it there:
UPDATE addresses
SET address = CONCAT(LEFT(address, CHAR_LENGTH(address) - 1), UPPER(RIGHT(address, 1)))
WHERE address REGEXP BINARY '#[[:digit:]]+[[:lower:]]{1}$'
;
I've added BINARY because otherwise REGEXP is not case-sensitive, but BINARY may need to be omitted to support multi-byte strings. In this case, surplus updates will be made, but the result would be correct anyway.
P. S. An example on SQL Fiddle showing which values are affected, and how they are affected: http://sqlfiddle.com/#!2/b29326/1
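For the exported-text-file route, the same rule can be sketched with a regular expression in Python. This is a minimal sketch assuming ASCII apartment letters; the function name upcase_apartment_letter is hypothetical.

```python
import re

# '#', one or more digits, then a single lowercase ASCII letter at the very end.
APT_SUFFIX = re.compile(r"(#\d+)([a-z])$")

def upcase_apartment_letter(address: str) -> str:
    # Keep the '#<digits>' part and upper-case only the trailing letter.
    return APT_SUFFIX.sub(lambda m: m.group(1) + m.group(2).upper(), address)

for addr in ["521 Main St. #3b", "521 Main St. #3B", "12 Elm Rd."]:
    print(upcase_apartment_letter(addr))
```

Addresses that do not end in the apartment pattern are returned unchanged, so the function can be applied to every row.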

MySQL Query to Identify bad characters?

We have some tables that were set to the Latin character set instead of UTF-8, which allowed bad characters to be entered into the tables. The usual culprit is people copying and pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertible characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
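The effect and the heuristic can be reproduced outside MySQL with a short Python sketch. It is only an approximation: Python's "replace" error handler stands in for the CONVERT behaviour, [A-Za-z0-9] stands in for [[:alnum:]], and the sample rows are made up.

```python
import re

# Each character that latin-1 cannot represent becomes "?" when encoding
# with errors="replace". The input is the Cyrillic word used above.
damaged = "\u0442\u0435\u0441\u0442".encode("latin-1", errors="replace").decode("latin-1")

# A "?" directly followed by a letter or digit is suspicious.
SUSPICIOUS = re.compile(r"\?[A-Za-z0-9]")

rows = ["Is this ok?", "?ello world", "na?ve", "caf? au lait"]
suspect = [row for row in rows if SUSPICIOUS.search(row)]

print(damaged)
print(suspect)
```

The row "caf? au lait" shows the limit of the heuristic: a damaged character at the end of a word is not flagged, which is why the query is only a starting point.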
You're probably noticing something like this 'bug'. The 'bad characters' are most likely UTF-8 control characters (e.g. \x80). You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
Take a look at this Q/A (it's all about your client encoding aka SET NAMES )

Unicode escape sequence in command line MySQL

Short version:
What kind of escape sequence can one use to search for unicode characters in command line mysql?
Long version:
I'm looking for a way to search a column for records containing a Unicode character, U+200B, in mysql from the command line. I can't figure out which kind of escape to use. I've tried \u200B and x200B and even an HTML entity. I finally found one blog that suggested the _utf8 syntax. This will produce the character on the command line:
select _utf8 x'200B';
Now I'm stuck trying to get that working in a "LIKE" query.
This generates the characters, but the % seem to lose their special meaning when placed in the LIKE part:
select _utf8 x'0025200B0025';
I also tried a concat but it didn't work either:
select concat('%', _utf8 x'200B', '%');
More background:
I have some data that has zero width space characters (zwsp) in it, Unicode Point U+200B. This is typically caused by copy/paste from websites that use the zwsp in their output. With most unicode characters, I can just paste the character into the terminal (or create it with a keycode), but since this one is invisible it's a bit more challenging. I can create a file that generates a "%%" sequence and copy/paste it to the terminal and it will work but it leaves my command history and terminal output screwy. I would think there is a straightforward way to do this in MySQL, but so far I've come up short.
Thanks in advance,
-Paul Burney
select _utf8 x'0025200B0025';
That's not UTF-8, it's UTF-16/UCS-2. You might be able to say SELECT _ucs2 0x0025200B0025 if you have UCS-2 support in your copy of MySQL.
Otherwise, the byte sequence encoding character U+200B in UTF-8 would be 0xE2, 0x80, 0x8B:
select 0xE2808B;
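The difference between the two byte sequences can be checked with a few lines of Python. The final query string is only a sketch of how the UTF-8 bytes might be combined with LIKE; the table and column names are hypothetical and the query has not been verified against a server here.

```python
zwsp = "\u200b"  # U+200B ZERO WIDTH SPACE

utf8_hex = zwsp.encode("utf-8").hex().upper()       # three bytes in UTF-8
utf16_hex = zwsp.encode("utf-16-be").hex().upper()  # two bytes in UTF-16/UCS-2

# The bytes 00 25 20 0B 00 25 only read as '%', ZWSP, '%' when decoded as UTF-16.
as_utf16 = bytes.fromhex("0025200B0025").decode("utf-16-be")

# Hypothetical table and column names.
query = f"SELECT id FROM mytable WHERE mycolumn LIKE CONCAT('%', 0x{utf8_hex}, '%')"

print(utf8_hex, utf16_hex, as_utf16 == "%\u200b%")
print(query)
```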
If you are on Linux, hold Ctrl + Shift + U, then release the U and type 200B.