Our Rails 3 app needs to accept foreign characters like ä and こ and save them to our MySQL db, whose character set is 'utf8'.
One of our models runs a validation that strips all the non-word characters from its name before it is saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
string.gsub!(/\W/,'')
end
This stripped out bad characters, but preserved things like 'ä', 'こ', and '3'. With Ruby 1.9's new encodings, however, that statement no longer works: it now removes those characters along with the ones we actually want gone. I am trying to find a way to strip the unwanted characters while keeping these.
Changing the gsub to something like this:
def strip_non_words(string)
string.gsub!(/[[:punct:]]/,'')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
Iconv.conv('LATIN1', 'UTF8', string)
string.gsub!(/[[:punct:]]/,'')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know of a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. End result looks like this:
def strip_non_words(string)
string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
string_encoded.gsub!(/\p{Word}+/, '') # non-word characters
string_reencoded = string_encoded.force_encoding('ISO-8859-1')
string_reencoded #return
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/
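For comparison, Ruby 1.9+ regexes are encoding-aware, so a negated Unicode property class can strip punctuation directly on the UTF-8 string without any re-encoding. A minimal sketch (an alternative, not the answer above):

# Minimal sketch, assuming the input is valid UTF-8: \p{Word} covers letters
# (including 'ä' and 'こ'), digits and underscore, so the negated class
# removes only punctuation and whitespace.
def strip_non_words(string)
  string.gsub(/[^\p{Word}]+/, '')
end

strip_non_words("こäè 3!?")  # => "こäè3"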
I call an API to get some info, and sometimes the response contains values like the one below.
"address": "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, "
How can I detect these and convert them to the correct Latin letters? I want to upload this data to a MySQL database. Right now it throws the following warning.
Warning: (1366, "Incorrect string value: '\\xC2\\x88ME A...' for column 'address' at row 1")
I'm using pymysql to insert this info into the DB.
The example data was originally encoded as UTF-8, but decoded as latin1. You can reverse the process to fix it, or read it from the source using UTF-8 to begin with:
>>> s = "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, "
>>> s.encode('latin1').decode('utf8')
'BOULEVARD DU MÉROU - SN PEÏRE, '
You can use str's .encode() method, followed by a UTF-8 decode:
>>> "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, ".encode("latin-1").decode("utf-8")
'BOULEVARD DU MÉROU - SN PEÏRE, '
Be aware, though, that if the API response contains any characters that cannot be encoded in latin-1, you'll hit a UnicodeEncodeError.
If at all possible, rather than doing this you'll probably want to change the character set of your MySQL database to UTF-8.
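If you can change it, a rough sketch of a pymysql connection that speaks UTF-8 end to end might look like this (the host, credentials, and the places/address table are assumptions, not from the question):

import pymysql

# Sketch only: connect with an explicit utf8mb4 charset so the driver and the
# server agree on UTF-8, then insert the repaired string. Connection details
# and the target table are made up for illustration.
conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="mydb", charset="utf8mb4")

fixed = "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, ".encode("latin1").decode("utf8")

with conn.cursor() as cur:
    cur.execute("INSERT INTO places (address) VALUES (%s)", (fixed,))
conn.commit()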
It looks like you have multiple errors -- "double encoding" and stray Unicode "codepoints" -- which makes it hard to unravel what went wrong.
It would be better to go back to the source and fix the encoding at each stage -- not to try to encode/decode after the mess is made. In almost all cases no conversion code is needed if you specify UTF-8 at every stage.
Here are some notes on what to do in Python: http://mysql.rjweb.org/doc.php/charcoll#python
The hex for É should be C389 and the hex for Ï should be C38F. There should be no \uxxxx except in HTML. Even in HTML, it is normally better to simply use the utf8 encoding, since HTML can handle it.
I have a Rails application which accepts JSON data from third-party sources, and I believe I am running up against some ActiveRecord behind-the-scenes magic which is recognizing ASCII-8BIT characters in hashes extracted from the JSON and saving them as such to my database no matter what I do.
Here is a simplified description of the class ...
class MyClass < ActiveRecord::Base
serialize :data
end
and of how an object is created ...
a = MyClass.new
a.data = {
"a" =>
{
"b" => "bb",
"c" => "GIF89a\x01\x00\x01\x00\x00\x00\x00\x00!\vNETSCAPE2.0\x03\x01\x00\x00!\x04\t\x00\x00\x01\x00,\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;"
}
}
I believe those are ASCII-8BIT characters, so fair enough if they are saved as such (despite my attempts to UTF8 everything everywhere). But I need these characters to be UTF-8, because when I go to view them I get:
ActionView::Template::Error ("\xEF" from ASCII-8BIT to UTF-8):
64: <div>
65: <pre><%= mc.prettify %></pre>
66: </div>
app/models/my_class.rb:28:in `prettify'
where line #28 in prettify is:
JSON.pretty_generate(self.data)
So I sought to re-encode any string in the Hash. I built out functionality to do this (with anonymous classes and refinements to Hash, Array and String), but no matter what, ASCII-8BIT is returned to me. In the simplest terms here is what is happening:
mc = MyClass.find(123)
mc.data['a']['c'].encode!(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
mc.data['a']['c'].encoding #=> #<Encoding:UTF-8>
mc.data['a']['c'].valid_encoding? #=> true
mc.save!
mc.reload
mc.data['a']['c'].encoding #=> #<Encoding:ASCII-8BIT> <-- !!!!!
What is ActiveRecord doing to this hash when it saves it? And what can I do to store a hash permanently with all strings encoded to UTF-8 in a serialized MySQL (v5.6, via the mysql2 gem) mediumtext column (using Ruby 2.2.4 and Rails 4.1.4)?
my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
# ...
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8
So, there's not really any such thing as an "ASCII-8BIT" character. ASCII-8BIT to Ruby essentially means 'no encoding at all' -- just bytes, without assuming any encoding. It's a synonym for 'BINARY'.
But if you have bytes that aren't valid UTF-8, they can't really be encoded as UTF-8. Even if the encoding on the string were UTF-8, at best you'd get lots of invalid byte sequence errors when you tried to do anything with it.
What encoding the string will end up tagged as depends on a complicated dance between ActiveRecord and your database itself -- also, the database itself can sometimes actually change your bytes, depending on the database and how it's set up and what you're doing. We could try to debug exactly what you are doing.
But really, the answer is -- if you want it to be UTF-8, it can't have binary non-UTF8 data in it. "ASCII-8BIT" actually is the right encoding for binary data. What are you actually trying to do, where do those weird bytes come from and why do you want them? In general, I'm not sure if it's legal to put arbitrary non-UTF8 bytes in JSON? It might be legal for JSON, but it will probably cause you headaches (such as the one you're dealing with), as it depends on what exactly both rails and your underlying DB are going to do with them.
Just to get around your display error, you could have your prettify method use scrub, added in Ruby 2.1.0, to eliminate bad bytes for the current encoding: value.force_encoding("UTF-8").scrub. That will probably get rid of your error, and may even do the right thing, but it would be better to figure out what is really going on: why you have those weird bytes in the first place and what they are supposed to mean.
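A rough illustration of that suggestion (the deep_scrub helper is assumed for this sketch, not part of the original model):

require 'json'

# Hypothetical reworking of prettify: walk the serialized hash and scrub any
# string that is not valid UTF-8 before handing the data to the JSON generator.
def deep_scrub(value)
  case value
  when String then value.dup.force_encoding("UTF-8").scrub("")
  when Hash   then value.each_with_object({}) { |(k, v), h| h[k] = deep_scrub(v) }
  when Array  then value.map { |v| deep_scrub(v) }
  else value
  end
end

def prettify
  JSON.pretty_generate(deep_scrub(self.data))
end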
Possible Duplicate:
How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Background:
I am using Django with MySQL 5.1 and I am having trouble with 4-byte UTF-8 characters causing fatal errors throughout my web application.
I've used a script to convert all tables and columns in my database to UTF-8 which has fixed most unicode issues, but there is still an issue with 4-byte unicode characters. As noted elsewhere, MySQL 5.1 does not support UTF-8 characters over 3 bytes in length.
Whenever I enter a 4-byte unicode character (e.g. 🀐) into a ModelForm on my Django website the form validates and then an exception similar to the following is raised:
Incorrect string value: '\xF0\x9F\x80\x90' for column 'first_name' at row 1
My question:
What is a reasonable way to avoid fatal errors caused by 4-byte UTF-8 characters in a Django web application with a MySQL 5.1 database.
I have considered:
Selectively disabling MySQL warnings to avoid specifically that error message (not sure whether that is possible yet)
Creating middleware that will look through the request.POST QueryDict and substitute/remove all invalid UTF8 characters
Somehow hook/alter/monkey patch the mechanism that outputs SQL queries for Django or for MySQLdb to substitute/remove all invalid UTF-8 characters before the query is executed
Example middleware for replacing invalid characters (inspired by this SO question):
import re

class MySQLUnicodeFixingMiddleware(object):
    INVALID_UTF8_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def process_request(self, request):
        """Replace 4-byte unicode characters by REPLACEMENT CHARACTER"""
        request.POST = request.POST.copy()
        for key, values in request.POST.iterlists():
            request.POST.setlist(key,
                [self.INVALID_UTF8_RE.sub(u'\uFFFD', v) for v in values])
Do you have an option to upgrade MySQL? If you do, you can upgrade and set the encoding to utf8mb4.
Assuming that you don't have the option, I see these options for you:
1) Add JavaScript / frontend validations to prevent entry of anything other than 1-, 2-, or 3-byte Unicode characters,
2) Supplement that with a cleanup function in your models to strip the data of any 4-byte Unicode characters (which would be your option 2 or 3); a rough sketch follows after this answer.
At the same time, it does look like your users are in fact using 4-byte characters. If there is a business case for using them in your application, you could go to the powers that be and request an upgrade.
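A sketch of that model-level cleanup (the Person model and field name are assumptions, not from the question):

import re

from django.db import models

# Reuse the same BMP-only pattern as the middleware above: anything outside it
# needs 4 bytes in UTF-8 and gets replaced with U+FFFD before the save reaches
# MySQL 5.1.
FOUR_BYTE_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

class Person(models.Model):
    first_name = models.CharField(max_length=100)

    def save(self, *args, **kwargs):
        self.first_name = FOUR_BYTE_RE.sub(u'\uFFFD', self.first_name)
        super(Person, self).save(*args, **kwargs)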
In Ruby, I have a bunch of strings encoded in UTF-8, e.g.: "HEC Montr\u00e9al".
When I insert it into my MySQL table (collation utf8_general_ci) using the 'mysql' gem, the backslash is removed. What gives :) ? Does anyone have any idea what is going on here?
edit:
example string:
>> p mystring
"HEC Montr\\u00e9al"
and in the database after insert:
HEC Montru00e9al
This is not UTF:
'HEC Montr\u00e9al'
That's an ASCII representation of a JSON-encoded Unicode string. If it were UTF-8, it would look like:
'HEC Montréal'
You're not properly decoding your JSON inputs somewhere or your client-side code is sending your server JSON when your server is expecting plain text.
First you need to figure out why you're getting JSON encoded strings when you're not expecting them or figure out why you're not properly decoding your JSON. Then you can see if the database is mangling your UTF-8.
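For illustration, running the escaped value through a JSON parser restores the accented character. A minimal sketch (the array wrapper just keeps older json gems happy with a non-object top level):

require 'json'

raw  = '["HEC Montr\u00e9al"]'
name = JSON.parse(raw).first   # => "HEC Montréal"
name.encoding                  # => #<Encoding:UTF-8>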
I believe you have to explicitly tell the MySQL gem to expect utf8. Something like this:
db = Mysql.init
db.options(Mysql::SET_CHARSET_NAME, 'utf8')
db.real_connect(...
I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure which iconv PHP uses, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
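If PHP's iconv is backed by glibc or GNU libiconv, a single transliteration pass to ASCII may already do most of the work. A sketch (the sample string is made up; exact output depends on the iconv build and locale):

<?php
// Sketch: transliterate accented characters to their closest ASCII equivalent
// and drop anything without a mapping. Requires an iconv build that honors
// the //TRANSLIT and //IGNORE suffixes.
setlocale(LC_CTYPE, 'en_US.UTF-8');
$name = "Crème brûlée – l’exemple";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
// typically: Creme brulee - l'exemple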
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii library that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...