I have a Rails application that accepts JSON data from third-party sources, and I believe I am running up against some behind-the-scenes ActiveRecord magic that recognizes ASCII-8BIT strings in hashes extracted from the JSON and saves them as such to my database, no matter what I do.
Here is a simplified description of the class ...
class MyClass < ActiveRecord::Base
serialize :data
end
and of how an object is created ...
a = MyClass.new
a.data = {
  "a" => {
    "b" => "bb",
    "c" => "GIF89a\x01\x00\x01\x00\x00\x00\x00\x00!\vNETSCAPE2.0\x03\x01\x00\x00!\x04\t\x00\x00\x01\x00,\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;"
  }
}
I believe those are ASCII-8BIT characters, so fair enough if they are saved as such (despite my attempts to UTF8 everything everywhere). But I need these characters to be UTF-8, because when I go to view them I get:
ActionView::Template::Error ("\xEF" from ASCII-8BIT to UTF-8):
64: <div>
65: <pre><%= mc.prettify %></pre>
66: </div>
app/models/my_class.rb:28:in `prettify'
where line #28 in prettify is:
JSON.pretty_generate(self.data)
So I sought to re-encode any string in the Hash. I built out functionality to do this (with anonymous classes and refinements to Hash, Array and String), but no matter what, ASCII-8BIT is returned to me. In the simplest terms here is what is happening:
mc = MyClass.find(123)
mc.data['a']['c'].encode!(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
mc.data['a']['c'].encoding #=> #<Encoding:UTF-8>
mc.data['a']['c'].valid_encoding? #=> true
mc.save!
mc.reload
mc.data['a']['c'].encoding #=> #<Encoding:ASCII-8BIT> <-- !!!!!
What is ActiveRecord doing to this hash when it saves it? And what can I do to store a hash permanently with all strings encoded to UTF-8 in a serialized MySQL (v5.6, via the mysql2 gem) mediumtext column (using Ruby 2.2.4 and Rails 4.1.4)?
my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
# ...
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8
So, there's not really any such thing as an "ASCII-8BIT" character. ASCII-8BIT to Ruby essentially means 'no encoding at all' -- just bytes, without assuming any encoding. It's a synonym for 'BINARY'.
But if you have bytes that aren't valid UTF-8, they can't really be encoded as UTF-8. Even if the string were tagged as UTF-8, at best you'd get lots of invalid-byte-sequence errors when you tried to do anything with it.
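In fact, the exact error from the question falls out of this. A quick irb sketch (the "\xEF" byte is just an example):

s = "\xEF".force_encoding("ASCII-8BIT")
s.valid_encoding?  # => true -- any byte sequence is "valid" binary
s.encode("UTF-8")  # => Encoding::UndefinedConversionError:
                   #    "\xEF" from ASCII-8BIT to UTF-8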
What encoding the string ends up tagged as depends on a complicated dance between ActiveRecord and your database -- and the database can sometimes actually change your bytes, depending on which database it is, how it's set up, and what you're doing. We could try to debug exactly what you are doing.
But really, the answer is -- if you want it to be UTF-8, it can't have binary non-UTF-8 data in it. "ASCII-8BIT" actually is the right encoding for binary data. What are you actually trying to do? Where do those weird bytes come from, and why do you want them? Strictly speaking, JSON text is supposed to be Unicode (UTF-8 per RFC 8259), so arbitrary binary bytes aren't really legal in it, and either way they will cause you headaches (such as the one you're dealing with), since the outcome depends on what exactly both Rails and your underlying DB do with them.
Just to get around your display error, you could have your prettify method use scrub, added in Ruby 2.1.0, to eliminate bytes that are invalid in the current encoding: value.force_encoding("UTF-8").scrub. That will probably get rid of your error, and may even do the right thing, but it would be better to figure out what is really going on: why you want those weird bytes in the first place, and what they are supposed to mean, for what purpose.
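A minimal sketch of that approach, assuming data only nests hashes, arrays and strings as in the question (scrub_values is a hypothetical helper name, not Rails API):

def prettify
  JSON.pretty_generate(scrub_values(data))
end

private

# Recursively copy the structure, replacing invalid bytes in every
# string with nothing, so the result is clean UTF-8 throughout.
def scrub_values(obj)
  case obj
  when String then obj.dup.force_encoding("UTF-8").scrub("")
  when Hash   then obj.each_with_object({}) { |(k, v), h| h[k] = scrub_values(v) }
  when Array  then obj.map { |v| scrub_values(v) }
  else obj
  end
end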
Related
I am having a tough time understanding how to use binary datatypes with Redis. I want to use the command
set '{binary data}' 'Alex'
What if the binary data actually includes a quote symbol or \r\n? I know I can escape characters, but is there an official list of characters I need to escape?
Arbitrary bytes can be input in redis-cli using hexadecimal notation, e.g.
set "\x00\xAB\x20" "some value"
There's no need to do anything special with the data itself. All Redis strings are binary safe.
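For instance, from a real client library the raw bytes go through untouched. A sketch using the Ruby redis gem, assuming a local server on the default port:

require 'redis'

r = Redis.new                   # localhost:6379 by default
r.set("\x00\xAB\x20", "Alex")   # raw bytes as the key; nothing to escape
r.get("\x00\xAB\x20")           # => "Alex"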
Your problem relates to redis-cli (which is a very nice redis client for getting to know Redis, but almost never what you want in production, because of usage and performance issues).
Your problem also relates to common (bash/sh/other) terminal escaping. Here's a nice explanation.
I suggest you use Python for this, or any other language you are comfortable with.
Example:
import redis

cli = redis.Redis('localhost', 6379)
with open('data.txt', 'rb') as f:       # binary mode keeps the raw bytes
    for d in f:
        t = d.partition(b'\t')          # bytes separator for a bytes line
        cli.set(t[0], t[2].rstrip())
#EOF
You can send the command as an array of bulk strings to Redis, no need to escape characters or Base64 encode. Since bulk strings begin with the data length, Redis doesn't try to parse the data bytes and instead just jumps to the end to verify the terminating CR/LF pair:
*3<crlf>
$3<crlf>SET<crlf>
${binary_key_length}<crlf>{binary_key_data}<crlf>
${binary_data_length}<crlf>{binary_data}<crlf>
I found it is best to use the Redis protocol for this, since the length prefixes define each boundary before the data itself, so the data bytes never need escaping.
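A minimal sketch of that framing in Ruby, assuming a local Redis on the default port (resp_set is a hypothetical helper, not part of any client library):

require 'socket'

# Write one binary-safe SET as a raw RESP frame. The $<bytesize>
# length prefixes are what let arbitrary bytes (quotes, CR/LF,
# NULs) pass through unescaped.
def resp_set(sock, key, value)
  frame = "*3\r\n".b
  ["SET", key, value].each do |part|
    bytes = part.b
    frame << "$#{bytes.bytesize}\r\n" << bytes << "\r\n"
  end
  sock.write(frame)
  sock.gets  # => "+OK\r\n" on success
end

sock = TCPSocket.new("localhost", 6379)
resp_set(sock, "\x00\xAB\x20", "Alex")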
I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert through my Perl script, the name shows up with junk characters in place of é and ô in the database. I need to know how to properly store the name through my script. The part of the code that does the insert looks like this.
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title hold the correct data, which I have checked by printing them just before the insert. Please assist.
You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking of how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the A character. When you start to think about how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far. So a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them.

As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good/looks bad when you store it and retrieve it.
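To make "quite different bit strings" concrete (sketched in Ruby only because it makes the bytes easy to print; the byte values are the same in Perl):

# The same character, two encodings, two byte sequences:
"é".encode("ISO-8859-1").bytes  # => [233]       (0xE9 in latin1)
"é".bytes                       # => [195, 169]  (0xC3 0xA9 in UTF-8)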
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
The Encode module for Perl
character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.
There are a few things to check.
First you have to make sure that Perl understands that the data moving between your program and the DB is encoded as UTF-8 (I expect your databases and tables are set up properly). For this you need to say so explicitly when connecting to the database, like this:
my $dbh = DBI->connect(
    'dbi:mysql:test',
    'user',
    'password',
    {
        mysql_enable_utf8 => 1,  # exchange data with MySQL as UTF-8
    }
);
Next, your output streams must be set to encode data as UTF-8 on the way out. For this I like the pretty good module:
use utf8::all;
But this module is not in core, so you may want to set the layers yourself with binmode instead (the stricter spelling of the layer is ":encoding(UTF-8)"):
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you deal with web pages, you have to make sure the browser understands that you are sending your data encoded as UTF-8. For that, your HTTP headers should include the charset:
Content-Type: text/html; charset=utf-8
and you can set it with an HTML meta tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now your road should be covered from the database all the way to the browser.
Our Rails 3 app needs to be able to accept foreign characters like ä and こ, and save them to our MySQL db, which has its character_set as 'utf8.'
One of our models runs a validation that strips out all the non-word characters in its name before it is saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
string.gsub!(/\W/,'')
end
This stripped out bad characters, but preserved things like 'ä', 'こ', and '3'. With Ruby 1.9's new encodings, however, that statement no longer works: it now removes those characters as well as the ones we don't want. I am trying to find a way to strip only the unwanted characters while keeping the rest.
Changing the gsub to something like this:
def strip_non_words(string)
  string.gsub!(/[[:punct:]]/,'')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
  string = Iconv.conv('LATIN1', 'UTF8', string)
  string.gsub!(/[[:punct:]]/,'')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. The end result looks like this:
def strip_non_words(string)
  string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
  string_encoded.gsub!(/[^\p{Word}]+/, '') # strip runs of non-word characters
  string_reencoded = string_encoded.force_encoding('ISO-8859-1')
  string_reencoded # return
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/
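For what it's worth, on Ruby 1.9+ the encoding dance may be avoidable entirely: POSIX bracket classes are Unicode-aware against UTF-8 strings, so a sketch like the following (same strip_non_words idea, no re-encoding) keeps 'ä', 'こ' and '3' while dropping punctuation:

def strip_non_words(string)
  # [[:word:]] matches Unicode letters, digits and underscore
  # on a UTF-8 string, so 'ä', 'こ' and '3' all survive.
  string.gsub(/[^[:word:]]/, '')
end

strip_non_words("ä-こ.3!")  # => "äこ3"

The Illegal mix of collations error in the question is a separate problem: it points at a latin1 column meeting utf8 data, which is a schema/connection setting rather than anything the regex can fix.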
I'm pulling some RSS feeds in from YouTube which have invalid UTF8. I can create a similar ruby string using
bad_utf8 = "\u{61B36}"
bad_utf8.encoding # => #<Encoding:UTF-8>
bad_utf8.valid_encoding? # => true
Ruby thinks this is a valid UTF-8 encoding and I'm pretty sure it isn't.
When talking to MySQL I get an error like so:
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("use test");
bad_utf8 = "\u{61B36}"
client.query("INSERT INTO utf8 VALUES ('#{moo}')")
# Incorrect string value: '\xF1\xA1\xAC\xB6' for column 'string' at row 1 (Mysql2::Error)
How can I detect or fix up these invalid types of encodings before I send them off to MySQL?
I don't rely on Ruby's built-in String.valid_encoding?, because the following is also possible:
irb
1.9.3-p125 :001 > bad_utf8 = "\u{0}"
=> "\u0000"
1.9.3-p125 :002 > bad_utf8.valid_encoding?
=> true
1.9.3-p125 :003 > bad_utf8.encoding
=> #<Encoding:UTF-8>
This is valid UTF-8 (reference: https://en.wikipedia.org/wiki/Utf8), but I have found that the presence of the NULL character in a string is often a hint of a previous conversion error (e.g. when transcoding based on invalid encoding information found in HTML pages).
I created my own validation function for "Modified UTF-8", which can take a :bmp_only option for restricting validation to the Basic Multilingual Plane (0x1-0xffff). This should be enough for most modern languages (Reference: https://en.wikipedia.org/wiki/Unicode_plane).
Find the validator here: https://gist.github.com/2295531
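A rough sketch of the same idea (not the gist's actual code; valid_bmp_utf8? is a hypothetical name): well-formed UTF-8, no NUL, and no code points above U+FFFF.

def valid_bmp_utf8?(str)
  # Well-formed, and every code point in 0x1..0xFFFF
  str.valid_encoding? &&
    str.each_char.all? { |c| c.ord.between?(0x1, 0xFFFF) }
end

valid_bmp_utf8?("\u{0}")     # => false (NUL is suspicious)
valid_bmp_utf8?("\u{61B36}") # => false (outside the BMP)
valid_bmp_utf8?("héllo")     # => true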
Possibly because the code point doesn't lie in the Basic Multilingual Plane, which contains the only characters that MySQL allows in its "utf8" character set (that charset stores at most three bytes per character, and characters outside the BMP need four).
Newer versions of MySQL have another character set called "utf8mb4" which supports unicode characters outside the BMP.
But you probably don't want to be using that. Consider your use-cases carefully. Few real human languages (if any) use characters outside the BMP.
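If you need to stay on the 3-byte "utf8" charset, here is a minimal sketch for detecting or stripping the offending characters before the INSERT (strip_non_bmp is a hypothetical helper name):

# Drop any character above U+FFFF, i.e. anything MySQL's
# 3-byte "utf8" charset cannot store.
def strip_non_bmp(str)
  str.gsub(/[\u{10000}-\u{10FFFF}]/, '')
end

bad = "\u{61B36}"
bad.each_char.any? { |c| c.ord > 0xFFFF }  # => true, utf8 will reject it
strip_non_bmp(bad)                         # => ""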
I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strtr statement to replace any accented character with its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and GNU libiconv) support the //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure which iconv PHP uses, but the documentation implies that //TRANSLIT and //IGNORE will work there too, e.g. iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $value).
By "url-friendly" I assume you mean something like SO's URLs, where everything is converted to [a-z-] -- the text between <a>...</a> tags can be anything, so it's the URL itself that needs cleaning.
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library provides a utf8_to_ascii function that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may simply be because there's no direct representation of the character in the destination character set. So, never mind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...