I'm pulling in some RSS feeds from YouTube that contain invalid UTF-8. I can create a similar Ruby string using
bad_utf8 = "\u{61B36}"
bad_utf8.encoding # => #<Encoding:UTF-8>
bad_utf8.valid_encoding? # => true
Ruby thinks this is a valid UTF-8 encoding and I'm pretty sure it isn't.
When talking to MySQL I get an error like this:
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("use test");
bad_utf8 = "\u{61B36}"
client.query("INSERT INTO utf8 VALUES ('#{bad_utf8}')")
# Incorrect string value: '\xF1\xA1\xAC\xB6' for column 'string' at row 1 (Mysql2::Error)
How can I detect or fix up these invalid types of encodings before I send them off to MySQL?
I don't rely on Ruby's built-in String#valid_encoding?, because the following is also possible:
irb
1.9.3-p125 :001 > bad_utf8 = "\u{0}"
=> "\u0000"
1.9.3-p125 :002 > bad_utf8.valid_encoding?
=> true
1.9.3-p125 :003 > bad_utf8.encoding
=> #<Encoding:UTF-8>
This is valid UTF-8 (reference: https://en.wikipedia.org/wiki/Utf8), but I have found that the presence of the NUL character in a string is often a hint of a previous conversion error (e.g. when transcoding based on invalid encoding information found in HTML pages).
I created my own validation function for "Modified UTF-8", which can take a :bmp_only option for restricting validation to the Basic Multilingual Plane (0x1-0xffff). This should be enough for most modern languages (Reference: https://en.wikipedia.org/wiki/Unicode_plane).
Find the validator here: https://gist.github.com/2295531
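The gist has the full version; as a minimal sketch of the idea (not the gist's actual code), the check boils down to rejecting NUL and, optionally, anything above the BMP:
def valid_modified_utf8?(str, bmp_only = false)
  return false unless str.valid_encoding?
  # reject NUL outright; with bmp_only, also reject anything above U+FFFF
  str.each_codepoint.all? { |cp| cp > 0 && (!bmp_only || cp <= 0xffff) }
end
valid_modified_utf8?("\u{0}")           # => false
valid_modified_utf8?("\u{61B36}", true) # => false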
Possibly because the code point doesn't lie in the Basic Multilingual Plane, which contains the only characters MySQL allows in its "utf8" character set.
Newer versions of MySQL have another character set called "utf8mb4" which supports Unicode characters outside the BMP.
But you probably don't want to be using that. Consider your use-cases carefully. Few real human languages (if any) use characters outside the BMP.
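If you are stuck on the 3-byte "utf8" charset, one hedged workaround is to strip supplementary-plane characters before inserting; the method name here is mine, for illustration:
# Drop code points above U+FFFF so the string fits MySQL's 3-byte "utf8".
def strip_non_bmp(str)
  str.each_char.reject { |c| c.ord > 0xffff }.join
end
strip_non_bmp("a\u{61B36}b") # => "ab"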
Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code. My Python version of it uses .yaml files as config.
Therefore, I basically had to write a converter between Perl and YAML. Since, from what I found, Perl does not play well with YAML, but there are libs that allow dumping Perl hashes into JSON, and Python works with JSON almost natively, I used JSON as an intermediate format: Perl -> JSON -> YAML. The first conversion is done in Perl code, and the second one in Python code (which also does some mangling of the data).
Using the library mentioned by @simbabque, I can output YAML natively, which afterwards I must modify and play with. As I know next to nothing of Perl, I prefer to do so in Python.
Problem:
The source config files look something like this:
$sites = {
    "0100101001" => {
        mail    => 1,
        from    => 'mail@mail.com',
        to      => 'mail@mail.com',
        subject => 'á é í ó ú',
        msg     => 'á é í ó ú',
        ftp     => 0,
        sftp    => 0,
    },
    "22222222" => {
[...]
And many more of those.
My "parsing" code is the following:
use strict;
use warnings;
# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;
my $conf;
GetOptions('conf=s' => \$conf) or die;
our (
    $sites
);
do $conf;
# my $json = encode_json($sites);
my $yaml = Dump($sites);
binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);
Nothing out of the ordinary. I simply need the YAML version of the Perl data. In fact, it mostly works. My problem is with the encoding.
The output of the above code is this:
[...snip...]
mail: 1
msg: Ã¡ Ã© Ã­ Ã³ Ãº
sftp: 0
subject: Ã¡ Ã© Ã­ Ã³ Ãº
[...snip...]
The encoding goes to hell and back. From what I've read, UTF-8 is the default, and just in case I force it with binmode, but to no avail.
What am I missing here? Any workaround?
Note: I thought it might have been my shell, but locale outputs this:
❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Which seems ok.
Note 2: I know next to nothing of Perl, and it is not my intent to become an expert in it, so any enhancements/tips are greatly appreciated too.
Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.
The sites config file is UTF-8 encoded. Here are three workarounds:
Put the use utf8 pragma inside the site configuration file as well. The use utf8 pragma in the main script is not sufficient for files included with do/require to be treated as UTF-8 encoded.
If that is not feasible, decode the input before you pass it to the JSON encoder. Something like
open my $cfg, "<:encoding(UTF-8)", $conf or die "Cannot open $conf: $!";
do { local $/; eval <$cfg> };  # slurp the whole file, then eval it as Perl
close $cfg;
instead of
do $conf
Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (Unicode code points) and the output is UTF-8 encoded. The output of to_json is not encoded, or rather, it will have the same encoding as the input, which is what you want.
There is no need to encode the final output as UTF-8. Using any of the three workarounds will already produce UTF-8 encoded output.
I'm using Rails 5 to show database content in a web browser.
In the db, all of the special characters are written in their ascii form. For instance, instead of an apostrophe, it's written as &#39;.
Thus, my view is showing the ascii code. Is there a way to convert them all to characters for the view?
To transform ANY string containing HTML character entities, using Rails:
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
The CGI module is in the Ruby standard library and is required by Rails by default. If you want to do the same in a non-Rails project:
require 'cgi'
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
Based on your example, here's a simple Ruby solution if you want to define your own helper:
39.chr # => "'"
'&#39;'.delete('&#;').to_i.chr # => "'"
module ApplicationHelper
  def ascii_to_char(ascii)
    ascii.delete('&#;').to_i.chr
  end
end
# in the views
ascii_to_char('&#39;') # => "'"
If what you really need is full HTML unescaping, see @forsym's answer.
Characters were fed through some "HTML entities" conversion before being stored in the database. Go back and fix that.
I have a Rails application which accepts JSON data from third-party sources, and I believe I am running up against some ActiveRecord behind-the-scenes magic which is recognizing ASCII-8BIT characters in hashes extracted from the JSON and saving them as such to my database no matter what I do.
Here is a simplified description of the class ...
class MyClass < ActiveRecord::Base
  serialize :data
end
and of how an object is created ...
a = MyClass.new
a.data = {
  "a" => {
    "b" => "bb",
    "c" => "GIF89a\x01\x00\x01\x00\x00\x00\x00\x00!\vNETSCAPE2.0\x03\x01\x00\x00!\x04\t\x00\x00\x01\x00,\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;"
  }
}
I believe those are ASCII-8BIT characters, so fair enough if they are saved as such (despite my attempts to UTF8 everything everywhere). But I need these characters to be UTF-8, because when I go to view them I get:
ActionView::Template::Error ("\xEF" from ASCII-8BIT to UTF-8):
64: <div>
65: <pre><%= mc.prettify %></pre>
66: </div>
app/models/my_class.rb:28:in `prettify'
where line #28 in prettify is:
JSON.pretty_generate(self.data)
So I sought to re-encode any string in the Hash. I built out functionality to do this (with anonymous classes and refinements to Hash, Array and String), but no matter what, ASCII-8BIT is returned to me. In the simplest terms here is what is happening:
mc = MyClass.find(123)
mc.data['a']['c'].encode!(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
mc.data['a']['c'].encoding #=> #<Encoding:UTF-8>
mc.data['a']['c'].valid_encoding? #=> true
mc.save!
mc.reload
mc.data['a']['c'].encoding #=> #<Encoding:ASCII-8BIT> <-- !!!!!
What is ActiveRecord doing to this hash when it saves it? And what can I do to store a hash permanently with all strings encoded to UTF-8 in a serialized MySQL (v5.6, via the mysql2 gem) mediumtext column (using Ruby 2.2.4 and Rails 4.1.4)?
my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
# ...
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8
So, there's not really any such thing as an "ASCII-8BIT" character. ASCII-8BIT to Ruby essentially means 'no encoding at all' -- just bytes, without assuming any encoding. It's a synonym for 'BINARY'.
But if you have bytes that aren't valid UTF-8, they can't really be encoded as UTF-8. Even if the encoding on the string were UTF-8, at best you'd get lots of InvalidEncoding errors when you tried to do something to it.
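You can see this in irb: retagging a string does not change its bytes, so invalid sequences stay invalid:
s = "\xEF".b                  # raw bytes, encoding ASCII-8BIT (BINARY)
s = s.force_encoding("UTF-8") # retags the string; the bytes are unchanged
s.valid_encoding?             # => false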
What encoding the string will end up tagged as depends on a complicated dance between ActiveRecord and your database itself -- also, the database itself can sometimes actually change your bytes, depending on the database and how it's set up and what you're doing. We could try to debug exactly what you are doing.
But really, the answer is -- if you want it to be UTF-8, it can't have binary non-UTF8 data in it. "ASCII-8BIT" actually is the right encoding for binary data. What are you actually trying to do, where do those weird bytes come from and why do you want them? In general, I'm not sure if it's legal to put arbitrary non-UTF8 bytes in JSON? It might be legal for JSON, but it will probably cause you headaches (such as the one you're dealing with), as it depends on what exactly both rails and your underlying DB are going to do with them.
Just to get around your display error, you could have your prettify method use scrub, added in Ruby 2.1.0 to eliminate bad bytes for the current encoding: value.force_encoding("UTF-8").scrub. That will probably get rid of your error, and may even do the right thing, but it would be better to figure out what is really going on: why you want those weird bytes in the first place, what they are supposed to mean, and for what purpose.
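A minimal sketch of that workaround, assuming the "a"/"c" hash shape from the question:
# deep_dup (from ActiveSupport) avoids mutating the stored hash in place.
def prettify
  cleaned = data.deep_dup
  cleaned["a"]["c"] = cleaned["a"]["c"].force_encoding("UTF-8").scrub("")
  JSON.pretty_generate(cleaned)
end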
I want to ensure that a binary field always has a value. I added validation code like below.
class Foo < ActiveRecord::Base
  validates :b, presence: true
end
However, the change seems to cause an error.
$ rails c
> Foo.create(b: File.read('b.jpg'))
ArgumentError: invalid byte sequence in UTF-8
The error doesn't always appear, only when the binary data contains non-ASCII bytes.
How can I validate the binary field?
I set up the environment like below. An image file (b.jpg, less than 16KB) is also needed.
$ rails --version
Rails 4.2.0
$ rails new test_binary --database=mysql
$ cd test_binary/
$ rails g model foo b:binary
$ rake db:create db:migrate
File.read returns a String tagged with your default external encoding, which is typically UTF-8. That means that this:
Foo.create(b: File.read('b.jpg'))
is really:
some_utf8_string = File.read('b.jpg')
Foo.create(b: some_utf8_string)
But a JPEG will rarely be a valid UTF-8 string so you're going to get that ArgumentError whenever someone tries to treat it as UTF-8.
You can specify an encoding when you read your JPEG:
Foo.create(b: File.read('b.jpg', encoding: 'binary'))
That should get past your encoding problem.
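Alternatively, File.binread reads the file in ASCII-8BIT (binary) encoding directly, which amounts to the same thing:
Foo.create(b: File.binread('b.jpg'))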
Our Rails 3 app needs to be able to accept foreign characters like ä and こ, and save them to our MySQL db, which has its character_set as 'utf8.'
One of our models runs a validation used to strip out all the non-word characters in its name before it is saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
  string.gsub!(/\W/, '')
end
This stripped out bad characters, but preserved things like 'ä', 'こ', and '3'. With Ruby 1.9's new encodings, however, that statement no longer works - it now removes those characters as well as the ones we don't want. I am trying to find a way around that.
Changing the gsub to something like this:
def strip_non_words(string)
  string.gsub!(/[[:punct:]]/, '')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
  string = Iconv.conv('LATIN1', 'UTF8', string)
  string.gsub!(/[[:punct:]]/, '')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know of a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. End result looks like this:
def strip_non_words(string)
  string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
  string_encoded.gsub!(/\P{Word}+/, '') # strip runs of non-word characters
  string_reencoded = string_encoded.force_encoding('ISO-8859-1')
  string_reencoded # return value
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/
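For what it's worth, on current Rubies you can avoid the encoding dance entirely; here is an alternative sketch (not the answer's original approach) that deletes everything that is neither a word character nor whitespace, keeping 'ä', 'こ' and digits intact:
def strip_non_words(string)
  # remove runs of characters that are neither word characters nor whitespace
  string.gsub(/[^\p{Word}\s]+/, '')
end
strip_non_words('ä こ 3 !@#') # => "ä こ 3 "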