I'm working with the Fie gem and ran into an issue that I'd like to solve, but I'm having trouble doing so. Fie is a gem for Rails. It has some lines where it stores a Marshal dump of an ActiveRecord::Base object in JSON, but I'm running into an encoding error. I have been able to replicate this across different machines and versions of Ruby on Rails, though all on Rails 5.2 and greater.
Easiest way to reproduce is:
[5] pry(main)> Marshal.dump(User.first).to_json
User Load (29.8ms) SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1 [["LIMIT", 1]]
Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8
from /home/chris/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/activesupport-5.2.1/lib/active_support/core_ext/object/json.rb:38:in `encode'
Digging in, I tried a few things but was unable to make it work. It seems that a Marshal dump is ASCII-8BIT but JSON wants UTF-8. I was unable to force the encoding.
> User.first.to_json.encoding
=> #<Encoding:UTF-8>
> Marshal.dump(User.first).encoding
=> #<Encoding:ASCII-8BIT>
> { foo: Marshal.dump(object).force_encoding("ASCII-8BIT").encode("UTF-8") }.to_json
Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8
from (pry):139:in `encode'
> { foo: Marshal.dump(object).force_encoding("ISO-8859-1").encode("ASCII-8BIT") }.to_json
Encoding::UndefinedConversionError: U+0080 to ASCII-8BIT in conversion from ISO-8859-1 to UTF-8 to ASCII-8BIT
ruby 2.5.1
Rails 5.2.1
GitHub issue I opened
I had this issue and fixed it by using:
Marshal.dump(value).force_encoding("ISO-8859-1").encode("UTF-8")
I hope this helps!
But as Tom Lord suggested, you should be a bit more specific with your question to help us know what you are trying to achieve.
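For completeness, here is a round-trip sketch of that fix (user stands in for any ActiveRecord object). It works because ISO-8859-1 maps every byte 0x00-0xFF to a code point, so the conversion to UTF-8 can never fail, and reversing it recovers the exact original bytes:
require 'json'

# Make the binary dump JSON-safe.
dumped  = Marshal.dump(user).force_encoding("ISO-8859-1").encode("UTF-8")
payload = { state: dumped }.to_json

# Reverse the transformation before loading.
bytes    = JSON.parse(payload)["state"].encode("ISO-8859-1").force_encoding("ASCII-8BIT")
restored = Marshal.load(bytes)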
Related
I have a Rails application which accepts JSON data from third-party sources, and I believe I am running up against some ActiveRecord behind-the-scenes magic which is recognizing ASCII-8BIT characters in hashes extracted from the JSON and saving them as such to my database no matter what I do.
Here is a simplified description of the class ...
class MyClass < ActiveRecord::Base
serialize :data
end
and of how an object is created ...
a = MyClass.new
a.data = {
"a" =>
{
"b" => "bb",
"c" => "GIF89a\x01\x00\x01\x00\x00\x00\x00\x00!\vNETSCAPE2.0\x03\x01\x00\x00!\x04\t\x00\x00\x01\x00,\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;"
}
}
I believe those are ASCII-8BIT characters, so fair enough if they are saved as such (despite my attempts to UTF8 everything everywhere). But I need these characters to be UTF-8, because when I go to view them I get:
ActionView::Template::Error ("\xEF" from ASCII-8BIT to UTF-8):
64: <div>
65: <pre><%= mc.prettify %></pre>
66: </div>
app/models/my_class.rb:28:in `prettify'
where line #28 in prettify is:
JSON.pretty_generate(self.data)
So I sought to re-encode any string in the Hash. I built out functionality to do this (with anonymous classes and refinements to Hash, Array and String), but no matter what, ASCII-8BIT is returned to me. In the simplest terms here is what is happening:
mc = MyClass.find(123)
mc.data['a']['c'].encode!(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
mc.data['a']['c'].encoding #=> #<Encoding:UTF-8>
mc.data['a']['c'].valid_encoding? #=> true
mc.save!
mc.reload
mc.data['a']['c'].encoding #=> #<Encoding:ASCII-8BIT> <-- !!!!!
What is ActiveRecord doing to this hash when it saves it? And what can I do to store a hash permanently with all strings encoded to UTF-8 in a serialized MySQL (v5.6, via the mysql2 gem) mediumtext column (using Ruby 2.2.4 and Rails 4.1.4)?
my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
# ...
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8
So, there's not really any such thing as an "ASCII-8BIT" character. ASCII-8BIT to Ruby essentially means 'no encoding at all' -- just bytes, without assuming any encoding. It's a synonym for 'BINARY'.
But if you have bytes that aren't valid UTF-8, they can't really be encoded as UTF-8. Even if the string were tagged as UTF-8, at best you'd get lots of invalid-byte-sequence errors when you tried to do something with it.
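A minimal illustration of that point (the \x80 byte stands in for any byte sequence that is invalid in UTF-8):
bytes  = "\x80".b                       # ASCII-8BIT ("binary") string
tagged = bytes.force_encoding("UTF-8")  # retags the string, converts nothing
tagged.valid_encoding?                  # => false
tagged.encode("UTF-16")                 # raises Encoding::InvalidByteSequenceError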
What encoding the string ends up tagged with depends on a complicated dance between ActiveRecord and your database itself -- and the database can sometimes actually change your bytes, depending on which database you use, how it's set up, and what you're doing. We could try to debug exactly what you are doing.
But really, the answer is -- if you want it to be UTF-8, it can't have binary non-UTF-8 data in it. "ASCII-8BIT" actually is the right encoding for binary data. What are you actually trying to do? Where do those weird bytes come from, and why do you want them? Note that JSON strings are required to be valid Unicode text, so arbitrary non-UTF-8 bytes aren't legal in JSON, and they will cause you headaches (such as the one you're dealing with), since what happens to them depends on exactly what both Rails and your underlying DB do with them.
Just to get around your display error, you could have your prettify method use scrub, added in Ruby 2.1.0 to eliminate bad bytes for the current encoding: value.force_encoding("UTF-8").scrub. That will probably get rid of your error and may even do the right thing, but it would be better to figure out what is really going on: why you have those weird bytes in the first place, and what they are supposed to mean and for what purpose.
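A minimal sketch of that workaround, assuming prettify lives on MyClass, data is the serialized hash from above, and scrub_strings is a hypothetical helper:
def prettify
  JSON.pretty_generate(scrub_strings(data))
end

private

# Recursively scrub every String in a nested Hash/Array structure.
# String#scrub (Ruby 2.1+) replaces bytes that are invalid for the
# string's claimed encoding with the Unicode replacement character.
def scrub_strings(value)
  case value
  when String then value.dup.force_encoding("UTF-8").scrub
  when Hash   then value.map { |k, v| [k, scrub_strings(v)] }.to_h
  when Array  then value.map { |v| scrub_strings(v) }
  else value
  end
end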
I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the name of the input file, which should then consist of ASCII-only characters. The Ruby code to convert the file could look like this:
#!/usr/bin/env ruby
require 'htmlentities'

xml = File.read("file.xml")
converted_xml = HTMLEntities.new.decode(xml)
IO.write "decoded_file.xml", converted_xml
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
I want to ensure that a binary field has always value. I added a validation code like below.
class Foo < ActiveRecord::Base
validates :b, presence: true
end
However, this change seems to cause an error.
$ rails c
> Foo.create(b:File.read('b.jpg'))
ArgumentError: invalid byte sequence in UTF-8
The error doesn't always appear; it occurs only when the binary data contains non-ASCII bytes.
How can I validate the binary field?
I set up the environment like below. An image file (b.jpg, less than 16KB) is also needed.
$ rails --version
Rails 4.2.0
$ rails new test_binary --database=mysql
$ cd test_binary/
$ rails g model foo b:binary
$ rake db:create db:migrate
File.read returns a String tagged with your default external encoding, which will usually be UTF-8. That means that this:
Foo.create(b: File.read('b.jpg'))
is really:
some_utf8_string = File.read('b.jpg')
Foo.create(b: some_utf8_string)
But a JPEG will rarely be a valid UTF-8 string, so you're going to get that ArgumentError whenever someone tries to treat it as UTF-8.
You can specify an encoding when you read your JPEG:
Foo.create(b: File.read('b.jpg', encoding: 'binary'))
That should get past your encoding problem.
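Equivalently, as a small sketch: File.binread always returns an ASCII-8BIT string, which is what a binary column expects.
Foo.create(b: File.binread('b.jpg'))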
In Ruby, I have a bunch of strings encoded in UTF-8, e.g.: "HEC Montr\u00e9al".
When I insert it into my MySQL table (with the utf8_general_ci collation) using the 'mysql' gem, the backslash is removed. What gives? Does anyone have any idea what the heck is going on here?
edit:
example string:
>> p mystring
"HEC Montr\\u00e9al"
and in the database after insert:
HEC Montru00e9al
This is not UTF-8:
'HEC Montr\u00e9al'
That's an ASCII representation of a JSON-encoded Unicode string. If it were UTF-8, it would look like:
'HEC Montréal'
You're not properly decoding your JSON input somewhere, or your client-side code is sending your server JSON when your server is expecting plain text.
First figure out why you're getting JSON-encoded strings when you're not expecting them, or why you're not properly decoding your JSON. Then you can see if the database is mangling your UTF-8.
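A minimal sketch of decoding such a string before it reaches the database (mystring holds the escaped text from the question):
require 'json'

mystring = 'HEC Montr\u00e9al'             # single quotes: a literal backslash-u sequence
decoded  = JSON.parse(%(["#{mystring}"])).first
decoded                                     # => "HEC Montréal"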
I believe you have to explicitly tell the MySQL gem to expect utf8. Something like this:
db = Mysql.init
db.options(Mysql::SET_CHARSET_NAME, 'utf8')
db.real_connect(...
I'm pulling some RSS feeds in from YouTube which have invalid UTF8. I can create a similar ruby string using
bad_utf8 = "\u{61B36}"
bad_utf8.encoding # => #<Encoding:UTF-8>
bad_utf8.valid_encoding? # => true
Ruby thinks this is a valid UTF-8 encoding and I'm pretty sure it isn't.
When talking to Mysql I get an error like so
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("use test");
bad_utf8 = "\u{61B36}"
client.query("INSERT INTO utf8 VALUES ('#{moo}')")
# Incorrect string value: '\xF1\xA1\xAC\xB6' for column 'string' at row 1 (Mysql2::Error)
How can I detect or fix up these invalid types of encodings before I send them off to MySQL?
I don't rely on Ruby's built-in String#valid_encoding?, because the following is also possible:
irb
1.9.3-p125 :001 > bad_utf8 = "\u{0}"
=> "\u0000"
1.9.3-p125 :002 > bad_utf8.valid_encoding?
=> true
1.9.3-p125 :003 > bad_utf8.encoding
=> #<Encoding:UTF-8>
This is valid UTF-8 (reference: https://en.wikipedia.org/wiki/Utf8), but I have found that the presence of the NUL character in a string is often a hint of a previous conversion error (e.g. when transcoding based on invalid encoding information found in HTML pages).
I created my own validation function for "Modified UTF-8", which can take a :bmp_only option for restricting validation to the Basic Multilingual Plane (0x1-0xffff). This should be enough for most modern languages (Reference: https://en.wikipedia.org/wiki/Unicode_plane).
Find the validator here: https://gist.github.com/2295531
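A compact sketch of the same idea (strict_valid_utf8? is a hypothetical name; the gist adds the :bmp_only option on top of this):
# Valid UTF-8 and free of NUL characters, which often indicate an
# earlier conversion error.
def strict_valid_utf8?(str)
  str.encoding == Encoding::UTF_8 && str.valid_encoding? && !str.include?("\u0000")
end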
Possibly because the code point doesn't lie in the Basic Multilingual Plane, which contains the only characters MySQL allows in its "utf8" character set.
Newer versions of MySQL have another character set called "utf8mb4" which supports Unicode characters outside the BMP.
But you probably don't want to be using that. Consider your use cases carefully. Few real human languages (if any) use characters outside the BMP.
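If you do stay on the 3-byte "utf8" charset, here is a small sketch for detecting or stripping non-BMP characters before the INSERT (both helper names are hypothetical):
# True if every code point fits in MySQL's 3-byte "utf8" charset (the BMP).
def bmp_only?(str)
  str.each_codepoint.all? { |cp| cp <= 0xFFFF }
end

# Drop any character outside the BMP.
def strip_non_bmp(str)
  str.each_char.select { |c| c.ord <= 0xFFFF }.join
end

strip_non_bmp("\u{61B36}ok")  # => "ok"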