How to validate a binary field using Ruby on Rails? - mysql

I want to ensure that a binary field always has a value. I added a validation like the one below.
class Foo < ActiveRecord::Base
validates :b, presence: true
end
However, the change seems to cause the error.
$ rails c
> Foo.create(b:File.read('b.jpg'))
ArgumentError: invalid byte sequence in UTF-8
The error doesn't always appear; it only occurs when the binary data contains non-ASCII bytes.
How can I validate the binary field?
I set up the environment as shown below. An image file (b.jpg, less than 16KB) is also needed.
$ rails --version
Rails 4.2.0
$ rails new test_binary --database=mysql
$ cd test_binary/
$ rails g model foo b:binary
$ rake db:create db:migrate

File.read returns a String that will claim to have UTF-8 encoding by default. That means that this:
Foo.create(b: File.read('b.jpg'))
is really:
some_utf8_string = File.read('b.jpg')
Foo.create(b: some_utf8_string)
But a JPEG will rarely be a valid UTF-8 string so you're going to get that ArgumentError whenever someone tries to treat it as UTF-8.
You can specify an encoding when you read your JPEG:
Foo.create(b: File.read('b.jpg', encoding: 'binary'))
That should get past your encoding problem.
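To see the difference between the two reads, here is a minimal sketch in plain Ruby. The temporary file and its four bytes are only a stand-in for b.jpg (an assumption, not the asker's real file). File.binread is a standard shortcut that reads the file in binary mode.

```ruby
require 'tempfile'

# Stand-in for b.jpg: four bytes that do not form valid UTF-8.
file = Tempfile.new(['b', '.jpg'])
file.binmode
file.write("\xFF\xD8\xFF\xE0".b)
file.close

as_text   = File.read(file.path, encoding: 'UTF-8')   # tagged UTF-8, but the bytes are not valid UTF-8
as_binary = File.read(file.path, encoding: 'binary')  # tagged ASCII-8BIT (alias BINARY)
via_bin   = File.binread(file.path)                   # same bytes, also tagged ASCII-8BIT

puts as_text.valid_encoding?    # false
puts as_binary.encoding
```

Any code that treats as_text as text can raise, while the binary-tagged strings are passed around as plain bytes.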

Related

Encoding::UndefinedConversionError dump to json

I'm working with a gem called fie and ran into an issue that I would like to solve, but I'm having trouble doing so. Fie is a gem for Rails. It has some lines where it stores a Marshal dump of an ActiveRecord::Base object in JSON, and that is where I'm running into an encoding error. I have been able to replicate this across different machines and versions of Rails, all of them Rails 5.2 or greater.
Easiest way to reproduce is:
[5] pry(main)> Marshal.dump(User.first).to_json
User Load (29.8ms) SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1 [["LIMIT", 1]]
Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8
from /home/chris/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/activesupport-5.2.1/lib/active_support/core_ext/object/json.rb:38:in `encode'
Digging in, I tried a few things but was unable to make it work. It seems that a Marshal dump is ASCII-8BIT but JSON wants UTF-8. I was unable to force the encoding.
> User.first.to_json.encoding
=> #<Encoding:UTF-8>
> Marshal.dump(User.first).encoding
=> #<Encoding:ASCII-8BIT>
> { foo: Marshal.dump(object).force_encoding("ASCII-8BIT").encode("UTF-8") }.to_json
Encoding::UndefinedConversionError: "\x80" from ASCII-8BIT to UTF-8
from (pry):139:in `encode'
> { foo: Marshal.dump(object).force_encoding("ISO-8859-1").encode("ASCII-8BIT") }.to_json
Encoding::UndefinedConversionError: U+0080 to ASCII-8BIT in conversion from ISO-8859-1 to UTF-8 to ASCII-8BIT
ruby 2.5.1
Rails 5.2.1
git issue I opened
I had this issue and fixed it by using:
Marshal.dump(value).force_encoding("ISO-8859-1").encode("UTF-8")
I hope this helps!
But as Tom Lord suggested you should be a bit more specific with your question to help us know what you are trying to achieve.
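As a rough sketch of why the ISO-8859-1 trick works: every byte value maps to exactly one ISO-8859-1 character, so the bytes can be transcoded to UTF-8 and recovered later. The hash below is a hypothetical stand-in for the ActiveRecord object, and the Base64 variant is an alternative that is not part of the answer above.

```ruby
require 'json'
require 'base64'

value  = { 'name' => 'chrisä', 'id' => 1 }  # hypothetical stand-in for User.first
dumped = Marshal.dump(value)                # tagged ASCII-8BIT

# Option 1 (from the answer): reinterpret the bytes as ISO-8859-1, then transcode to UTF-8.
as_utf8 = dumped.dup.force_encoding('ISO-8859-1').encode('UTF-8')
json    = { foo: as_utf8 }.to_json

# Reverse the steps to get the original bytes back.
bytes    = JSON.parse(json)['foo'].encode('ISO-8859-1').force_encoding('ASCII-8BIT')
restored = Marshal.load(bytes)

# Option 2 (assumption, not in the answer): Base64 keeps the payload plain ASCII.
json64     = { foo: Base64.strict_encode64(dumped) }.to_json
restored64 = Marshal.load(Base64.strict_decode64(JSON.parse(json64)['foo']))
```

Either way, whoever reads the JSON has to know which reversal to apply before calling Marshal.load.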

Is ActiveRecord changing the encoding on my serialized hash

I have a Rails application which accepts JSON data from third-party sources, and I believe I am running up against some ActiveRecord behind-the-scenes magic which is recognizing ASCII-8BIT characters in hashes extracted from the JSON and saving them as such to my database no matter what I do.
Here is a simplified description of the class ...
class MyClass < ActiveRecord::Base
serialize :data
end
and of how an object is created ...
a = MyClass.new
a.data = {
"a" =>
{
"b" => "bb",
"c" => "GIF89a\x01\x00\x01\x00\x00\x00\x00\x00!\vNETSCAPE2.0\x03\x01\x00\x00!\x04\t\x00\x00\x01\x00,\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;"
}
}
I believe those are ASCII-8BIT characters, so fair enough if they are saved as such (despite my attempts to UTF8 everything everywhere). But I need these characters to be UTF-8, because when I go to view them I get:
ActionView::Template::Error ("\xEF" from ASCII-8BIT to UTF-8):
64: <div>
65: <pre><%= mc.prettify %></pre>
66: </div>
app/models/my_class.rb:28:in `prettify'
where line #28 in prettify is:
JSON.pretty_generate(self.data)
So I sought to re-encode any string in the Hash. I built out functionality to do this (with anonymous classes and refinements to Hash, Array and String), but no matter what, ASCII-8BIT is returned to me. In the simplest terms here is what is happening:
mc = MyClass.find(123)
mc.data['a']['c'].encode!(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
mc.data['a']['c'].encoding #=> #<Encoding:UTF-8>
mc.data['a']['c'].valid_encoding? #=> true
mc.save!
mc.reload
mc.data['a']['c'].encoding #=> #<Encoding:ASCII-8BIT> <-- !!!!!
What is ActiveRecord doing to this hash when it saves it? And what can I do to store a hash permanently with all strings encoded to UTF-8 in a serialized MySQL (v5.6, via the mysql2 gem) mediumtext column (using Ruby 2.2.4 and Rails 4.1.4)?
my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
# ...
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
init-connect='SET NAMES utf8mb4'
character-set-server=utf8
So, there's not really such thing as an "ASCII-8BIT" character. ASCII-8BIT to ruby essentially means 'no encoding at all' -- just bytes, without assuming any encoding. It's a synonym for 'BINARY'.
But if you have bytes that aren't valid UTF-8, they can't really be encoded as UTF-8. Even if the string were tagged as UTF-8, at best you'd get lots of "invalid byte sequence" errors when you tried to do something with it.
What encoding the string will end up tagged as depends on a complicated dance between ActiveRecord and your database itself -- also, the database itself can sometimes actually change your bytes, depending on the database and how it's set up and what you're doing. We could try to debug exactly what you are doing.
But really, the answer is -- if you want it to be UTF-8, it can't have binary non-UTF8 data in it. "ASCII-8BIT" actually is the right encoding for binary data. What are you actually trying to do, where do those weird bytes come from and why do you want them? In general, I'm not sure if it's legal to put arbitrary non-UTF8 bytes in JSON? It might be legal for JSON, but it will probably cause you headaches (such as the one you're dealing with), as it depends on what exactly both rails and your underlying DB are going to do with them.
Just to get around your display error, you could have your prettify method use scrub (added in Ruby 2.1.0) to eliminate bytes that are invalid in the current encoding: value.force_encoding("UTF-8").scrub. That will probably get rid of your error, and will perhaps do the right thing, but it would be better to figure out what is really going on: why you want those weird bytes in the first place, and what they are supposed to mean.
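A minimal sketch of that scrub approach applied to a nested hash follows. The helper name deep_scrub and the shortened sample data are assumptions for illustration, not part of the asker's model.

```ruby
require 'json'

# Hypothetical helper: returns a copy of the structure in which every string
# is tagged UTF-8 and invalid bytes are replaced with U+FFFD.
def deep_scrub(value)
  case value
  when String then value.dup.force_encoding('UTF-8').scrub
  when Hash   then value.transform_values { |v| deep_scrub(v) }
  when Array  then value.map { |v| deep_scrub(v) }
  else value
  end
end

data  = { 'a' => { 'b' => 'bb', 'c' => "GIF89a\xEF;".b } }  # 'c' is tagged ASCII-8BIT
clean = deep_scrub(data)

puts JSON.pretty_generate(clean)
```

This only makes the value displayable. The replaced bytes are lost, so it is not a way to store the original binary data.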

Ruby on Rails: Allow less than sign '<' inside code block with sanitize helper

I'm trying to escape user-generated content in Rails. I have used the raw and sanitize helpers to filter content like this:
raw(sanitize(code, :tags => ['<', 'h2','h3','p','br','ul','ol','li','code','pre','a'] ))
The list of tags mentioned are allowed in the content.
The problem is when I try to test it with a sql query like this:
mysql -u sat -p -h localhost database < data.sql
inside pre and code blocks it removes everything after the less than (<) sign.
Please help me figure out a way to do this.
I don't believe this is possible using the default sanitize method within Rails.
Instead try using the Sanitize gem (https://github.com/rgrove/sanitize)
require 'sanitize'
allowed_elements = ['h2','h3','p','br','ul','ol','li','code','pre','a']
code = "<pre>mysql -u sat -p -h localhost database < data.sql</pre>"
Sanitize.fragment(code, elements: allowed_elements)
# => <pre>mysql -u sat -p -h localhost database &lt; data.sql</pre>
To save sanitized content to the database, add a before_save callback to your model that runs sanitize on the user-generated content and stores the result, e.g.
class MyModel < ActiveRecord::Base
ALLOWED_ELEMENTS = ['h2','h3','p','br','ul','ol','li','code','pre','a']
before_save :sanitize_code
private
def sanitize_code
self.code = Sanitize.fragment(code, elements: ALLOWED_ELEMENTS)
end
end
When you output the content you just need to use the raw view helper e.g.
<%= raw @instance.code %>
Rails 3 added an html_safe flag to every String instance. Every string that is rendered in a view will be escaped unless it is marked html_safe (simplified). What raw does is mark the string as html_safe. So you should only pass it a string that is already safe/escaped.
A possible solution could look something like this:
strip_tags(code).html_safe
You might have to add additional checks / string replacements depending on your use case.
According to your comment, you probably need a slightly more complex version. You could replace every character that you want to allow with a placeholder, sanitize the string, and then reverse the replacement, so that the sanitize method does not strip more than you actually want. Try something like this:
code = "mysql -u sat -p -h localhost database < data.sql"
ALLOWED_SIGNS = {
:lower_than => "<".html_safe
}
s = code.dup
ALLOWED_SIGNS.each { |k, v| s.sub!(v, "%{#{k}}") }
sanitize(s) % ALLOWED_SIGNS
It seems the whole issue was with the way the data was being stored in the database. Previously, a less-than sign '<' was saved as is, but now it is escaped, so a '<' is saved as &lt;, which seems to have solved the problem.
I discovered that by accident while using the tinymce-rails WYSIWYG editor, which was escaping the '<' automatically.
@kieran-johnson's answer might have done the same, but tinymce-rails solved it without installing an extra gem.
Thank you all of you who took out time to help.
This might help: the Rails sanitizers accept a whitelist of tags and attributes that should be left alone during sanitization.
ActionView::Base.full_sanitizer.sanitize(html_string) # basic syntax, strips all tags
A whitelist of tags and attributes can be specified as below (full_sanitizer ignores these options, so use the safe-list sanitizer, called white_list_sanitizer in older Rails versions):
ActionView::Base.safe_list_sanitizer.sanitize(html_string, :tags => %w(img br p), :attributes => %w(src style))
The statement above allows the tags img, br and p and the attributes src and style.
nokogiri gem solves the problem:
gem 'nokogiri'
Nokogiri::HTML::DocumentFragment.parse('<b>hi</b> x > 5').text
=> "hi x > 5"
Consider replacing "<" with its numeric character reference &#60; before running it through the sanitize method. It should get converted into &lt; and then render as a "<" character instead of being treated as HTML.
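The escaping idea that several of these answers converge on can be sketched with Ruby's standard library alone. ERB::Util.html_escape is the function behind the h view helper. The variable names below are illustrative.

```ruby
require 'erb'

code = 'mysql -u sat -p -h localhost database < data.sql'

# Escape the user-supplied text first, so "<" can no longer start a tag.
escaped = ERB::Util.html_escape(code)

# Then wrap it in the markup you control.
html = "<pre><code>#{escaped}</code></pre>"

puts html
```

Because the less-than sign is already stored as &lt;, a sanitizer that runs afterwards sees plain text inside the pre and code elements rather than the start of an unknown tag.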

Ruby 1.9, MySQL character encoding issue

Our Rails 3 app needs to be able to accept foreign characters like ä and こ, and save them to our MySQL db, which has its character_set as 'utf8.'
One of our models runs a validation which is used to strip out all the non-word characters in its name, before being saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
string.gsub!(/\W/,'')
end
This stripped out bad characters, but preserved things like 'ä', 'こ', and '3.' With Ruby 1.9's new encodings, however, that statement no longer works - it is now removing those characters as well as the others we don't want. I am trying to find a way to do that.
Changing the gsub to something like this:
def strip_non_words(string)
string.gsub!(/[[:punct:]]/,'')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
Iconv.conv('LATIN1', 'UTF8', string)
string.gsub!(/[[:punct:]]/,'')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know of a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. End result looks like this:
def strip_non_words(string)
string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
string_encoded.gsub!(/\p{Word}+/, '') # non-word characters
string_reencoded = string_encoded.force_encoding('ISO-8859-1')
string_reencoded #return
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/
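As a hedged alternative sketch (not the fix described above): in Ruby 1.9 and later, \p{Word} is Unicode-aware when the string is tagged UTF-8, so a negated class can strip non-word characters without switching encodings. Note that \p{Word} on its own matches word characters, which is why the class is negated here. This version assumes the input is valid UTF-8 and returns a new string instead of mutating its argument.

```ruby
# Hypothetical rewrite of strip_non_words for UTF-8 input.
def strip_non_words(string)
  # [^\p{Word}] matches anything that is not a letter, mark, digit or underscore.
  string.gsub(/[^\p{Word}]/, '')
end

puts strip_non_words('こäè 3.!?')  # prints "こäè3"
```

The "Illegal mix of collations" error in the question comes from the database side (a latin1 column or connection meeting utf8 data), so the column and connection character sets would still need to agree.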

Why doesn't ruby detect an invalid encoding while mysql does?

I'm pulling some RSS feeds in from YouTube which have invalid UTF8. I can create a similar ruby string using
bad_utf8 = "\u{61B36}"
bad_utf8.encoding # => #<Encoding:UTF-8>
bad_utf8.valid_encoding? # => true
Ruby thinks this is a valid UTF-8 encoding and I'm pretty sure it isn't.
When talking to Mysql I get an error like so
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("use test");
bad_utf8 = "\u{61B36}"
client.query("INSERT INTO utf8 VALUES ('#{bad_utf8}')")
# Incorrect string value: '\xF1\xA1\xAC\xB6' for column 'string' at row 1 (Mysql2::Error)
How can I detect or fix up these invalid types of encodings before I send them off to MySQL?
I don't rely on Ruby's built-in String#valid_encoding?, because the following is also possible:
irb
1.9.3-p125 :001 > bad_utf8 = "\u{0}"
=> "\u0000"
1.9.3-p125 :002 > bad_utf8.valid_encoding?
=> true
1.9.3-p125 :003 > bad_utf8.encoding
=> #<Encoding:UTF-8>
This is valid UTF-8 (Reference: https://en.wikipedia.org/wiki/Utf8), but I have found that the presence of the NULL character in a string is often a hint of an earlier conversion error (e.g. when transcoding based on invalid encoding information found in HTML pages).
I created my own validation function for "Modified UTF-8", which can take a :bmp_only option for restricting validation to the Basic Multilingual Plane (0x1-0xffff). This should be enough for most modern languages (Reference: https://en.wikipedia.org/wiki/Unicode_plane).
Find the validator here: https://gist.github.com/2295531
Possibly because the code point doesn't lie in the Basic Multilingual Plane, which contains the only characters that MySQL allows in its "utf8" character set.
Newer versions of MySQL have another character set called "utf8mb4", which supports Unicode characters outside the BMP.
But you probably don't want to be using that. Consider your use cases carefully. Few real human languages (if any) use characters outside the BMP.
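If the column has to stay on MySQL's "utf8" character set, one way to catch such strings before the INSERT is to look for code points above U+FFFF. This is a minimal sketch. The helper names are hypothetical, and it assumes the input is already valid UTF-8.

```ruby
# Matches any code point outside the Basic Multilingual Plane.
OUTSIDE_BMP = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: true if the string needs 4-byte UTF-8 sequences.
def outside_bmp?(string)
  string.match?(OUTSIDE_BMP)
end

# Hypothetical helper: drops the characters MySQL's "utf8" would reject.
def strip_outside_bmp(string)
  string.gsub(OUTSIDE_BMP, '')
end

bad_utf8 = "\u{61B36}"
puts outside_bmp?(bad_utf8)               # true
puts strip_outside_bmp("a#{bad_utf8}b")   # ab
```

Stripping loses data, so rejecting the record or switching the column to utf8mb4 may fit better depending on the use case.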