Rails - convert ASCII to characters - MySQL

I'm using Rails 5 to show database content in a web browser.
In the db, all of the special characters are written in their ASCII form. For instance, instead of an apostrophe, it's written as &#39;.
Thus, my view is showing the ASCII code. Is there a way to convert them all to characters for the view?

To transform ANY string containing HTML character entities, using Rails:
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
The CGI module is in the Ruby standard library and is required by Rails by default. If you want to do the same in a non-Rails project:
require 'cgi'
CGI.unescape_html "It doesn&#39;t look right"
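For reference, the escaping direction is CGI.escape_html, so the two calls round-trip (a quick sketch, using the sample string from above):
require 'cgi'
CGI.escape_html("It doesn't look right")       # => "It doesn&#39;t look right"
CGI.unescape_html("It doesn&#39;t look right") # => "It doesn't look right"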

Based on your example, here's a simple Ruby solution if you want to define your own helper:
39.chr # => "'"
'&#39;'.delete('&#;').to_i.chr # => "'"
module ApplicationHelper
  def ascii_to_char(ascii)
    ascii.delete('&#;').to_i.chr
  end
end
# in the views
ascii_to_char('&#39;') # => "'"
If what you really need is full HTML entity decoding, see @forsym's answer.

Characters were fed through some "html entities" conversion before storing into the database. Go back and fix that.
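If you do go back and fix the stored data, a one-off repair in Ruby might look like the sketch below; Post and body are hypothetical names for the affected model and column:
require 'cgi'

# Decode the entities that were stored by mistake, skipping callbacks.
Post.find_each do |post|
  post.update_columns(body: CGI.unescape_html(post.body))
end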

Related

How to escape a string for use in XML/HTML document in Scala?

Specifically, what's the easiest and most idiomatic way to replace special XML characters in a string. E.g., what's the easiest and most idiomatic way to convert <Jack & Jill> to &lt;Jack &amp; Jill&gt;.
Turns out there is an easy way to do this (despite a quick web-search not revealing an obvious solution): just use method xml.Utility.escape.
If you are looking to escape the # sign in a scala.html file, for instance, using Scala/Play, do ##.
And give the person who provided this answer here How to print # symbol in HTML with play framework (scala) an upvote.
I came across this page while looking for the above answer, so I just wanted to duplicate it here since other people may end up here as well.
I propose using the function org.apache.commons.text.StringEscapeUtils.escapeHtml4, added to build.sbt via libraryDependencies += "org.apache.commons" % "commons-text" % "1.9"
(xml.Utility.escape does not escape e.g. « as &laquo;)

What is the best value for "Unit Separator" in XML?

I used the Unit Separator (US/0x1F) in the database. When I export to an XML 1.0 file, it is not accepted and leaves the attribute with an empty value.
I have data in database like this:
"option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"
I'm expecting the exported XML 1.0 file to look like this:
<elementname, attr1="option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"/>
However, the [US] is not accepted by XML 1.0. Any suggestions?
I can replace '\37' (oct 37, hex 1f) with something like "XXX", "$", "(0x1f)"... before writing to XML;
I can replace it when importing from XML and write to the database. However, if I replace it with "&#x1F;", which is the HTML entity for Unit Separator, I end up with "&amp;#x1F;", which is definitely not what I wanted.
If I manually modify the XML file to "&#x1F;", I cannot use MSXML to load it; it gives the error "Invalid Unicode Character".
Any suggestions?
Thank you
Summary:
Let's make an analogy with how a compiler works: there are two phases, "Pre-compile" and "Compile".
For XML file generation, it acts like the "Compile" phase, e.g. converting "<" to "&lt;".
However, the Unit Separator is not supported by XML 1.0, so the "Compile" phase will not convert it to the HTML entity "&#x1F;".
So we have to seek solution in the "Pre-Compile" phase, which is our own application's responsibility.
When writing:
Option1: <unit>aaa</unit><unit>bbb</unit>
Option2: simply use "_x241F_" to replace "\37" in the string if "_x241F_" is not conflicting with any existing token in the string.
When reading:
According to Option1: Load the elements, concatenate to a single string with "\37" as separator.
According to Option2: simply use "\37" to replace "_x241F_".
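A minimal Ruby sketch of the Option2 round trip (the "_x241F_" token is the one proposed above; this assumes the token never occurs in the data):
US_SEPARATOR = "\x1F"
TOKEN = '_x241F_'

# Before writing to XML: hide the separator behind a token XML 1.0 accepts.
def encode_units(value)
  value.gsub(US_SEPARATOR, TOKEN)
end

# After reading from XML: restore the original separator.
def decode_units(value)
  value.gsub(TOKEN, US_SEPARATOR)
end

encode_units("aaa\x1Fbbb\x1Fccc") # => "aaa_x241F_bbb_x241F_ccc"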
I've also found out that MSXML (even the highest version, MSXML6.dll) will not load XML 1.1.
So if we are unfortunately using MSXML, we have to write our own "Pre-Compile" code to handle the Unicode characters before feeding the "Compile" phase.
Note: I borrowed the idea of "_x241F_" from here.
Thanks for everyone's help
There is no HTML entity for U+001F UNIT SEPARATOR. Besides, HTML entities would be irrelevant when dealing with generic XML.
The character references would be &#31; and &#x1F;, in HTML and in XML, but the character is not allowed in HTML or in XML. For XML 1.0, which this seems to be about, please refer to section 2.2 Characters, where the normative definition is the following production (the associated comment is misleading, and comments are non-normative):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
The conclusions to be drawn depend on the meaning and purpose of UNIT SEPARATOR in the text. It has no generally defined meaning; it is up to applications to assign a meaning to it and process it accordingly.
Usually UNIT SEPARATOR is used to separate units of some kind, so the natural approach would be to process the incoming data so that instead of such separators, the data, when converted to XML format, has units denoted by markup. So for data like aaa[US]bbb[US]ccc where [US] is UNIT SEPARATOR, you would generate something like <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>.
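For instance, a Ruby sketch of that transformation (units_to_xml is my own illustrative helper; CGI.escapeHTML handles the markup escaping):
require 'cgi'

def units_to_xml(data, separator = "\x1F")
  data.split(separator)
      .map { |unit| "<unit>#{CGI.escapeHTML(unit)}</unit>" }
      .join
end

units_to_xml("aaa\x1Fbbb\x1Fccc")
# => "<unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>"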
This website
http://www.fileformat.info/info/unicode/char/1f/index.htm
suggests one of the following:
HTML Entity (decimal): &#31;
HTML Entity (hex): &#x1f;

Ruby 1.9, MySQL character encoding issue

Our Rails 3 app needs to be able to accept foreign characters like ä and こ, and save them to our MySQL db, which has its character_set as 'utf8.'
One of our models runs a validation which is used to strip out all the non-word characters in its name, before being saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
  string.gsub!(/\W/,'')
end
This stripped out bad characters, but preserved things like 'ä', 'こ', and '3.' With Ruby 1.9's new encodings, however, that statement no longer works - it is now removing those characters as well as the others we don't want. I am trying to find a way to do that.
Changing the gsub to something like this:
def strip_non_words(string)
  string.gsub!(/[[:punct:]]/,'')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
  string = Iconv.conv('LATIN1', 'UTF8', string)
  string.gsub!(/[[:punct:]]/,'')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know of a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. End result looks like this:
def strip_non_words(string)
  string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
  string_encoded.gsub!(/[[:punct:]]/, '') # strip ASCII punctuation; multibyte sequences pass through untouched
  string_reencoded = string_encoded.force_encoding('ISO-8859-1')
  string_reencoded # return
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/
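For what it's worth, on current Ruby the ASCII detour is usually unnecessary: POSIX classes like [[:punct:]] are Unicode-aware on UTF-8 strings, so the stripping can be done directly (a sketch, not the approach above):
"こäè3. foo!?".gsub(/[[:punct:]]/, '') # => "こäè3 foo"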

Is JSON safe to use as a command line argument or does it need to be sanitized first?

Is the following dangerous?
$ myscript '<somejsoncreatedfromuserdata>'
If so, what can I do to make it not dangerous?
I realize that this can depend on the shell, OS, utility used for making system calls (if being done inside a programming language), etc. However, I'd just like to know what kind of things I should watch out for.
Yes. That is dangerous.
JSON can include single quotes in string values (they do not need to be escaped). See the railroad diagrams ("the tracks") at json.org.
Imagine the data is:
{"pwned": "you' & kill world;"}
Happy coding.
I would consider piping the data into the program in question (e.g. use "popen" or even a version of "exec" that passes arguments directly); this can avoid issues that result from passing through the shell. Just as with SQL: using placeholders eliminates the need to trifle with "escaping".
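In Ruby terms (to match the rest of this thread), both suggestions might look like the sketch below; myscript stands in for the program from the question:
require 'open3'

json = %q<{"pwned": "you' & kill world;"}>

# Pass as a direct argv entry: no shell is involved, so no quoting issues.
output, status = Open3.capture2('myscript', json)

# Or pipe the JSON in on stdin instead of the command line.
output, status = Open3.capture2('myscript', stdin_data: json)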
If passing through a shell is the only way, then this may be an option (it is not tested, but something similar holds for a "<script>" context):
For every character in the JSON that is either outside the range of space to "~" in ASCII, or has a special meaning in the single-quoted ('') context of the shell, such as \ and ' (but excluding " or any other character, such as digits, that can appear outside of "string" data, which is a limitation of this trivial approach), encode the character using the \uXXXX JSON form. (Per the limitations defined above, this should only encode potentially harmful characters appearing within the "strings" in the JSON, and there should be no \\ pairs, no trailing \, and no 's left over.)
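A rough Ruby sketch of that idea (my own illustration, including the stated limitations):
# Replace every character outside printable ASCII, plus backslash and the
# single quote, with its \uXXXX JSON escape. BMP code points only; anything
# above U+FFFF would need a surrogate pair.
def escape_for_single_quotes(json)
  json.gsub(/[^ -~]|['\\]/) { |c| format('\u%04X', c.ord) }
end

escape_for_single_quotes(%q<{"pwned": "you' & kill world;"}>)
# produces: {"pwned": "you\u0027 & kill world;"}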
It's ok. Just escape the character you use to wrap the string:
' should become '\''
So the JSON string
{"pwned": "you' & kill world;"}
becomes
{"pwned": "you'\'' & kill world;"}
and your final command, as the shell sees it, will be:
$ myscript '{"pwned": "you'\'' & kill world;"}'

How can I decode HTML entities?

Here's a quick Perl question:
How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't want all the Unicode characters, you can use the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));
# That prints: Bei Jing
Note that there are hex-specified characters too. They look like this: &#xe9; (é).
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (Perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.
There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard code.
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
I use this script. Save it as html2utf.py, and use it like echo $some_html | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html
for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to perl. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character, since it doesn't usually affect Bash text parsing. With Perl, and with this particular method, it's important to use echo -n, or else perl will interpret the newline \n character as a literal part of the input, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals &amp; Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze