Differences between XML string encoding in MRI and JRuby - jruby

I'd like to use JRuby to create some XML files, but it is not escaping characters in the same way MRI Ruby does.
> "<node attr=#{'this is "my" complicated <String>'.encode(:xml => :attr)} />"
MRI
ruby-1.9.3-p194
=> "<node attr=\"this is &quot;my&quot; complicated &lt;String&gt;\" />"
JRuby
jruby-1.7.2
=> "<node attr=this is \"my\" complicated <String> />"

Please don't create XML like this. Use Nokogiri or another XML library.
require 'rubygems'
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
xml.node(:attr => 'this is "my" complicated <String>')
end
puts builder.to_xml
# prints: <node attr="this is &quot;my&quot; complicated &lt;String&gt;"/>
See also Nokogiri::XML::Builder documentation

This is indeed a JRuby bug. It has now been fixed in master and should work in JRuby 1.7.4.
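One way to check whether a given interpreter has the fix is to compare its result with the value MRI produces. This is only a minimal sketch: the helper name xml_attr and the constant EXPECTED_ATTR are made up for illustration, and the expected string is the MRI behaviour described in the question.

```ruby
# Hypothetical helper: wraps String#encode(xml: :attr), which on MRI escapes
# &, <, > and " and surrounds the result with double quotes.
def xml_attr(value)
  value.encode(xml: :attr)
end

# The value MRI returns for the string from the question.
EXPECTED_ATTR = '"this is &quot;my&quot; complicated &lt;String&gt;"'

actual = xml_attr('this is "my" complicated <String>')
puts "<node attr=#{actual} />"
puts(actual == EXPECTED_ATTR ? 'attribute escaping OK' : 'attribute escaping differs from MRI')
```

On an affected JRuby release the second line would report a difference, which makes this a cheap guard to run before generating files.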

Related

JSON encoding in Perl output

Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code, while my Python version uses .yaml files as config.
Therefore, I basically had to write a converter between Perl and YAML. From what I found, Perl does not play well with YAML, but there are libraries that can dump Perl hashes to JSON, and Python works with JSON almost natively, so I used JSON as an intermediate format: Perl -> JSON -> YAML. The first conversion is done in Perl code, and the second one in Python code (which also does some mangling of the data).
Using the library mentioned by @simbabque, I can output YAML natively, which I must afterwards modify and play with. As I know next to nothing about Perl, I prefer to do that in Python.
Problem:
The source config files look something like this:
$sites = {
"0100101001" => {
mail => 1,
from => 'mail@mail.com',
to => 'mail@mail.com',
subject => 'á é í ó ú',
msg => 'á é í ó ú',
ftp => 0,
sftp => 0,
},
"22222222" => {
[...]
And many more of those.
My "parsing" code is the following:
use strict;
use warnings;
# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;
my $conf;
GetOptions('conf=s' => \$conf) or die;
our (
$sites
);
do $conf;
# my $json = encode_json($sites);
my $yaml = Dump($sites);
binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);
Nothing out of the ordinary. I simply need the YAML version of the Perl data (originally JSON, hence the commented-out lines). In fact, it mostly works. My problem is with the encoding.
The output of the above code is this:
[...snip...]
mail: 1
msg: Ã¡ Ã© Ã­ Ã³ Ãº
sftp: 0
subject: Ã¡ Ã© Ã­ Ã³ Ãº
[...snip...]
The encoding goes to hell and back. As far as I read, UTF-8 is the default, and just in case, I force it with binmode, but to no avail.
What am I missing here? Any workaround?
Note: I thought it might have been my shell, but locale outputs this:
❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Which seems ok.
Note 2: I know next to nothing about Perl, and it is not my intent to become an expert in it, so any enhancements/tips are greatly appreciated too.
Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.
The sites config file is UTF-8 encoded. Here are three workarounds:
Put the use utf8 pragma inside the site configuration file. The use utf8 pragma in the main script only applies to the main script's own source; it does not make files included with do/require be treated as UTF-8 encoded.
If that is not feasible, decode the input before you pass it to the JSON encoder. Something like
open CFG, "<:encoding(utf-8)", $conf;
do { local $/; eval <CFG> };
close CFG;
instead of
do $conf
Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (Unicode code points) and the output is UTF-8 encoded. The output of to_json is not encoded, or rather, it will have the same encoding as the input, which is what you want.
There is no need to encode the final output as UTF-8. Using any of the three workarounds will already produce UTF-8 encoded output.
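The underlying problem is double encoding: the UTF-8 bytes from the config file are taken as individual Latin-1 characters and then encoded as UTF-8 a second time on output. The sketch below only illustrates that mechanism, and it does so in Ruby rather than Perl; the helper name double_encode is made up for the example.

```ruby
# Illustration only (Ruby, not the Perl from the question): what happens
# when UTF-8 bytes are mistaken for Latin-1 characters and then encoded
# as UTF-8 a second time.
def double_encode(text)
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

original = "\u00E1 \u00E9"          # two accented letters separated by a space
mangled  = double_encode(original)  # each accented letter turns into two characters

puts original.length  # 3 characters
puts mangled.length   # 5 characters
puts mangled
```

Each of the three workarounds above avoids one of the two encoding steps, so the text is encoded exactly once.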

Rails - convert ascii to characters

I'm using Rails 5 to show database content in a web browser.
In the db, all of the special characters are written in their ascii form. For instance, instead of an apostrophe, it's written as &#39;.
Thus, my view is showing the ascii code. Is there a way to convert them all to characters for the view?
To transform ANY string containing HTML character entities, using Rails:
CGI.unescape_html "It doesn&#39;t look right" # => "It doesn't look right"
The CGI module is in the Ruby standard library and is required by Rails by default. If you want to do the same in a non-Rails project:
require 'cgi'
CGI.unescape_html "It doesn&#39;t look right"
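As a slightly larger sketch of the same call, the snippet below decodes a string that mixes decimal, hexadecimal and named entities. The sample value and the constant names are made up for illustration; CGI.unescapeHTML is the method that unescape_html aliases.

```ruby
require 'cgi'

# Sample value made up for illustration; it mixes decimal (&#39;),
# hexadecimal (&#x27;) and named (&amp;, &lt;, &gt;) entities.
STORED  = 'It doesn&#39;t look right &amp; it&#x27;s &lt;b&gt;bold&lt;/b&gt;'
DECODED = CGI.unescapeHTML(STORED)

puts DECODED
# prints: It doesn't look right & it's <b>bold</b>
```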
Based on your example here's a simple Ruby solution if you want to define your own helper
39.chr # => "'"
'&#39;'.delete('&#;').to_i.chr # => "'"
module ApplicationHelper
def ascii_to_char(ascii)
ascii.delete('&#;').to_i.chr
end
end
# in the views
ascii_to_char('&#39;') # => "'"
If what you really need is full HTML unescaping, see @forsym's answer
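The helper above expects its argument to be exactly one entity. A minimal sketch of a variant that replaces every decimal numeric entity inside a longer string could look like this; the method name numeric_entities_to_chars is an assumption, not part of Rails or Ruby.

```ruby
# Hypothetical helper (the name is an assumption): replaces every decimal
# numeric entity such as &#39; anywhere in the string and leaves all other
# text, including named entities, untouched.
def numeric_entities_to_chars(text)
  text.gsub(/&#(\d+);/) { Regexp.last_match(1).to_i.chr(Encoding::UTF_8) }
end

puts numeric_entities_to_chars('It doesn&#39;t look right in the caf&#233;')
```

Passing Encoding::UTF_8 to chr matters for code points above 255, which a bare chr would reject.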
Characters were fed through some "html entities" conversion before being stored in the database. Go back and fix that.

MATLAB: Read HTML-Codes (within XML)

I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the input file, which should contain only ASCII characters (everything else being written as entities). The Ruby code to convert the file could look like this:
#!/usr/bin/env ruby
require 'htmlentities'
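# read the whole input file
xml = File.read("file.xml")
# replace the entities with the characters they stand for
converted_xml = HTMLEntities.new.decode xml
# write the decoded text, not the original input
IO.write "decoded_file.xml", converted_xml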
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
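One caveat: decoding every entity also turns references such as &lt; and &amp; into raw characters, which can leave the result ill-formed as XML. A minimal sketch of an alternative that uses only the Ruby standard library, decodes numeric character references and leaves the XML special characters escaped is shown below; the names XML_SPECIAL and decode_char_refs are assumptions for this example.

```ruby
# Code points that must stay escaped so the XML remains well-formed: " & ' < >
XML_SPECIAL = [34, 38, 39, 60, 62].freeze

# Hypothetical helper: decodes decimal (&#322;) and hexadecimal (&#x142;)
# character references, but leaves references to XML's special characters
# and all named entities as they are.
def decode_char_refs(xml)
  xml.gsub(/&#(x\h+|\d+);/i) do |ref|
    digits = Regexp.last_match(1)
    code = digits.downcase.start_with?('x') ? digits[1..-1].hex : digits.to_i
    XML_SPECIAL.include?(code) ? ref : [code].pack('U')
  end
end

puts decode_char_refs('<t word="k&#322;ania&#322;" note="a &#60; b &amp; c"/>')
# prints: <t word="kłaniał" note="a &#60; b &amp; c"/>
```

To process a whole file, the same helper could be applied to the result of File.read and written back with IO.write, as in the script above.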

CGI table with perl

I am trying to build a login form with CGI, using perl.
sub show_login_form{
return div ({-id =>'loginFormDiv'}),
start_form, "\n",
CGI->start_table, "\n",
CGI->end_table, "\n",
end_form, "\n",
div, "\n";
}
I was wondering why I don't need to add CGI-> before start_form but if I don't include it before start_table and end_table, "start_table" and "end_table" are printed as strings?
Thank you for your help.
Why can I use some subroutines without CGI->?
Because you are likely importing them using the following use statement:
use CGI qw(:standard);
As documented in CGI - Using the function oriented interface, this will import "standard" features, 'html2', 'html3', 'html4', 'ssl', 'form' and 'cgi'.
But that does not include the table methods.
To get them too, you can modify your use statement to the following:
use CGI qw(:standard *table);
Why does removing CGI-> print start_table as a string?
Because you unwisely do not have use strict turned on.
If you had, you would've gotten the following error:
Bareword "start_table" not allowed while "strict subs" in use

Intelligent RegEx in Perl?

Background
Consider the following input:
<Foo
Bar="bar"
Baz="1"
Bax="bax"
>
After processing, I need it to look like the following:
<Foo
Bar="bar"
Baz="1"
Bax="bax"
CustomAttribute="TRUE"
>
Implementation
This is all I need to do for no more than 5 files, so using anything other than a regular expression seems like overkill. Anyway, I came up with the following (Perl) regular expression to accomplish this:
$data =~ s/(<\s*Foo)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
Problems
This works well, however, there is one obvious problem. This sort of pattern is "dumb" because if CustomAttribute has already been added, the operation outlined above will simply append another CustomAttribute=... blindly.
A simple solution, of course, is to write a secondary expression that will attempt to match for CustomAttribute prior to running the replacement operation.
Questions
Since I'm rather new to the scripting language and regular expression worlds, I'm wondering whether it's possible to solve this problem without introducing any host language constructs (i.e., an if-statement in Perl), and simply use a more "intelligent" version of what I wrote above?
I won't beat you over the head with how you should not use a regex for this. I mean, you shouldn't, but you obviously know that from what you said in your question, so moving on...
Something that will accomplish what you're asking for is called a negative lookahead assertion (usually (?!...)), which basically says that you don't want the match to apply if the pattern inside the assertion is found ahead of this point. In your example, you don't want it to apply if CustomAttribute is already present, so:
$data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
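To see that the lookahead makes the substitution safe to run repeatedly, here is a minimal sketch of the same idea ported to Ruby (an assumption made for illustration; the question itself uses Perl). Ruby's /m flag plays the role of Perl's /s, and gsub replaces every match like /g. The names ADD_ATTR and add_custom_attribute are made up.

```ruby
# Ruby port of the lookahead idea: match a Foo start tag only if no
# CustomAttribute= appears before the closing '>'.
ADD_ATTR = /(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/mi

def add_custom_attribute(data)
  data.gsub(ADD_ATTR, '\1\2 CustomAttribute="TRUE">')
end

input = %(<Foo\nBar="bar"\nBaz="1"\n>)
once  = add_custom_attribute(input)
twice = add_custom_attribute(once)

puts once
puts twice == once # prints true: the second pass changes nothing
```

The same unwritten assumptions apply as with any regex over markup, for example that no attribute value contains a '>'.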
This sounds like it might be a job for XML::Twig, which can process XML and change parts of it as it runs into them, including adding attributes to tags. I suspect you'd spend as much time getting used to Twig as you would finding a regex solution that only mostly worked. And, at the end you'd know enough Twig to use it on the next project. :)
Time for a lecture I guess ;--)
I am not sure why you think using a full-blown XML processor is overkill. It is actually easier to write the code using the proper tool. A regexp will be more complex and will rely on unwritten assumptions about the data, which is dangerous. Some of those assumptions are likely to be: no '>' in attribute values, no CDATA sections, no non-ascii characters in tag or attribute names, consistent attribute value quoting...
The only thing a regexp will give you is the assurance that the output keeps the original format of the data (in your case the fact that the attributes are each on a separate line). But if your format is consistent that can be done, and if not it should not matter, unless you keep your XML in a line-oriented revision control system.
Here is an example with XML::Twig. It assumes you have enough memory to keep any entire Foo element in memory, and it works even on the admittedly contrived bit of XML in the DATA section. It would probably be just as easy to do with XML::LibXML (read the XML in memory, select all Foo elements, add attribute to each of them, output, that's 5 easy to understand lines by my count).
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my( $tag, $att, $val)= ( 'Foo', 'CustomAttribute', 'TRUE');
XML::Twig->new( # only process those elements
twig_roots => { $tag => sub {
# add/set attribute
$_->set_att( $att => $val);
# output and free memory
$_->flush;
}
},
twig_print_outside_roots => 1, # output everything else
pretty_print => 'cvs', # seems to be the right format
)
->parse( \*DATA) # use parsefile( $file) if parsing... a file
->flush; # not needed in XML::Twig 3.33
__DATA__
<doc>
<Foo
Bar="bar"
Baz="1"
Bax="bax"
>
here is some text
</Foo>
<Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
<bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
<Foo><![CDATA[<Foo no_att="1" CustomAttribute="TRUE">tricked?</Foo>]]></Foo>
<Foo
Bar=">"
Baz="1"
Bax="bax"
></Foo>
<Foo
Bar="
>"
Baz="1"
Bax="bax"
></Foo>
<Foo
Bar=">"
Baz="1"
Bax="bax"
CustomAttribute="TRUE"
></Foo>
<Foo
Bar="
>"
Baz="1"
Bax="b
ax"
CustomAttribute="TR
UE"
></Foo>
</doc>
You can send your matches through a function with the 'e' modifier for more processing.
my $str = qq`
<Foo
Bar="bar"
Baz="1"
Bax="bax"
CustomAttribute="TRUE"
>
<Foo
Bar="bar"
Baz="1"
Bax="bax"
>
`;
sub foo {
my $guts = shift;
$guts .= qq` CustomAttribute="TRUE"` if $guts !~ m/CustomAttribute/;
return $guts;
}
$str =~ s/(<Foo )([^>]*)(>)/$1.foo($2).$3/xsge;