Perl: Why do i need to set the latin1 flag explicitly since JSON 2.xx? - json

Since JSON 2.xx i need to set the latin1 flag in order to get umlauts safe to the html document:
my $obj_with_umlauts = {
title => 'geändert',
}
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx :
my $json = JSON->new()->objToJson($obj_with_umlauts);
The html document is in iso-8559-1 (meta-tag).
Can anybody explain to me why?

This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".

What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definetly is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and it's Unicode code point U+00E4, two different notation for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
The html document is in iso-8559-1 (meta-tag).
Then at some point, you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, or using Encode's encode or using JSON's ->latin1 directive. The advantage of using this final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
Can anybody explain to me why?
You have a code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. There is a bug is in that module.

Related

Decode or unescape \u00f0\u009f\u0091\u008d to 👍

We all know UTF-8 is hard. I exported my messages from Facebook and the resulting JSON file escaped all non-ascii characters to unicode code points.
I am looking for an easy way to unescape these unicode code points to regular old UTF-8. I also would love to use PowerShell.
I tried
$str = "\u00f0\u009f\u0091\u008d"
[Regex]::Replace($str, "\\[Uu]([0-9A-Fa-f]{4})", `
{[char]::ToString([Convert]::ToInt32($args[0].Groups[1].Value, 16))} )
but that only gives me ð as a result, not 👍.
I also tried using Notepad++ and I found this SO post: How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++. The accepted answer also results in exactly the same as the example above: ð.
I found the decoding solution here: the UTF8.js library that decodes the text perfectly and you can try it out here (with \u00f0\u009f\u0091\u008d as input).
Is there a way in PowerShell to decode \u00f0\u009f\u0091\u008d to receive 👍? I'd love to have real UTF-8 in my exported Facebook messages so I can actually read them.
Bonus points for helping me understand what \u00f0\u009f\u0091\u008d actually represents (besides it being some UTF-8 hex representation). Why is it the same as U+1F44D or \uD83D\uDC4D in C++?
The Unicode code point of the 👍character is U+1F44D.
Using the variable-length UTF-8 encoding, the following 4 bytes (expressed as hex. numbers) are needed to represent this code point: F0 9F 91 8D.
While these bytes are recognizable in your string,
$str = "\u00f0\u009f\u0091\u008d"
they shouldn't be represented as \u escape codes, because they're not Unicode code units / code point, they're bytes.
With a 4-hex-digit escape sequence (UTF-16), the proper representation would require 2 16-bit Unicode code units, a so-called surrogate pair, which together represent the single non-BMP code point U+1F44D:
$str = "\uD83D\uDC4D"
If your JSON input used such proper Unicode escapes, PowerShell would process the string correctly; e.g.:
'{ "str": "\uD83D\uDC4D" }' | ConvertFrom-Json > out.txt
If you examine file out.txt, you'll see something like:
str
---
👍
(The output was sent to a file, because console windows wouldn't render the 👍char. correctly, at least not without additional configuration; note that if you used PowerShell Core on Linux or macOS, however, terminal output would work.)
Therefore, the best solution would be to correct the problem at the source and use proper Unicode escapes (or even use the characters themselves, as long as the source supports any of the standard Unicode encodings).
If you really must parse the broken representation, try the following workaround (PSv4+), building on your own [regex]::Replace() technique:
$str = "A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."
[regex]::replace($str, '(?:\\u[0-9a-f]{4})+', { param($m)
$utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
[text.encoding]::utf8.GetString($utf8Bytes)
})
This should yield A 👍 for Motörhead.
The above translates sequences of \u... escapes into the byte values they represent and interprets the resulting byte array as UTF-8 text.
To save the decoded string to a UTF-8 file, use ... | Set-Content -Encoding utf8 out.txt
Alternatively, in PSv5+, as Dennis himself suggests, you can make Out-File and therefore it's virtual alias, >, default to UTF-8 via PowerShell's global parameter-defaults hashtable:
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
Note, however, that on Windows PowerShell (as opposed to PowerShell Core) you'll get an UTF-8 file with a BOM in both cases - avoiding that requires direct use of the .NET framework: see Using PowerShell to write a file in UTF-8 without the BOM
iso-8859-1 - very often - intermediate member in operations with Utf-8
$text=[regex]::Unescape("A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead.")
Write-Host "[regex]::Unescape(utf-8) = $text"
$encTo=[System.Text.Encoding]::GetEncoding('iso-8859-1') # Change it to yours (iso-8859-2) i suppose
$bytes = $encTo.GetBytes($Text)
$text=[System.Text.Encoding]::UTF8.GetString($bytes)
Write-Host "utf8_DecodedFrom_8859_1 = $text"
[regex]::Unescape(utf-8) = A ð for Motörhead.
utf8_DecodedFrom_8859_1 = A 👍 for Motörhead.
What pleases in mklement0 example - it is easy to get an encoded string of this type.
What is bad - the line will be huge. (First 2 nibbles '00' is a waste)
I must admit, the mklement0 example is charming.
The code for encoding - one line only!!!:
$emoji='A 👍 for Motörhead.'
[Reflection.Assembly]::LoadWithPartialName("System.Web") | Out-Null
$str=(([System.Web.HttpUtility]::UrlEncode($emoji)) -replace '%','\u00') -replace '\+',' '
$str
You can decode this by the standard url way:
$str="A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."
$str=$str -replace '\\u00','%'
[Reflection.Assembly]::LoadWithPartialName("System.Web") | Out-Null
[System.Web.HttpUtility]::UrlDecode($str)
A 👍 for Motörhead.

Perl Remove invalid characters, invalid latin1 characters from string

I have a perl script that reads from a web service and saves in a mysql table. this table uses latin1. from the web service there are coming some wrong characters and need to remove them before saving them in the database, otherwise they get saved as '?'
wanted to do something similar as:
$desc=~s///gsi;
but is not removing them.
the webservice that has the wrong characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478
using a user agent to get the data, seems coming in utf8 but the characters need to be removed:
my $ua = LWP::UserAgent->new ();
$ua->default_headers->push_header ('Accept' =>
"text/html,application/xhtml" .
"+xml,application/xml");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");
my $doc = $ua->get ("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478")
If you just want to remove the characters outside the 7-bit ascii set (which are sufficient to display messages in english), you can you do this:
$desc=~s/[^\x00-\x7f]//g
Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:
use Encode;
$desc=encode('latin-1',$desc,sub {''});
This will remove exactly the characters that cannot be represented by latin-1. Note that this line expects that the utf-8 flag is on for the string $desc and that the resulting string will have the utf-8 flag is off.
Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.
The content sent by the web service is XML that contains HTML in the Description tag. If this is that content that worries you, another option than deleting non-Latin-1 character is to encode characters using HTML encoding:
$desc =~ s/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge
Here is an example:
$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge'
&233;
Change your column definition to CHARACTER SET utf8mb4 so that the naughty character does not need to be removed, and can actually be stored.

Perl JSON encode in UTF-8 strange behaviour

Based on Perl JSON 2.90 documentation, to encode JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious and this what I did. After a while, I got an issue report on GitHub from one of users, which made me really surprised, as it shouldn't be happening!
I was beating for hours to figure out what was happening but the solution happened to be very weird and wrong from my point of view.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese - it just worked?
Can anyone please explain, why encoding is working correctly with latin1 and not with utf8, when it's actually has to be visa versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3 A9, are suffering from this bug.
If you are suffering from the this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, are suffering from this bug.
If you are suffering from the this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for writing encoded data to). As unintuitively as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).

Scrambled umlauts since upgrade from JSON1 to JSON2 in Perl

I wondered why some german umlauts were scrambled on our page.
Then i found out that the recent version of JSON (i use 2.07) does convert strings in an other manner than JSON 1.5.
Problem here is that i have a hash with strings like
use Data::Dumper;
my $test = {
'fields' => 'überrascht'
};
print Dumper(to_json($test)); gives me
$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } ";
Using the old module using
$json = JSON->new();
print Dumper ($json->to_json($test));
gives me (the correct result)
$VAR1 = '{"fields":[{"title":"überrascht"}]}';
So umlauts are scrammbled using the new JSON 2 module.
What do i need to get them correct?
Update: It might be bad to use Data::Dumper to show output, because Dumper uses its own encoding. Well, a difference in the result from Dumper shows that anything is treated differently here. It might be better to describe the backend as Brad mentioned:
The json string gets printed using Template-Toolkit and then gets assigned to a javascript variable for further use. The correct javascript shows something like this
{
"title" : "Geändert",
},
using the new module i get
{
"title" : "Geändert",
},
The target page is in 8859-1 (latin1).
Any suggestions?
\x{fc} is ü, at least in Latin-1, Latin-9 etc. Also, ü is codepoint U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest solution to get UTF-8 string literals is to save your Perl source code with this encoding, and put a use utf8; at the top of your script.
Then, encoding the string as JSON yields correct output:
use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};
The encode_json assumes UTF-8. Read the documentation for more info.
Output:
$VAR1 = '{"fields":"nicht überrascht"}';
(JSON module version: 2.53)
my $json_text = to_json($data);
is short for
my $json_text = JSON->new->encode($data);
This returns a string of Unicode Code Points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that is actually "ü".)
It's hard to tell what your original output actually contained (since you showed non-ASCII characters), so it's hard to determine what your problem is actually.
But one thing you must do before outputing the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.
my $json_cp1252 = encode('cp1252', to_json($data));
my $json_utf8 = encode_utf8(to_json($data));
If the appropriate encoding is UTF-8, you can also use any of the following:
my $json_utf8 = to_json($data, { utf8 => 1 });
my $json_utf8 = encode_json($data);
my $json_utf8 = JSON->new->utf8->encode($data);
Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.
Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:
to_json($test, { latin1 => 1 })
Or
JSON->new->latin1->encode($test)
Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.
You can also try to use ascii instead of latin1 as the safest option.
Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
Strictly-speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.
If you want to be standards-compliant or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generated valid JSON could save you lots of headaches in the future.
You can translate UTF-8 JSON to Latin-1 using client-side Javascript if you need to, using this trick.
The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO. The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate it on-the-fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.

Data Type of Module Output

I have a script that I run on various texts to convert XHTML (e.g., ü) to ASCII. For Example, my script is written in the following manner:
open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";
while (<INPUT>) {
s/&uuml/ü/g;
}
print OUTPUT $_;
This works as expected and substitutes the XHTML with the ASCII equivalent. However, since this is often run, I've attempted to convert it into a module. But, Perl doesn't return "ü" it returns the decomposition. How can I get Perl to return the data back with the ASCII equivalent (as run and printed in my regular .pl file)?
There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü, it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output in Unicode, at least your Perl script should be encoded with UTF-8. To signal this to the perl interpreter, use utf8 at the top of your script.
Then open the input file with an encoding layer like this:
open my $fh, "<:encoding(UTF-8)", $filename
The same goes for the output file. Just make sure to specify an an encoding when you want to use one.
You can change the encoding of a file with binmode, just see the documentation.
You can also use the Encode module to translate a byte string to unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus in the I/O.