Convert emoji Unicode byte sequences to Unicode characters with jq - json

I'm filtering Facebook Messenger JSON dumps with jq. The source JSON contains emojis as Unicode sequences. How can I output these back as emojis?
echo '{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}' | jq -c '.'
Actual result:
{"content":"ð¤·ð¿ââï¸"}
Desired result:
{"content":"🤷🏿‍♂️"}

@chepner's use of Latin-1 in Python finally shook free in my head how to do this with jq almost directly. You'll need to pipe through iconv:
$ echo '{"content":"\u00f0\u..."}' | jq -c . | iconv -t latin1
{"content":"🤷🏿‍♂️"}
In JSON, the string \u00f0 does not mean "the byte 0xF0, as part of a UTF-8 encoded sequence." It means "Unicode code point 0x00F0." That's ð, and jq is displaying it correctly as the UTF-8 encoding 0xc3 0xb0.
The iconv call reinterprets the UTF-8 string for ð (0xc3 0xb0) back into Latin-1 as 0xf0 (Latin-1 exactly matches the first 256 Unicode code points). Your UTF-8 capable terminal then interprets that as the first byte of a UTF-8 sequence.
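For comparison, the same reinterpretation can be written directly in Python (a minimal sketch using the escaped string from the question; this is only an illustration, not part of the original answer):
import json

raw = r'{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}'
content = json.loads(raw)["content"]              # code points U+00F0, U+009F, ... (the mojibake string)
print(content.encode("latin-1").decode("utf-8"))  # reinterpret those code points as UTF-8 bytes: 🤷🏿‍♂️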

The problem is that the response contains the UTF-8 encoding of the Unicode code points, not the code points themselves. jq cannot decode this itself. You could use another language; for example, in Python
>>> import json
>>> x = json.load(open("response.json"))['content']
>>> x
'ð\x9f¤·ð\x9f\x8f¿â\x80\x8dâ\x99\x82ï¸\x8f'
>>> x.encode('latin1').decode()
'🤷🏿\u200d♂️'
It's not exact, but I'm not sure the encoding is unambiguous. For example,
>>> x.encode('latin1')
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode()
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode().decode()
'🤷🏿\u200d♂️'
The result of re-encoding the response using Latin-1 is identical to encoding the desired emoji as UTF-8, but decoding doesn't give back precisely the same emoji (or at least, Python isn't rendering it identically).

Here's a jq-only solution. It works with both the C and Go implementations of jq.
# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then 0
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
       | reverse
       | .[1:]            # remove the leading 0
  end ;

def binary_to_decimal:
  reduce reverse[] as $b ({power:1, result:0};
    .result += .power * $b
    | .power *= 2)
  | .result;

# input: an array of decimal integers representing the UTF-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
  # Magic numbers:
  # x80: 128   # 10000000
  # xe0: 224   # 11100000
  # xf0: 240   # 11110000
  (-6) as $mb            # non-first bytes start 10 and carry 6 bits of data
                         # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                         # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                         # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
  | map(binary_digits) as $d
  | .[0]
  | if . < 128 then $d[0]
    elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
    elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
    else [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
    end
  | binary_to_decimal ;

{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}
| .content |= (explode | [utf8_decode] | implode)
Transcript:
$ jq -nM -f program.jq
{
  "content": "🤷"
}
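For reference, the bit-twiddling that utf8_decode performs can also be sketched in Python (an illustrative analogue of the jq filter above, not part of the original answer):
def utf8_decode(byte_seq):
    # byte_seq: list of ints holding the UTF-8 bytes of a single code point
    b0 = byte_seq[0]
    if b0 < 0x80:            # 0xxxxxxx: 1-byte sequence
        return b0
    elif b0 < 0xE0:          # 110xxxxx: 2-byte sequence, 5 data bits in the lead byte
        return ((b0 & 0x1F) << 6) | (byte_seq[1] & 0x3F)
    elif b0 < 0xF0:          # 1110xxxx: 3-byte sequence, 4 data bits in the lead byte
        return ((b0 & 0x0F) << 12) | ((byte_seq[1] & 0x3F) << 6) | (byte_seq[2] & 0x3F)
    else:                    # 11110xxx: 4-byte sequence, 3 data bits in the lead byte
        return ((b0 & 0x07) << 18) | ((byte_seq[1] & 0x3F) << 12) | ((byte_seq[2] & 0x3F) << 6) | (byte_seq[3] & 0x3F)

print(hex(utf8_decode([0xF0, 0x9F, 0xA4, 0xB7])))   # 0x1f937, i.e. 🤷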

First of all, you need a font which supports this.
You are confusing Unicode composed characters with their UTF-8 encoding. Also note that a JSON \u escape takes exactly four hex digits, so code points above U+FFFF must be written as a surrogate pair. It has to be either:
$ echo '{"content":"\ud83e\udd37\u200d\u2642"}' | jq -c '.'
or
$ echo '{"content":"\ud83e\udd37\u200d\u2642\ufe0f"}' | jq -c '.'

PowerShell converting "&" to "\u0026" while converting values to JSON

RoleFullPath
Applications\User Admin & Support-DEMO
PowerShell Code
$NewJSON.roleFullPath = $Line.RoleFullPath
.
.
.
.
$JSONPath = $RolePath + $FolderName + "-JSON.json"
Convertto-JSON $NewJSON | Out-file -Encoding "UTF8" $JSONPath
Output:
"roleFullPath": "Applications\\User Admin \u0026 Support-DEMO"
While converting from CSV to JSON, the character '&' gets converted to '\u0026'.
Any help?
In Windows PowerShell v5.1, ConvertTo-Json indeed unexpectedly encodes & characters as the Unicode escape sequence \u0026, where 0026 is the hex representation of the code point of the & character, U+0026.
(PowerShell Core, by contrast, preserves the & as-is.)
That said, JSON parsers should be able to interpret such escape sequences and, indeed, the complementary ConvertFrom-Json cmdlet does.
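To illustrate that point (a quick sketch in Python rather than PowerShell, purely for demonstration, using the question's sample value):
import json

# The \u0026 escape is valid JSON; any parser converts it back to a literal '&'.
parsed = json.loads(r'{"roleFullPath": "Applications\\User Admin \u0026 Support-DEMO"}')
print(parsed["roleFullPath"])   # Applications\User Admin & Support-DEMO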
Note: The solutions below are general ones that can handle the Unicode escape sequences of any Unicode character; since ConvertTo-Json seemingly only uses these Unicode escape-sequence representations for the characters &, ', < and >, a simpler solution is possible, unless false positives must be ruled out - see this answer.
That said, if you do want to manually convert Unicode escape sequences into their character equivalents in JSON text, you can use the following - limited solution:
# Sample JSON with Unicode escapes.
$json = '{ "roleFullPath": "Applications\\User Admin \u0026 Support-DEMO" }'
# Replace Unicode escapes with the chars. they represent,
# with limitations.
[regex]::replace($json, '\\u[0-9a-fA-F]{4}', {
    param($match) [char] [int] ('0x' + $match.Value.Substring(2))
})
The above yields:
{ "roleFullPath": "Applications\\User Admin & Support-DEMO" }
Note how \u0026 was converted to the char. it represents, &.
A robust solution requires more work:
There are characters that must be escaped in JSON and cannot be represented literally, so for the conversion to literal characters to work generically, these characters must be excluded.
Additionally, false positives must be avoided; e.g., \\u0026 is not a valid Unicode escape sequence, because a JSON parser interprets \\ as an escaped \ followed by verbatim u0026.
Finally, the Unicode sequences for " and \ must be translated into their escaped forms, \" and \\, and it is possible to represent a few ASCII-range control characters by C-style escape sequences, e.g., \t for a tab character (\u0009).
The following robust solution addresses all these issues:
# Sample JSON with Unicode escape sequences:
# \u0026 is &, which CAN be converted to the literal char.
# \u000a is a newline (LF) character, which CANNOT be converted, but can
# be translated to escape sequence "\n"
# \\u0026 is *not* a Unicode escape sequence and must be preserved as-is.
$json = '{
  "roleFullPath": "Applications\u000aUser Admin \u0026 Support-DEMO-\\u0026"
}'

[regex]::replace($json, '(?<=(?:^|[^\\])(?:\\\\)*)\\u([0-9a-fA-F]{4})', {
  param($match)
  $codePoint = [int] ('0x' + $match.Groups[1].Value)
  if ($codePoint -in 0x22, 0x5c) {
    # " or \ must be \-escaped.
    '\' + [char] $codePoint
  }
  elseif ($codePoint -in 0x8, 0x9, 0xa, 0xc, 0xd) {
    # Control chars. that can be represented as short, C-style escape sequences.
    ('\b', '\t', '\n', $null, '\f', '\r')[$codePoint - 0x8]
  }
  elseif ($codePoint -le 0x1f -or [char]::IsSurrogate([char] $codePoint)) {
    # Other control chars. and halves of surrogate pairs must be retained
    # as escape sequences.
    # (Converting surrogate pairs to a single char. would require much more effort.)
    $match.Value
  }
  else {
    # Translate to literal char.
    [char] $codePoint
  }
})
Output:
{
  "roleFullPath": "Applications\nUser Admin & Support-DEMO-\\u0026"
}
To stop PowerShell from doing this, pipe your JSON output through this:
$jsonOutput | ForEach-Object { [System.Text.RegularExpressions.Regex]::Unescape($_) } | Set-Content $jsonPath -Encoding UTF8;
This will prevent the & being converted :)

Base64 encoding, new line after 60 chars vs new line after 76 chars

Before you start reading all of my long post: I figured out why it doesn't work. It's about Base64; Ruby's Base64.encode64 adds a newline after every 60 characters by default. This link: base64 encode length parameter
is a bit of an explanation. Is there a Base64 module which adds a newline after every 76 chars?
So the solution is just this:
@checkout_signature = Base64.strict_encode64(@signature + "|" + checkout_request)
@checkout_signature = @checkout_signature.insert(76, "\n") # once
or
@checkout_signature.gsub!(/.{76}(?=.)/, '\0' + "\n") # every 76 chars
Please help me translate this bash script to Ruby in a proper way. Actually, just the Base64 encoding is wrong.
UPDATE
bash way
#!/bin/bash
echo 'Signature test'
export checkout_request='{"charge":{"amount":499,"currency":"EUR"}}'
echo $checkout_request
export signature=`echo -n "$checkout_request" | openssl dgst -sha256 -hmac 'pr_test_tXHm9qV9qV9bjIRHcQr9PLPa' | sed 's/^.* //'`
echo $signature
echo '--------'
echo -n "$signature|$checkout_request" | base64
RESULT
{"charge":{"amount":499,"currency":"EUR"}}
cf9ce2d8331c531f8389a616a18f9578c134b784dab5cb7e4b5964e7790f173c
--------
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3
OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
ruby way
require 'openssl'
require 'base64'

checkout_request = '{"charge":{"amount":499,"currency":"EUR"}}'
secret_key = 'pr_test_tXHm9qV9qV9bjIRHcQr9PLPa'
@signature = OpenSSL::HMAC.hexdigest('sha256', secret_key, checkout_request)
puts @signature
puts "-----"
@checkout_signature = Base64.urlsafe_encode64(@signature + "|" + checkout_request)
puts @checkout_signature
RESULT
cf9ce2d8331c531f8389a616a18f9578c134b784dab5cb7e4b5964e7790f173c
-----
ODY4YzY4YTg4NmFmOTg2MGY5OGVjMmUyODM5OTBhYmViNmQyZjUzYWI5ZjgxMzlhYzFlODllNThhZTVhZTFkMnx7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
UPDATE
OK, the signature is fine, but I'm using the wrong Base64 encoding...
SOMEHOW
echo -n "$signature|$checkout_request" | base64
adds a newline after $signature,
and it's the only '\n' in the bash result.
In Ruby, when I use .encode64 I get many more \n's:
"Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVj\nYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwi\nY3VycmVuY3kiOiJFVVIifX0=\n"
when I use .strict_encode64 or .urlsafe_encode64 there is no \n
"Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0="
and the result from bash looks like this:
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3
OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
there must be a newline after c3; that's just the way printenv shows the variable. I can't inspect it like in Ruby.
WHY IS THERE SUCH A DIFFERENCE?
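For what it's worth, the three wrapping behaviors in play here can be sketched outside Ruby and bash too (a small illustrative Python example; the payload is a stand-in, not the real signature and request):
import base64

payload = b"signature|checkout_request"           # stand-in bytes for "$signature|$checkout_request"
raw = base64.b64encode(payload).decode()          # single line, no newlines (like Ruby's strict_encode64)
wrapped76 = "\n".join(raw[i:i+76] for i in range(0, len(raw), 76))   # GNU base64's default 76-char wrapping
wrapped60 = "\n".join(raw[i:i+60] for i in range(0, len(raw), 60))   # the 60-char wrapping Ruby's encode64 produces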

Extract UTF-encoded binary data from JSON using jq

Say I have a JSON document with a 0xb7 byte encoded as a Unicode code point:
{"key":"_\u00b7_"}
If I extract the value of "key" with jq, it keeps the UTF-8 encoding of this byte, which is "c2 b7":
$ echo '{"key":"_\u00b7_"}' | ./jq '.key' -r | xxd
0000000: 5fc2 b75f 0a _.._.
Is there any jq command that extracts the decoded "5f b7 5f" byte sequence out of this JSON?
I can solve this with extra tools like iconv but it's a bit ugly:
$ echo '{"key":"_\u00b7_"}' | ./jq '.key' -r \
| iconv -f utf8 -t utf32le \
| xxd -ps | sed -e 's/000000//g' | xxd -ps -r \
| xxd
0000000: 5fb7 5f0a _._.
def hx:
  def hex: [if . < 10 then 48 + . else 55 + . end] | implode ;
  tonumber | "\(./16 | floor | hex)\(. % 16 | hex)";

{"key":"_\u00b7_"} | .key | explode | map(hx)
produces:
["5F","B7","5F"]
"Raw Bytes" (caveat emptor)
Since jq only supports UTF-8 strings, you would have to use some external tool to obtain the "raw bytes". Maybe this is closer to what you want:
jq -nrj '{"key":"_\u00b7_"} | .key' | iconv -f utf-8 -t ISO8859-1
This produces the three bytes.
And here's an iconv-free solution:
jq -nrj '{"key":"_\u00b7_"} | .key' | php -r 'print utf8_decode(readline());'
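Another iconv-free route, this time sketched in Python (an illustration, not from the original answers): decode the JSON normally, then encode the resulting code points as Latin-1 to get one byte per code point.
import json, sys

key = json.loads(r'{"key":"_\u00b7_"}')["key"]    # the string "_·_" (code points 5F B7 5F)
sys.stdout.buffer.write(key.encode("latin-1"))    # writes the raw bytes 5f b7 5f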
Alternate
Addressing the character encoding scenario outside of jq:
Though you didn't want extra tools, iconv and hexdump are indeed readily available - I for one frequently lean on iconv when I require certain parts of a pipeline to be completely known to me, and hexdump when I want control of the formatting of the representation of those parts.
So an alternative is:
jq -njr '{"key":"_\u00b7_"} | .key' | iconv -f utf8 -t UTF-32LE | hexdump -ve '1/1 "%.X"'
Result:
5FB75F

Why is JSON::XS Not Generating Valid UTF-8?

I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;
BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass strings you've decoded to it. I'm not sure what the point of it is, but it's not the right tool. If you want to check if a string is valid UTF-8, you could use
ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
'$string is valid UTF-8');
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
        +----------------- Start of two-byte sequence
        |    +------------ Not zero (good)
        |    |     +------ Continuation byte indicator (good)
        |    |     |
        v    v     v
C2 AB = [110]00010 [10]101011
00010 101011 = 000 1010 1011 = U+00AB = «
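The same arithmetic can be checked quickly in Python (illustrative only):
# Assemble the data bits of C2 AB and confirm they form U+00AB:
print(hex(((0xC2 & 0x1F) << 6) | (0xAB & 0x3F)))   # 0xab
print(b"\xC2\xAB".decode("utf-8"))                 # «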
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;
foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
You have a miscoded character string that was decoded from a UTF-8 byte string as if it were Latin-1, or that came from double-encoded UTF-8; or the character string is perfectly fine but merely looks like a potentially "dodgy" miscoding (using the terminology from its docs).
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
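The double-encoding effect itself is easy to reproduce in other languages; here is a rough Python analogue (an assumption-level sketch, not part of the original answer):
# encode_json hands back UTF-8 bytes; pushing those bytes through another
# "encode as UTF-8" step (as the :encoding(UTF-8) layer does) garbles them.
s = "Deliver «French Bread»"
utf8_bytes = s.encode("utf-8")                           # what the JSON encoder returns (bytes)
double = utf8_bytes.decode("latin-1").encode("utf-8")    # a second, unwanted encoding pass
print(double.decode("utf-8"))                            # Deliver Â«French BreadÂ»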

How do I convert stored misencoded data?

My Perl app and MySQL database now handle incoming UTF-8 data properly, but I have to convert the pre-existing data. Some of the data appears to have been encoded as CP-1252 and not decoded as such before being encoded as UTF-8 and stored in MySQL. I've read the O'Reilly article Turning MySQL data in latin1 to utf-8, but although it's frequently referenced, it's not a definitive solution.
I've looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.
This is what I've done so far:
#Return the $bytes from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));
That fixes most of the cases, but if run against properly-encoded records, it mangles them. I've tried using Encode::Guess and Encode::Detect, but they cannot distinguish between the properly encoded and the misencoded records. So I just undo the conversion if the \x{FFFD} character is found after the conversion.
Some records, though, are only partially converted. Here's an example where the left curly quotes are properly converted, but the right curly quotes get mangled.
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'
And here's an example where a right single quote did not convert:
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'
Am I also dealing with double encoded data here? What more must I do to convert these records?
With the "four score" example, it almost certainly is doubly-encoded data. It looks like either:
cp1252 data that was run through a cp1252 to utf8 process twice, or
utf8 data that was run through a cp1252 to utf8 process
(Naturally, both cases look identical)
Now, that's what you expected, so why didn't your code work?
First, I'd like to refer you to this table which shows the conversion from cp1252 to unicode. The important thing I want you to note is that there are some bytes (such as 0x9D) which are not valid in cp1252.
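One quick way to see that, sketched in Python (this assumes Python's strict cp1252 codec and is only an illustration):
# 0x9D has no assignment in cp1252, so a strict decode rejects it.
b"\x9d".decode("cp1252")   # raises UnicodeDecodeError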
When I imagine writing a cp1252 to utf8 converter, therefore, I need to do something with those bytes that aren't in cp1252. The only sensible thing I can think of is to transform the unknown bytes into unicode characters at the same value. In fact, this appears to be what happened. Let's take your "four score" example back one step at a time.
First, since it is valid utf-8, let's decode with:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields this sequence of unicode code points:
e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d
("fmt" is a unix command that just reformats text so that we have nice line breaks with long data)
Now, let's represent each of these as a byte in cp1252, but when the unicode character can't be represented in cp1252, let's just replace it with a byte that has the same numeric value. (Instead of the default, which is to replace it with a question mark) We should then, if we're correct about what happened to the data, have a valid utf8 byte stream.
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
That third argument to encode - when it's a sub - tells what to do with unrepresentable characters.
This yields:
e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d
Now, this is a valid utf8 byte stream. Can't tell that by inspection? Well, let's ask perl to decode this byte stream as utf8:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
Passing "1" as the third argument to decode ensures that our code will croak if the byte stream is invalid. This yields:
201c 66 6f 75 72 20 73 63 6f 72 65 201d
Or printed:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
print "$a\n"'
“four score”
So I think that the full algorithm should be this:
Grab the byte stream from mysql. Assign this to $bytestream.
While $bytestream is a valid utf8 byte stream:
Assign the current value of $bytestream to $good
If $bytestream is all-ASCII (i.e., every byte is less than 0x80), break out of this "while ... valid utf8" loop.
Set $bytestream to the result of "demangle($bytestream)", where demangle is given below. This routine undoes the cp1252-to-utf8 converter we think this data has suffered from.
Put $good back in the database if it isn't undef. If $good was never assigned, assume $bytestream was a cp1252 byte stream and convert it to utf8. (Of course, optimize and don't do this if the loop in step 2 didn't change anything, etc.)
sub demangle {
    my($a) = shift;
    eval {   # the non-string form of eval just traps exceptions
             # so that we return undef on exception
        local $SIG{__WARN__} = sub {};   # No warning messages
        $a = decode("utf-8", $a, 1);
        encode("cp-1252", $a, sub { $_[0] <= 255 or die $_[0]; chr($_[0]) });
    }
}
This is based on the assumption that it's actually very rare for a string that isn't all-ASCII to be a valid utf-8 byte stream unless it really is utf-8. That is, it's not the sort of thing that happens accidentally.
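A rough Python rendering of that algorithm (a sketch under the same assumptions; the fallback mirrors the chr() handler in the Perl demangle and is not part of the original answer):
def encode_cp1252_lossless(text):
    # Like encode("cp-1252", $a, sub { ... chr($_[0]) }): characters that
    # cp1252 cannot represent fall back to the same-valued byte (if <= 0xFF).
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise
            out.append(ord(ch))
    return bytes(out)

def demangle(bytestream):
    # Undo one cp1252-to-utf8 round; return None if the input was not valid
    # UTF-8 or contained something we cannot map back.
    try:
        return encode_cp1252_lossless(bytestream.decode("utf-8"))
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None

stream = b"\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D"
good = None
while True:
    try:
        stream.decode("utf-8")           # only continue while the stream is valid UTF-8
    except UnicodeDecodeError:
        break
    good = stream
    if all(b < 0x80 for b in stream):    # all-ASCII: nothing left to undo
        break
    fixed = demangle(stream)
    if fixed is None:
        break
    stream = fixed

print(good.decode("utf-8"))              # “four score”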
EDITED TO ADD:
Note that this technique does not help too much with your "bob's" example, unfortunately. I think that that string also went through two rounds of cp1252-to-utf8 conversion, but unfortunately there was also some corruption. Using the same technique as before, we first read the byte sequence as utf8 and look at the sequence of unicode character references we get:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields:
62 6f 62 ef bf bd 73
Now, it just so happens that for the three bytes ef bf bd, unicode and cp1252 agree. So representing this sequence of unicode code points in cp1252 is just:
62 6f 62 ef bf bd 73
That is, the same sequence of numbers. Now, this is in fact a valid utf-8 byte stream, but what it decodes to may surprise you:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
$a=encode("cp-1252", $a, sub { chr(shift) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
62 6f 62 fffd 73
That is, the utf-8 byte stream, though a legitimate utf-8 byte stream, encoded the character 0xFFFD, which is generally used for "untranslatable character". I suspect that what happened here is that the first *-to-utf8 transformation saw a character it didn't recognize and replaced it with "untranslatable". There's no way to then programmatically recover the original character.
A consequence is that you can't detect whether a stream of bytes is valid utf8 (needed for that algorithm I gave above) simply by doing a decode and then looking for 0xFFFD. Instead, you should use something like this:
sub is_valid_utf8 {
    defined(eval { decode("utf-8", $_[0], 1) })
}