My Perl app and MySQL database now handle incoming UTF-8 data properly, but I have to convert the pre-existing data. Some of the data appears to have been encoded as CP-1252 and not decoded as such before being encoded as UTF-8 and stored in MySQL. I've read the O'Reilly article Turning MySQL data in latin1 to utf8, but although it's frequently referenced, it's not a definitive solution.
I've looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.
This is what I've done so far:
#Return the $bytes from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));
That fixes most of the cases, but if run against properly-encoded records, it mangles them. I've tried using Encode::Guess and Encode::Detect, but they cannot distinguish between the properly encoded and the misencoded records. So I just undo the conversion if the \x{FFFD} character is found after the conversion.
Some records, though, are only partially converted. Here's an example where the left curly quotes are properly converted, but the right curly quotes get mangled.
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'
And here's an example where a right single quote did not convert:
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'
Am I also dealing with double encoded data here? What more must I do to convert these records?
With the "four score" example, it almost certainly is doubly-encoded data. It looks like either:
cp1252 data that was run through a cp1252 to utf8 process twice, or
utf8 data that was run through a cp1252 to utf8 process
(Naturally, both cases look identical)
Now, that's what you expected, so why didn't your code work?
First, I'd like to refer you to this table which shows the conversion from cp1252 to unicode. The important thing I want you to note is that there are some bytes (such as 0x9D) which are not valid in cp1252.
When I imagine writing a cp1252 to utf8 converter, therefore, I need to do something with those bytes that aren't in cp1252. The only sensible thing I can think of is to transform the unknown bytes into unicode characters at the same value. In fact, this appears to be what happened. Let's take your "four score" example back one step at a time.
First, since it is valid utf-8, let's decode with:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields this sequence of unicode code points:
e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d
("fmt" is a unix command that just reformats text so that we have nice line breaks with long data)
Now, let's represent each of these as a byte in cp1252, but when the unicode character can't be represented in cp1252, let's just replace it with a byte that has the same numeric value. (Instead of the default, which is to replace it with a question mark) We should then, if we're correct about what happened to the data, have a valid utf8 byte stream.
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
That third argument to encode - when it's a sub - tells what to do with unrepresentable characters.
This yields:
e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d
Now, this is a valid utf8 byte stream. Can't tell that by inspection? Well, let's ask perl to decode this byte stream as utf8:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
Passing "1" as the third argument to decode ensures that our code will croak if the byte stream is invalid. This yields:
201c 66 6f 75 72 20 73 63 6f 72 65 201d
Or printed:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
print "$a\n"'
“four score”
So I think that the full algorithm should be this:
Grab the byte stream from mysql. Assign this to $bytestream.
While $bytestream is a valid utf8 byte stream:
Assign the current value of $bytestream to $good
If $bytestream is all-ASCII (i.e., every byte is less than 0x80), break out of this "while ... valid utf8" loop.
Set $bytestream to the result of "demangle($bytestream)", where demangle is given below. This routine undoes the cp1252-to-utf8 converter we think this data has suffered from.
Put $good back in the database if it isn't undef. If $good was never assigned, assume $bytestream was a cp1252 byte stream and convert it to utf8. (Of course, optimize and don't do this if the loop in step 2 didn't change anything, etc.)
sub demangle {
    my($a) = shift;
    eval {  # the non-string form of eval just traps exceptions
            # so that we return undef on exception
        local $SIG{__WARN__} = sub {};  # no warning messages
        $a = decode("utf-8", $a, 1);
        encode("cp-1252", $a, sub { $_[0] <= 255 or die $_[0]; chr($_[0]) });
    }
}
This is based on the assumption that it's actually very rare for a string that isn't all-ASCII to be a valid utf-8 byte stream unless it really is utf-8. That is, it's not the sort of thing that happens accidentally.
EDITED TO ADD:
Note that this technique does not help too much with your "bob's" example, unfortunately. I think that string also went through two rounds of cp1252-to-utf8 conversion, but unfortunately there was also some corruption. Using the same technique as before, we first read the byte sequence as utf8 and look at the sequence of unicode code points we get:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields:
62 6f 62 ef bf bd 73
Now, it just so happens that for the three bytes ef bf bd, unicode and cp1252 agree. So representing this sequence of unicode code points in cp1252 is just:
62 6f 62 ef bf bd 73
That is, the same sequence of numbers. Now, this is in fact a valid utf-8 byte stream, but what it decodes to may surprise you:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
$a=encode("cp-1252", $a, sub { chr(shift) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
62 6f 62 fffd 73
That is, the utf-8 byte stream, though a legitimate utf-8 byte stream, encoded the character 0xFFFD, which is generally used for "untranslatable character". I suspect that what happened here is that the first *-to-utf8 transformation saw a character it didn't recognize and replaced it with "untranslatable". There's no way to then programmatically recover the original character.
A consequence is that you can't detect whether a stream of bytes is valid utf8 (needed for that algorithm I gave above) simply by doing a decode and then looking for 0xFFFD. Instead, you should use something like this:
sub is_valid_utf8 {
    defined(eval { decode("utf-8", $_[0], 1) })
}
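Putting the pieces together, here is a minimal sketch of the algorithm above, using demangle() and is_valid_utf8(). The name fix_bytes is mine, the sketch is untested against your data, and the database read and write are left out:
use Encode qw(decode encode);
sub fix_bytes {
    my ($bytestream) = @_;
    my $good;                                   # last known-good utf8 form
    while (is_valid_utf8($bytestream)) {
        $good = $bytestream;
        last if $bytestream !~ /[^\x00-\x7F]/;  # all-ASCII: nothing left to undo
        my $undone = demangle($bytestream);     # undo one cp1252-to-utf8 round
        last unless defined $undone;            # couldn't undo further; keep $good
        $bytestream = $undone;
    }
    # never valid utf8 at all: assume plain cp1252 and convert it
    $good = encode("utf-8", decode("cp-1252", $bytestream)) unless defined $good;
    return $good;                               # put this back in the database
}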
I'm filtering Facebook Messenger JSON dumps with jq. The source JSON contains emojis as Unicode sequences. How can I output these back as emojis?
echo '{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}' | jq -c '.'
Actual result:
{"content":"ð¤·ð¿ââï¸"}
Desired result:
{"content":"🤷🏿♂️"}
@chepner's use of Latin1 in Python finally shook loose in my head how to do this with jq almost directly. You'll need to pipe through iconv:
$ echo '{"content":"\u00f0\u..."}' | jq -c . | iconv -t latin1
{"content":"🤷🏿♂️"}
In JSON, the string \u00f0 does not mean "the byte 0xF0, as part of a UTF-8 encoded sequence." It means "Unicode code point 0x00F0." That's ð, and jq is displaying it correctly as the UTF-8 encoding 0xc3 0xb0.
The iconv call reinterprets the UTF-8 string for ð (0xc3 0xb0) back into Latin1 as 0xf0 (Latin1 exactly matches the first 256 Unicode code points). Your UTF-8 capable terminal then interprets that as the first byte of a UTF-8 sequence.
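For comparison, the same reinterpretation can be sketched in Perl (a hypothetical snippet of my own, not part of the answers here); $mojibake holds the characters jq printed, i.e. code points U+00F0 U+009F U+00A4 U+00B7:
use Encode qw(encode decode);
# Hypothetical sketch: take the characters jq printed, write them out as
# Latin1 bytes (F0 9F A4 B7), then read those bytes back as UTF-8.
my $mojibake = "\x{f0}\x{9f}\x{a4}\x{b7}";
my $emoji    = decode('UTF-8', encode('latin1', $mojibake));
binmode STDOUT, ':encoding(UTF-8)';
print $emoji, "\n";   # prints the shrug emoji, U+1F937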
The problem is that the response contains the UTF-8 encoding of the Unicode code points, not the code points themselves. jq cannot decode this itself. You could use another language; for example, in Python
>>> x = json.load(open("response.json"))['content']
>>> x
'ð\x9f¤·ð\x9f\x8f¿â\x80\x8dâ\x99\x82ï¸\x8f'
>>> x.encode('latin1').decode()
'🤷🏿\u200d♂️'
It's not exact, but I'm not sure the encoding is unambiguous. For example,
>>> x.encode('latin1')
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿♂️'.encode()
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿♂️'.encode().decode()
'🤷🏿\u200d♂️'
The result of re-encoding the response using Latin-1 is identical to encoding the desired emoji as UTF-8, but decoding doesn't give back precisely the same emoji (or at least, Python isn't rendering it identically).
Here's a jq-only solution. It works with both the C and Go implementations of jq.
# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then 0
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
       | reverse
       | .[1:]   # remove the leading 0
  end ;
def binary_to_decimal:
  reduce reverse[] as $b ({power:1, result:0};
    .result += .power * $b
    | .power *= 2)
  | .result;
# input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
  # Magic numbers:
  # x80: 128, # 10000000
  # xe0: 224, # 11100000
  # xf0: 240  # 11110000
  (-6) as $mb   # non-first bytes start 10 and carry 6 bits of data
                # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
  | map(binary_digits) as $d
  | .[0]
  | if . < 128 then $d[0]
    elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
    elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
    else [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
    end
  | binary_to_decimal ;
{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}
| .content|= (explode| [utf8_decode] | implode)
Transcript:
$ jq -nM -f program.jq
{
"content": "🤷"
}
First of all, you need a font which supports this.
You are confusing Unicode composed chars with UTF-8 encoding. It has to be either:
$ echo '{"content":"\u1F937\u200D\u2642"}' | jq -c '.'
or
$ echo '{"content":"\u1F937\u200D\u2642\uFE0F"}' | jq -c '.'
I have a text file that contains output from a program. It reads like this:
1 2
23 24
54 21
87 12
I need the output to be
arr[1]=2
arr[23]=24
arr[54]=21
arr[87]=12
and so on.
Each line is separated by a space. How can I parse the lines to the array format as described above, using Tcl? (I am doing this for NS2, by the way.)
With awk:
awk '{ print "arr[" $1 "]=" $2 }' filename
You have mentioned that each line is separated by a space, but gave the content separated by newlines. I assume each line is separated by a newline and that, within each line, the array index and its value are separated by a space.
If your text file contains only the text given below,
1 2
23 24
54 21
87 12
then you first read the whole file into a string.
set fp [open "input.txt" r]
set content [ read $fp ]
close $fp
Now, with array set we can easily convert them into an array.
# If your Tcl version is less than 8.5, use the following:
eval array set legacy {$content}
foreach index [array names legacy] {
    puts "array($index) = $legacy($index)"
}
# If you have Tcl 8.5 or later, use the following:
array set latest [list {*}$content]
foreach index [array names latest] {
    puts "array($index) = $latest($index)"
}
If your file has some other content along with these input lines, then you can extract the relevant parts with regexp and add elements to the array one by one in the classical way.
You can use this in BASH:
declare -A arr
while read -r k v ; do
    arr[$k]=$v
done < file
Testing:
declare -p arr
declare -A arr='([23]="24" [54]="21" [87]="12" [1]="2" )'
I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;
BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}
foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass it strings you've decoded. I'm not sure what the point of it is, but it's not the right tool. If you want to check if a string is valid UTF-8, you could use
ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
'$string is valid UTF-8');
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
+--------------------- Start of two byte sequence
| +---------------- Not zero (good)
| | +---------- Continuation byte indicator (good)
| | |
v v v
C2 AB = [110]00010 [10]101011
00010 101011 = 000 1010 1011 = U+00AB = «
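You can confirm that by-hand decoding with Encode (a quick snippet of my own, not from the original answer):
use Encode qw(decode);
# Decode the two bytes C2 AB as UTF-8 and show the resulting code point.
printf "U+%04X\n", ord decode("UTF-8", "\xC2\xAB");   # prints U+00AB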
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;
foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
You have a miscoded character string that had been decoded from a UTF-8 byte string as if it were Latin-1 or from double encoded UTF-8, or the character string is perfectly fine and looks like a potentially "dodgy" miscoding (using the terminology from its docs).
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
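Here is one way to apply that advice (a sketch of my own, not code from the question): leave the JSON::XS utf8 option off so encode() returns a character string, and let the :encoding(UTF-8) layer do the single encoding step on output.
use utf8;
use JSON::XS;
# Sketch: with ->utf8 disabled (the default for the OO interface, unlike
# encode_json), encode() returns a character string.
my $coder      = JSON::XS->new;
my $json_chars = $coder->encode({ value => 'Deliver «French Bread»' });
binmode STDOUT, ':encoding(UTF-8)';
print $json_chars, "\n";   # encoded exactly once by the PerlIO layer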
I have this Tcl8.5 code:
set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}
set subject {
[ 5] 0.0- 1.0 sec 680 KBytes 5.57 Mbits/sec
[ 5] 0.0-150.0 sec 153 MBytes 8.56 Mbits/sec
[ 4] 0.0- 1.0 sec 0.00 Bytes 0.00 bits/sec
[ 4] 0.0-150.4 sec 38.6 MBytes 2.15 Mbits/sec
}
set matches [regexp -line -inline -all -- $regexp_str $subject]
$matches populates with the matched data on one machine, while the other simply gets an empty list.
Both machines have Tcl8.5.
Using the -about flag of regexp, the following list is returned: 3 {REG_UUNPORT REG_ULOCALE}
I don't understand how could this be possible and what else should I do to debug it?
Edit #1, 17 Feb 07:00 UTC:
@Donal Fellows:
The patch level on the "good" machine is 8.5.15.
The patch level on the "bad" machine is 8.5.10.
I'm familiar with \s and \d, but as far as I know (please correct me), they both match a broader character range than I need:
\s includes newlines, which in my example mustn't exist.
\d includes Unicode digits, which I will not encounter in my example.
In regexps I generally prefer to be as specific as possible, to avoid cases I didn't think of.
There's something which I didn't specify and could be important:
The variable $subject is populated using the expect_out(buffer) variable, following a grep command executed in shell.
expect_out(buffer) returns the output from an ssh session that is tunneled through a proxy called netcat (binary name nc):
spawn ssh -o "ProxyCommand nc %h %p" "$username@$ipAddress"
In general, the output received & sent on this session is only ASCII/English characters.
The prompt of the destination PC contains control characters like ESC and BEL and they are contained in $subject.
I don't think it's a problem, because I tested the regular expression with all of these characters and it worked OK.
Thank you guys for the elaborated info!
Edit #2, 17 Feb 11:05 UTC:
Response to @Donal Fellows:
Indeed I've tried:
set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}
puts [regexp -line -inline -all -- $regexp_str [string map {\r\n \n \r \n} $subject]]
and got (please ignore the different numbers in the output, the idea is the same):
{[ 5] 0.0-150.0 sec 86.7 MBytes 4.85 Mbits/sec} 150.0 4.85 M {[ 4] 0.0-150.8 sec 60.4 MBytes 3.36 Mbits/sec} 150.8 3.36 M
I also tried replacing the [[:blank:]] on both sides of the regexp string with \s:
set regexp_str {^\s*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec\s*$}
puts [regexp -line -inline -all -- $regexp_str $subject]
and it finally found what I needed:
{[ 5] 0.0-150.0 sec 86.7 MBytes 4.85 Mbits/sec
} 150.0 4.85 M {[ 4] 0.0-150.8 sec 60.4 MBytes 3.36 Mbits/sec
} 150.8 3.36 M
Tcl uses the same regular expression engine on all platforms. (But double-check whether you've got the same patch level on the two machines; that will let us examine what exact code changes, if any, there might be between the systems.) It also shouldn't be anything related to newline terminators; Tcl automatically normalizes them under anything even remotely resembling normal circumstances (and in particular, does so in scripts).
With respect to the -about flags, only the 3 is useful (it's the number of capture groups). The other item in the list is the set of state flags set about the RE by the RE compiler, and frankly they're only useful to real RE experts (and our test suite). I've never found a use for them!
You can probably shorten your RE by using \s (mnemonically “spaces”) instead of that cumbersome [[:blank:]] and \d (“digits”) instead of [0-9]. When I do that, I get something quite a lot shorter and so easier to understand.
set regexp_str {^\s*\[\s*\d+\]\s+0\.0-(\d+\.\d+) sec.+(\d+\.\d+) ([MK]?)bits/sec\s*$}
It produces the same match groups.
[EDIT]: Even with the exact version of the code you report, checked out directly from the source code repository tag that was used to drive the 8.5.10 distribution, I can't reproduce your problem. However, the fact that it's really coming from an Expect buffer is really helpful; the problem may well actually be that the line separation sequence is not a newline but rather something else (CRLF — \r\n — is the number 1 suspect, but a plain carriage return could also be there). Expect is definitely not the same as normal I/O for various reasons (in particular, exact byte sequences are often needed in terminal handling).
The easiest thing might be to manually standardize the line separators before feeding the string into regexp. (This won't affect the string in the buffer; it copies, as usual with Tcl.)
regexp -line -inline -all -- $regexp_str [string map {\r\n \n \r \n} $subject]
It's also possible that there are other, invisible characters in the output. Working out what is really going on can be complex, but in general you can use a regular expression to test this theory by looking to see if the inverse of the set of expected characters is matchable:
regexp {[^\n [:graph:]]} $subject
When I try with what you pasted, that doesn't match (good!). If it does against your real buffer, it gives you a way to hunt the problem.
I see that you are missing optional space(s) right after the first dash. I inserted those optional spaces and everything works:
set regexp_str {^[[:blank:]]*\[[[:blank:]]*[0-9]+\][[:blank:]]+0\.0\-[[:blank:]]*([0-9]+\.[0-9]+) sec.+([0-9]+\.[0-9]+) ([MK]?)bits/sec[[:blank:]]*$}
# missing --> ^^^^^^^^^^^^
set subject {
[ 5] 0.0- 1.0 sec 680 KBytes 5.57 Mbits/sec
[ 5] 0.0-150.0 sec 153 MBytes 8.56 Mbits/sec
[ 4] 0.0- 1.0 sec 0.00 Bytes 0.00 bits/sec
[ 4] 0.0-150.4 sec 38.6 MBytes 2.15 Mbits/sec
}
set matches [regexp -line -inline -all -- $regexp_str $subject]
puts "\n\n"
foreach {all a b c} $matches {
puts "- All: >$all<"
puts " >$a<"
puts " >$b<"
puts " >$c<"
}
Output
- All: > [ 5] 0.0- 1.0 sec 680 KBytes 5.57 Mbits/sec<
>1.0<
>5.57<
>M<
- All: > [ 5] 0.0-150.0 sec 153 MBytes 8.56 Mbits/sec<
>150.0<
>8.56<
>M<
- All: > [ 4] 0.0- 1.0 sec 0.00 Bytes 0.00 bits/sec<
>1.0<
>0.00<
><
- All: > [ 4] 0.0-150.4 sec 38.6 MBytes 2.15 Mbits/sec<
>150.4<
>2.15<
>M<
Update
When dealing with complex regular expressions, I often break up the expression into several lines and add comments. The following is equivalent to my previous code, but more verbose and easier to troubleshoot. The key is to use an additional flag to the regexp command: the -expanded flag, which tells regexp to ignore any whitespace and comments in the expression.
set regexp_str {
    # Initial blank
    ^[[:blank:]]*
    # Bracket, number, optional spaces, bracket
    \[[[:blank:]]*[0-9]+\]
    # Spaces
    [[:blank:]]+
    # Number, dash, number
    0\.0\-[[:blank:]]*([0-9]+\.[0-9]+)
    # Unwanted stuff
    [[:blank:]]sec.+
    # Final number, plus unit
    ([0-9]+\.[0-9]+)[[:blank:]]([MK]?)bits/sec
    # Trailing spaces
    [[:blank:]]*$
}
set subject {
[ 5] 0.0- 1.0 sec 680 KBytes 5.57 Mbits/sec
[ 5] 0.0-150.0 sec 153 MBytes 8.56 Mbits/sec
[ 4] 0.0- 1.0 sec 0.00 Bytes 0.00 bits/sec
[ 4] 0.0-150.4 sec 38.6 MBytes 2.15 Mbits/sec
}
set matches [regexp -expanded -line -inline -all -- $regexp_str $subject]
puts "\n\n"
foreach {all a b c} $matches {
puts "- All: >$all<"
puts " >$a<"
puts " >$b<"
puts " >$c<"
}
(ETA: the question is about regular expressions, so why am I talking about massaging a string into a list and picking items out of that? See the end of this answer.)
As a workaround, if you don't really need to use a regular expression, this code gives the exact same result:
set result [list]
foreach line [split [string trim $subject] \n] {
    set list [string map {- { } / { }} $line]
    lappend result \
        $line \
        [lindex $list 3] \
        [lindex $list 7] \
        [string map {Mbits M Kbits K bits {}} [lindex $list 8]]
}
The lines aren't strictly well-formed lists because of the brackets, but it does work.
To clarify:
the string trim command takes out the newlines before and after the data: they would otherwise yield extra empty elements
the split command creates a list of four elements, each corresponding to a line of data
the foreach command processes each of those elements
the string map command changes each - or / character into a space, essentially making it a (part of a) list item separator
the lappend incrementally builds the result list out of four items per line of data: the items are the whole line, the fourth item in the corresponding list, the eighth item in the corresponding list, and the ninth item in the corresponding list after the string map command has shortened the strings Mbits, Kbits, and bits to M, K, and the empty string, respectively.
The thing is (moderate rant warning): regular expression matching isn't the only tool in the string analysis toolbox, even though it sometimes looks that way. Tcl itself is, among other things, a powerful string and list manipulation language, and usually far more readable than RE. There is also, for instance, scan: the scan expression "[ %*d] %*f- %f sec %*f %*s %f %s" captures the relevant fields out of the data strings (provided they are split into lines and processed separately) -- all that remains is to look at the last captured string to see if it begins with M, K, or something else (which would be b). This code gives the same result as my solution above and as your example:
set result [list]
foreach line [split [string trim $subject] \n] {
    scan $line "\[ %*d\] %*f- %f sec %*f %*s %f %s" a b c
    lappend result $line $a $b [string map {its/sec {} Mb M Kb K b {}} $c]
}
Regular expressions are very useful, but they are also hard to get right and to debug when they aren't quite right, and even when you've got them right they're still hard to read and, in the long run, to maintain. Since in very many cases they are actually overkill, it makes sense to at least consider if other tools can't do the job instead.
I have two Json files which come from different OSes.
Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X and the filename is in NFD form: (od -bc)
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
the second contains the same filename but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
I'm reading both files with the following:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
read_file is from File::Slurp and decode_json is from JSON::XS.
After reading the JSON into a Perl structure, the filename ends up as a hash key in the structure from one file and as a value in the structure from the other. I need to find when a hash key from the 1st hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.
I tried the following:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
Both produce the same output for me.
Now the questions:
How can I simply read both JSON files so that the same normalization ends up in both $hashrefs?
Or, after decode_json, do I need to run something like this on both hashes?
while (my ($k, $v) = each(%$json1)) {
    $copy->{ NFD($k) } = NFD($v);
}
In short:
How can I read different JSON files so that I get the same normalization 'inside' the Perl $href? Is it possible to achieve this more nicely than explicitly doing NFD on every key and value and creating another NFD-normalized (big) copy of the hashes?
Some hints, suggestions - please...
Because my English is very bad, here is a simulation of the problem:
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
# Create two files that contain different "normalizations"
my ($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';
#save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;
#save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
say $jd->{ $jc->{key} } // "NO FOUND"; #wanted to print "something"
my $jc2;
#is here a better way to DO THIS?
while (my ($k, $v) = each(%$jc)) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND"; #OK
While searching for the right solution to your question, I discovered: the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407 .
Anyway, I found the solution for your particular question - how to read JSON with filenames regardless of normalization:
instead of your:
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
use the next:
#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...
sub get_json_from_utf8_file {
    my $file = shift;
    return
        decode_json        # let it parse the JSON into a Perl structure
        encode 'utf8',     # decode_json wants a UTF-8 encoded binary string, so encode it
        NFC                # convert to the precomposed normalization, regardless of the source
        read_file          # your file contains UTF-8 encoded text, so read it correctly
            $file, { binmode => ':utf8' };
}
This should (at least I hope) ensure that regardless of what decomposition the JSON content uses, NFC will convert it to the precomposed version and JSON::XS will parse it correctly into the same internal Perl structure.
So your example prints:
something
without traversing the $json
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.
Even though it may be important right now only to convert a few file names to the same normalization for comparison, other unexpected problems could arise from almost anywhere if JSON data has a different normalization.
So my suggestion is to normalize the entire input from both sources as your first step before doing any parsing (i.e., at the same time you read the file and before decode_json). This should not corrupt any of your JSON structures since those are delimited using ASCII characters. Then your existing perl code should be able to blindly assume all UTF8 characters have the same normalization.
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
my $json1 = decode_json NFD($rawdata1);
my $json2 = decode_json NFD($rawdata2);
To make this process slightly faster (it should be plenty fast already, since the module uses fast XS procedures), you can find out whether one of the two data files is already in a certain normalization form, and then leave that file unchanged, and convert the other file into that form.
For example:
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
if (checkNFD($rawdata1)) {
    # then you know $file1 is already in Normalization Form D
    # (i.e., it was formed by canonical decomposition),
    # so you only need to convert $file2 into NFD
    $rawdata2 = NFD($rawdata2);
}
my $json1 = decode_json $rawdata1;
my $json2 = decode_json $rawdata2;
Of course, you would have to experiment at development time to see whether one or the other of the input files is already in a normalized form; then, in your final version of the code, you would no longer need a conditional statement, but would simply convert the other input file into the same normalized form.
Also note that it is suggested to produce output in NFC form (if your program produces any output that would be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html
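A minimal sketch of that suggestion (write_nfc_json is a made-up helper name, not from the linked article): normalize to NFC and encode just before writing.
use Unicode::Normalize qw(NFC);
use Encode qw(encode);
# Pass this helper the character string you want to store and a file path.
sub write_nfc_json {
    my ($json_text, $path) = @_;
    open my $out, '>:raw', $path or die "$path: $!";
    print {$out} encode('UTF-8', NFC($json_text));   # NFC, then one UTF-8 encode
    close $out;
}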
Hm. I can't advise a better "programming" solution, but why not simply run
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json >bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json >osxok.json
and now your script can read and use both, because they are both in the same normalization? So instead of searching for some programming solution inside your script, solve the problem before the data enters the script. (The second command may even be unnecessary - it is a simple conversion at the file level, and surely easier than traversing data structures...)
Instead of traversing the data structure manually, let a module handle this for you.
Data::Visitor
Data::Rmap
Data::Dmap