Keys of Dict are not encoded when I read from txt - Julia - json

I read a .txt file (it contains a Dict), but the keys of the Dict come out wrong. In the original file the names are right (e.g. the file has "P. Cárdenas" but I get "P. C\xe1rdenas").
julia> f = open("dict.txt", "r")
julia> dict_maestro = JSON.parse(f)
Dict{String,Any} with 5 entries:
  "P. C\xe1rdenas" => Dict{String,Any}("dist_tm"=>Any[Any[0.248, 0.074, 0.…
  "S. L\xf3pez"    => Dict{String,Any}("dist_tm"=>Any[Any[0.096, 0.082, 0.…
  "S. Cabrera"     => Dict{String,Any}("dist_tm"=>Any[Any[0.341, 0.094, 0.…
  "C. Mu\xf1oz"    => Dict{String,Any}("dist_tm"=>Any[Any[0.246, 0.073, 0.…
  "R. Bugue\xf1o"  => Dict{String,Any}("dist_tm"=>Any[Any[0.261, 0.068, 0.…
How can I get the right names?

If I am not mistaken, you are reading the file as bytes, not as UTF-encoded strings. According to the answer to the linked duplicate question, you should first convert the contents of the file to appropriately encoded strings and then parse them as JSON. This would go roughly the following way:
s = open("dict.txt", "r") do f
    utf16(readbytes(f))
end
dict_maestro = JSON.parse(s)
You can use utf8 instead of utf16 if this is the encoding you have in your file.
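As an aside the answer above does not cover (a quick Python illustration, since the byte-versus-text distinction is language-independent): \xe1 is the single byte 0xE1, which is "á" in Latin-1/Windows-1252, so decoding the raw bytes with the right codec recovers the intended name.
# Hypothetical Python illustration of the byte-vs-text distinction
raw = b"P. C\xe1rdenas"               # the raw bytes as they sit on disk
print(raw.decode("latin-1"))          # -> P. Cárdenas, if the file is Latin-1/Windows-1252
print("P. Cárdenas".encode("utf-8"))  # -> b'P. C\xc3\xa1rdenas', what a UTF-8 file would contain instead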

Related

Perl issue when encoding mysql data from UTF-8 to UCS-2 for SMPP

I am trying to fetch the UTF-8 accented characters "é" and "ê" from MySQL and convert them to UCS-2 when sending over SMPP. The data is stored as utf8_general_ci and I perform the following when opening the DB connection:
$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");
If I test the sending part by hard-coding the string value with "é" "ê" and using data_coding=8, it goes through perfectly. However, if I comment out the first line of the snippet below and just use what comes from the DB, it fails. Also, if I send the characters from the DB with data_coding=3, it works fine, but then the "ê" does not appear, which is also expected. Here is what I use:
$fred = 'éêcole'; # <-- If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);
$resp_pdu = $short_smpp->submit_sm(
    source_addr_ton  => 0x00,
    source_addr_npi  => 0x01,
    source_addr      => $didnb,
    dest_addr_ton    => 0x01,
    dest_addr_npi    => 0x01,
    destination_addr => $number,
    data_coding      => 0x08,
    short_message    => $fred
) or do {
    Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
    $success = 0;
};
The different values for the data_coding field are the following (from "Meaning of data_coding field in SMPP"):
00000000 (0) - usually GSM7
00000011 (3) - standard ISO-8859-1
00001000 (8) - the universal character set, in practice UTF-16
The SMPP provider's documentation also mentions that special characters should be handled via UCS-2:
https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137
How should I prepare the data that is coming out of the DB to make this SMPP call work?
I am using Perl v5.10.1
Thanks!
$dbh->{'mysql_enable_utf8'} = 1; is used to decode the values returned from the database, so queries return decoded text (strings of Unicode code points). It makes no sense to decode such a string again. Go straight to the encode.
my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65"; # éêcole
# -or-
use utf8; # Script is encoded using UTF-8.
my $s_ucp = "éêcole";
printf "%vX\n", $s_ucp; # E9.EA.63.6F.6C.65
my $s_ucs2be = encode('UCS-2', $s_ucp);
printf "%vX\n", $s_ucs2be; # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
SET NAMES specifies the encoding you have/want on the client side. That is, regardless of the encoding in the table, MySQL will convert the data to whatever SET NAMES says during a SELECT. So after
SET NAMES ucs2
you can feed what comes from the SELECT directly to SMPP. (It won't be readable by most other clients.)
(The collation is irrelevant to the encoding.)
You could ask the SELECT to convert with something like
CONVERT(col_name, CHAR UNICODE)
https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html
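For what it's worth, the decode/encode distinction above is easy to check outside Perl as well; here is a rough Python sketch of the same encode step (an illustration only, not part of the original answer; the byte values assume the text "éêcole"):
# Start from already-decoded text (code points U+00E9, U+00EA, 0x63, ...),
# then encode once to UCS-2 / UTF-16BE for data_coding = 8.
s = "éêcole"
ucs2be = s.encode("utf-16-be")
print(ucs2be.hex(" "))  # -> 00 e9 00 ea 00 63 00 6f 00 6c 00 65 (separator argument needs Python 3.8+)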

How to add columns to a CSV file with Perl?

I have a CSV file that contains the following data. Now I want to add some extra columns at the end. Each extra column has a different value, but the value is the same in every row. How can we do this with Perl?
Original csv file
Item Price Number
A 11 2
B 10 3
C 20 1
...(many lines)
...
I want to add retailer and buyer information at the end.
Output csv file
Item Price Number Retailer Buyer
A 11 2 Mike Tom
B 10 3 Mike Tom
C 20 1 Mike Tom
...(many lines)
...
Each new column has only one value, but it extends to the last row of the CSV file.
(The column titles and values can be hardcoded; that's not the point.)
Use Text::CSV.
use strict;
use warnings;
use Text::CSV 'csv';

my $rows = csv(in => *STDIN, encoding => 'UTF-8', auto_diag => 2,
    sep_char => "\t", keep_headers => \my @headers);

push @headers, 'Retailer', 'Buyer';
foreach my $row (@$rows) {
    $row->{Retailer} = 'Mike';
    $row->{Buyer}    = 'Tom';
}

csv(in => $rows, out => *STDOUT, encoding => 'UTF-8', auto_diag => 2,
    sep_char => "\t", headers => \@headers);
This assumes your CSV file is actually tab-separated, takes its input on STDIN, and prints to STDOUT.
Are they CSV files? I can't see any commas in your sample data. Perhaps they are tab-separated instead.
In any case, you don't need Text::CSV for this as you're just adding data to each line.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

while (<>) {
    chomp;
    if ($. == 1) { # first line (header)
        say "$_\tRetailer\tBuyer";
    } else {
        say "$_\tMike\tTom";
    }
}
(If you really do have comma-separated files, then replace the "\t"s in my code with ",".)
This reads from STDIN and writes to STDOUT so (assuming it's in a file called append_cols) you would call it like this:
./append_cols < your_old_file.csv > some_new_file.csv

Saving json file by dumping dictionary in a for loop, leading to malformed json

So I have the following dictionaries that I get by parsing a text file
keys = ["scientific name", "common names", "colors"]
values = ["somename1", ["name11", "name12"], ["color11", "color12"]]
keys = ["scientific name", "common names", "colors"]
values = ["somename2", ["name21", "name22"], ["color21", "color22"]]
and so on. I build a dictionary from each keys/values pair and dump it to a JSON file in a for loop, going through the pairs one by one:
for keys, values in ...:  # the for loop over the parsed pairs
    d = dict(zip(keys, values))
    with open("file.json", 'a') as j:
        json.dump(d, j)
If I open the saved json file I see the contents as
{"scientific name": "somename1", "common names": ["name11", "name12"], "colors": ["color11", "color12"]}{"scientific name": "somename2", "common names": ["name21", "name22"], "colors": ["color21", "color22"]}
Is this the right way to do it?
The purpose is to query the common name or colors for a given scientific name. So then I do
with open("file.json", "r") as j:
data = json.load(j)
I get the error, json.decoder.JSONDecodeError: Extra data:
I think this is because I am not dumping the dictionaries to JSON correctly in the for loop. I would have to insert some square brackets programmatically; just doing json.dump(d, j) won't suffice.
JSON may only have one root element. This root element can be [], {} or most other datatypes.
In your file, however, you get multiple root elements:
{...}{...}
This isn't valid JSON, and the error Extra data refers to the second {}, where valid JSON would end instead.
You can write multiple dicts to a JSON string, but you need to wrap them in an array:
[{...},{...}]
Now, on to how I would fix your code. First, I rewrote what you posted, because your code was rather pseudo-code and didn't run directly.
import json

inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]

for keys, values in inputs:
    d = dict(zip(keys, values))
    with open("file.json", 'a') as j:
        json.dump(d, j)

with open("file.json", 'r') as j:
    print(json.load(j))
As you correctly realized, this code fails with
json.decoder.JSONDecodeError: Extra data: line 1 column 105 (char 104)
The way I would write it is:
import json

inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]

jsonData = list()
for keys, values in inputs:
    d = dict(zip(keys, values))
    jsonData.append(d)

with open("file.json", 'w') as j:
    json.dump(jsonData, j)

with open("file.json", 'r') as j:
    print(json.load(j))
Also, for Python's json library, it is important that you write the entire JSON file in one go, which means opening with 'w' instead of 'a'.
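As a side note that goes beyond the original answer: if you really do need to append one record per loop iteration, a common workaround is newline-delimited JSON, where each line is its own JSON document and the file is read back line by line instead of with a single json.load (the file name file.jsonl below is just a placeholder):
import json

# Append one JSON object per line; the file as a whole is not a single
# JSON document, so json.load on the whole file would still fail.
with open("file.jsonl", "a") as j:
    json.dump({"scientific name": "somename1"}, j)
    j.write("\n")

with open("file.jsonl", "r") as j:
    records = [json.loads(line) for line in j if line.strip()]
print(records)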

Read a file in R with mixed character encodings

I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.
So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:
with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')
There do seem to be R library functions for guessing character encodings, like stringi::stri_enc_detect, but when possible, it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. It looks like the best way to do this is to take advantage of the fact that when iconv fails to convert a string, it returns NA.
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = F, function(line) {
        if (validUTF8(line))
            return(line)
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })
If you create a test file with
$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'
then linewise.decode("inptest") indeed returns
[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).

JSON no longer parsable after being sent through TCP in Ruby

I've created a basic client and server that pass a string, which I've changed to JSON instead. But the JSON string is only parsable before it gets sent through TCP. After it's sent, the string version is identical (after a chomp), but on the server side it no longer processes the JSON correctly. Here is some of my code (with other bits trimmed)
Some of the client code
require 'json'
require 'socket'

foo = {'a' => 1, 'b' => 2, 'c' => 3}
puts foo.to_s + "......."

json = foo.to_json
puts foo['b'] # => outputs the correct '2' answer

client = TCPSocket.open('localhost', 2000)
client.puts json
client.close
Some of the server code
require 'socket'
require 'json'
server = TCPServer.open(2000)

while true
  client = server.accept # accept a client connection
  response = client.gets
  print response

  response = response.chomp
  response.to_json
  puts response['b'] # => outputs 'b'
end
The output 'b' should be '2' instead. How do I fix this?
Thanks
In your server you wrote response.to_json. This converts the string to its JSON representation and then throws the result away. And I don't like the .chomp either.
Try
response = client.gets
hash = JSON.parse(response)
Now hash is a Ruby Hash object with your data in it, and hash['b'] should work correctly.
The problem is that .to_json does not parse the JSON inside a string and replace the string with the result. It is used to convert the string into a form that is acceptable as a JSON value.
require 'json'
string = "abc"
puts string
puts string.to_json
This will output:
abc
"abc"
The method is added to the String class by the JSON generator, which uses it internally when generating a JSON document.
But why does your response['b'] return "b"?
Because Ruby strings have a method called [] that can be used to:
Return a substring: "abc"[0,2] => "ab"
Return a single character from index: "abc"[1] => "b"
Return a substring if the string contains it: "abc"["bc"] => "bc", "abc"["fg"] => nil
Return a regexp match: "abc"[/^a([a-z])c/, 1] => "b"
and possibly some other ways I can't think of right now.
So this happens because your response is a string that has the character "b" in it:
response = "something with a b"
puts response["b"]
# outputs b
puts response["x"]
# outputs a blank line because response does not contain "x".
Instead of .to_json your code has to call JSON.parse or JSON.load:
data = JSON.parse(response)
puts data['b']