What encoding does Facebook use in JSON files from data export?

I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that Unicode characters in the strings of these JSON files are escaped in a strange way.
Here's an example of such a string:
"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
When I try to parse the string, for example with JavaScript's JSON.parse(), and print it out, I get:
"nejniÅ¾Å¡Ã bod: 0 mnm BenÃ¡tky\n"
But it should be:
"nejnižší bod: 0 mnm Benátky\n"
I can see that \u00c5\u00be should somehow correspond to ž, but I can't figure out the general pattern.
So far I've been able to work out these characters:
'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',
So what is this weird encoding? Is there any known tool that can correctly decode it?

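The escaped values are valid UTF-8, just written one byte per \u00XX escape. JSON.parse() turns every escape into its own character (JavaScript strings are sequences of UTF-16 code units), so each character of the parsed string holds one byte of the UTF-8 text. To recover the text, collect those character codes into a byte array and decode it as UTF-8:
function decode(s) {
  // TextDecoder decodes UTF-8 by default.
  let d = new TextDecoder();
  // Each character of s holds one UTF-8 byte, so its char code is the byte value.
  let a = s.split('').map(r => r.charCodeAt(0));
  return d.decode(new Uint8Array(a));
}
let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);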
https://developer.mozilla.org/docs/Web/API/TextDecoder

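You can use a regular expression to find runs of characters that look like UTF-8 byte sequences, encode them as Latin-1 to get the raw bytes back, and then decode those bytes as UTF-8.
The following code should work in Python 3.x:
import re
# s is an already-parsed string in which each character holds one UTF-8 byte
s = 'nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n'
fixed = re.sub(r'[\xc2-\xf4][\x80-\xbf]+', lambda m: m.group(0).encode('latin1').decode('utf8'), s)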
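The same Latin-1/UTF-8 round trip can be applied to a whole parsed export rather than to one string. What follows is only a minimal sketch: the function name `fix_mojibake` and the `title` key are made up for illustration, and it assumes that any string which does not survive the round trip was not damaged and should be left alone.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        try:
            # Each character is really one byte (U+0000..U+00FF): turn the
            # characters back into bytes, then decode those bytes as UTF-8.
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Assumption: strings that fail the round trip are already fine.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged


# Raw JSON text as it appears in the export (the backslashes are literal).
raw = r'{"title": "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"}'
data = fix_mojibake(json.loads(raw))
print(data["title"])
```

Repairing after `json.loads` rather than patching the raw file text means the JSON parser deals with all escaping, and only the decoded strings are touched.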

The JSON file itself is UTF-8, but each string in it was first encoded to UTF-8 bytes, and then every one of those bytes was written out as a separate \u00XX escape, as if it were a Unicode code point.
This command fixes a file like this in Emacs (it uses the --> macro from dash.el):
(require 'dash) ; provides the --> threading macro

(defun k/format-facebook-backup ()
  "Normalize a Facebook backup JSON file."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((inhibit-read-only t)
          (size (point-max))
          bounds str)
      (while (search-forward "\"\\u" nil t)
        (message "%.f%%" (* 100 (/ (point) size 1.0)))
        (setq bounds (bounds-of-thing-at-point 'string))
        (when bounds
          ;; Parse the JSON string, treat each character as one byte,
          ;; and decode the resulting byte string as UTF-8.
          (setq str (--> (json-parse-string (buffer-substring (car bounds)
                                                              (cdr bounds)))
                         (string-to-list it)
                         (apply #'unibyte-string it)
                         (decode-coding-string it 'utf-8)))
          (setf (buffer-substring (car bounds) (cdr bounds))
                (json-serialize str))))))
  (save-buffer))

Thanks to Jen's excellent question and Shawn's comment.
Basically, Facebook seems to take each individual byte of the string's UTF-8 representation and export it to JSON as if it were an individual Unicode code point.
What we need to do is take the last two hex digits of each six-character escape (e.g. c3 from \u00c3), concatenate them into a byte sequence, and read that as a UTF-8 string.
This is how I do it in Ruby:
require 'json'
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/
txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
  $1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end
good_data = JSON.load(txt)
bytes_re catches every sequence of bad Unicode escapes.
Each sequence then has '\u00' replaced with '\x' (e.g. \xc3) and is wrapped in double quotes, so that Ruby's own string parsing turns \xc3\xbe... into actual bytes. Those bytes then either stay in the JSON as Unicode characters or are properly escaped by the #to_json method.
The [1...-1] removes the quotes inserted by #to_json.
I wanted to explain the code because the question is not Ruby-specific and the reader may be using another language.
I guess somebody could do it with a sufficiently ugly sed command.

Just adding the general rule for getting from something like '\u00c5\u0098' to 'Ř'. Putting together the last two hex digits of each \u escape gives c5 and 98, which are the two bytes of the UTF-8 representation. UTF-8 encodes such a code point in two bytes like this: 110xxxxx 10xxxxxx, where the x's are the actual bits of the code point. Mask the two bytes with & to extract the x bits, concatenate them, and read the result as a number: you get 0x158, the code point of 'Ř'. Note that this rule covers only two-byte sequences (U+0080 to U+07FF); other characters use one, three or four bytes.
My JavaScript implementation:
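function fixEncoding(s) {
  // Works on the raw JSON text: matches two consecutive \u00XX escapes,
  // i.e. one two-byte UTF-8 sequence.
  var reg = /\\u00([a-f0-9]{2})\\u00([a-f0-9]{2})/gi;
  return s.replace(reg, function(a, m1, m2) {
    var b1 = parseInt(m1, 16);
    var b2 = parseInt(m2, 16);
    var maskedb1 = b1 & 0x1F; // low 5 bits of 110xxxxx
    var maskedb2 = b2 & 0x3F; // low 6 bits of 10xxxxxx
    var result = (maskedb1 << 6) | maskedb2;
    return String.fromCharCode(result);
  });
}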
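The masking arithmetic can be checked by hand against a real UTF-8 decoder. The sketch below is illustrative only; `decode_two_byte` is a made-up helper name, and it deliberately handles nothing but the two-byte form described above.

```python
def decode_two_byte(b1: int, b2: int) -> str:
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand."""
    if b1 & 0xE0 != 0xC0 or b2 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the first byte and the 6 of the second.
    code_point = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
    return chr(code_point)


# 0xC5 = 11000101 -> payload 00101, 0x98 = 10011000 -> payload 011000
print(hex(((0xC5 & 0x1F) << 6) | (0x98 & 0x3F)))  # 0x158
print(decode_two_byte(0xC5, 0x98))                # Ř

# The built-in decoder agrees with the manual computation.
assert decode_two_byte(0xC5, 0x98) == bytes([0xC5, 0x98]).decode("utf-8")
```

For real data, handing the whole byte sequence to a UTF-8 decoder is the safer choice, since it also covers three-byte and four-byte sequences.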

Related

reading .csv file with decimals separated by a comma with CSV.jl

I am trying to read some data into a Julia data frame to work with it. A minimal example of the .csv file could look like this:
A; B; C; D
ab; 1,23; 4; 9,2
ab; 3,4; 7; 1,1
ba; 6; 2,3; 8,6
I load the following two packages and read the data:
using DataFrames
using CSV
d = CSV.read( "test.csv", delim=";")
Julia recognizes the following types:
eltypes(d)
CategoricalArrays.CategoricalString{UInt32}
String
String
String
How could I now turn whole columns to floats with the comma replaced by a dot? My first idea was to use:
float(d[1,2])
But I did not find an option to tell Julia to replace the comma with a dot.
My next idea was to first replace the comma and then convert it:
float(replace(d[1,2], ",", "."))
That works fine on a single cell but not on a whole column:
float(replace(d[:,2], ",", "."))
MethodError: no method matching
replace(::WeakRefStrings.WeakRefStringArray{WeakRefString{UInt8},1,Union{}},
::String, ::String)
I also tried:
d = CSV.read( "test.csv", delim=";", decimal=",")
which also just gives an error ...
Any ideas on how to handle this problem and how to read the data into Julia efficiently?
Thanks a lot!
Best regards.

One straightforward way is to read the file to string, replace the comma decimal separators by dots and then create the DataFrame from it:
s = replace(readstring("test.csv"), ",", ".")
CSV.read(IOBuffer(s); delim=';', types=[String, Float64, Float64, Float64])
Note that you can use the types keyword to specify the column types (it will then implicitly parse the string entries).
EDIT: According to this GitHub issue, CSV.jl's read method supports a decimal keyword (from version v0.2.0 on), which allows you to do
CSV.read("test.csv"; delim=';', decimal=',', types=[String, Float64, Float64, Float64])
EDIT: Removed hint to alternatively use readtable from DataFrames.jl because it seems to be deprecated in favor of CSV.read.

How to parse json escaped string with Python 3.0

I have a bunch of json escaped strings, for example
str = "what a war\/what a peace"
(the escapes are not limited to the slash "/")
I want to parse it to
"what a war/what a peace"
How can I do it with python 3.0 ?

Is there anything stopping you from using the built-in replace on string instances?
s = "what a war\/what a peace"
r = s.replace('\\', '')
print(r)
what a war/what a peace
As an addendum, I know this is just an example of what you want to do, but refrain from using names like str, list, etc. You'll regret it later.
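Note that stripping every backslash also mangles other escapes such as \n or \u00e9. Since the question says the escapes are not limited to the slash, one alternative is to let the standard json module decode them. This is only a sketch: `unescape_json_string` is a made-up helper name, and it assumes the input is the content of a JSON string without its surrounding quotes.

```python
import json


def unescape_json_string(escaped: str) -> str:
    """Decode the JSON escapes in `escaped` (given without surrounding quotes)."""
    # Assumption: the input contains no unescaped double quotes or raw
    # control characters, so wrapping it in quotes yields a valid JSON string.
    return json.loads('"' + escaped + '"')


print(unescape_json_string(r"what a war\/what a peace"))  # what a war/what a peace
print(unescape_json_string(r"caf\u00e9 \"bar\""))         # café "bar"
```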

Can languages with char counts be described by context free grammars?

I am looking at the German HBCI/FinTS protocol. One peculiarity of this protocol is that it can contain binary blobs, which are prefixed by #NUM_OF_BINARY_CHARS#. Otherwise the protocol is quite simple; a grammar could be described as follows (a bit simplified, terminals are quoted by "):
message = segment+
segment = elements "'"
elements = element "+" elements | element
element = items
items = item ":" items | item
item = [a-zA-Z0-9,._-]* | escaped item
escaped = ?[-#?_-a-zA-Z0-9,.]
The # is missing here!
A sample message could look something like this
FirstSegment+Elem1+Item1:Item2+#4#:'+#+The_last_four_chars_are_binary+Elem4'SecondSegment+Elem5'
Can this language (with the escaping of binary strings) be described by a context free grammar?

No, this language is not context-free. The format you're describing is essentially equivalent to this language
{ #n#w | n is a natural number and |w| = n }
You can show that this isn't context-free by using the pumping lemma for context-free languages. Let the pumping length be p and consider the string #1^p#a^(11...1): the length field 1^p is the decimal number written with p ones, and the payload consists of that many copies of the character a. In other words, the string encodes a binary blob that should have length 11...1 (p ones). Now split the string into u, v, x, y, z where |vy| ≥ 1 and |vxy| ≤ p. If v or y contains a # sign, then uv^0xy^0z isn't in the language because it doesn't have enough # signs. If v and y are both contained in 1^p, then pumping up produces a string that is not in the language, because the payload no longer has the announced size. Similarly, if v and y are both contained in the payload a^(11...1), pumping up or down makes the payload the wrong size. Finally, if v is in the length field and y is in the payload, pumping v and y up simultaneously still gives the payload the wrong length: the length field is written in decimal, so each extra digit multiplies the required payload size by about ten, while each extra copy of y only adds |y| characters to the payload.
Hope this helps!
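Even though no context-free grammar captures the length prefix, a hand-written scanner can handle it simply by counting. The following is a rough sketch, not a FinTS implementation: `split_segments` is a hypothetical name, and it assumes that `?` escapes exactly one following character and that an unescaped `#` always opens a `#N#` length prefix.

```python
def split_segments(message: str) -> list[str]:
    """Split a message into segments at "'" terminators, skipping binary blobs."""
    segments = []
    current = []
    i = 0
    while i < len(message):
        ch = message[i]
        if ch == "?":
            # Escape: copy the "?" and the escaped character verbatim.
            current.append(message[i:i + 2])
            i += 2
        elif ch == "#":
            # Length prefix "#N#": read N, then copy the next N characters
            # blindly, whatever they are.
            end = message.index("#", i + 1)
            length = int(message[i + 1:end])
            blob_end = end + 1 + length
            current.append(message[i:blob_end])
            i = blob_end
        elif ch == "'":
            # Segment terminator outside any escape or blob.
            segments.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    return segments  # text after the last "'" is ignored in this sketch


message = ("FirstSegment+Elem1+Item1:Item2+#4#:'+#"
           "+The_last_four_chars_are_binary+Elem4'SecondSegment+Elem5'")
for segment in split_segments(message):
    print(segment)
```

The counting step in the `#` branch is exactly the part that a context-free grammar cannot express, which is why parsers for such formats usually handle length-prefixed fields in the lexer.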

Ruby override .index() in String to search for a character or its HTML equivalent

So... I've been working with WYSIWYG editors and have realized that they occasionally replace certain characters with the HTML character codes for those characters, such as &#39; for ' or &amp; for &.
How do I override String's index method so that it also matches these codes?
For example, when I call somestring.index("\'hello there"), how do I get it to search for both \' and &#39;?
Note: the single quote is escaped for clarity against the double quotes.
What is the most efficient way to do this kind of string search?
Is there something like this already built in?
Also, since I'm using external tools, I don't really have a say in the format things are in.

THE SOLUTION:
search_reg_exp = Regexp.new(Regexp.escape(str).gsub(/(both|options|or|more)/, "(both|options|or|more)"))
long_str.index(search_reg_exp)
ORIGINAL ANSWER:
String#index doesn't just work for single characters, it can be used for a substring of any length, and you can give it a regular expression which would probably be best in this case:
some_string = "Russell's teapot"
another_string = "Russell&#39;s teapot"
apostrophe_expr = /'|&#39;/
some_string.index apostrophe_expr
# => 7
another_string.index apostrophe_expr
# => 7
Another option would be to just decode the HTML entities before you start manipulating the string. There are various gems for this including html_helpers:
require 'html_helpers'
another_string = "Russell&#39;s teapot"
yet_another_string = HTML::EntityCoder.decode_entities another_string
# => "Russell's teapot"
yet_another_string.index "'"
# => 7
yet_another_string.index ?' # bonus syntax tip--Ruby 1.9.1+
# => 7
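The "decode the entities first" option can be sketched with Python's standard library, whose `html.unescape` handles named, decimal and hexadecimal references. The helper name `index_ignoring_entities` is made up for illustration; note that the returned position refers to the decoded text, not to the original markup.

```python
import html


def index_ignoring_entities(text: str, needle: str) -> int:
    """Return the position of `needle` in `text` after decoding HTML entities, or -1."""
    # Decode both sides so that "'" and "&#39;" compare equal.
    return html.unescape(text).find(html.unescape(needle))


print(index_ignoring_entities("Russell&#39;s teapot", "'"))   # 7
print(index_ignoring_entities("Russell&#x27;s teapot", "'"))  # 7
print(index_ignoring_entities("Russell's teapot", "&#39;"))   # 7
```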

How to convert data to CSV or HTML format on iOS?

In my iOS application I need to export some data into CSV or HTML format. How can I do this?

RegexKitLite comes with an example of how to read a csv file into an NSArray of NSArrays, and to go in the reverse direction is pretty trivial.
It'd be something like this (warning: code typed in browser):
NSArray * data = ...; //An NSArray of NSArrays of NSStrings
NSMutableString * csv = [NSMutableString string];
for (NSArray * line in data) {
    NSMutableArray * formattedLine = [NSMutableArray array];
    for (NSString * rawField in line) {
        //work on a local copy: the loop variable cannot be reassigned under ARC
        NSString * field = rawField;
        BOOL shouldQuote = NO;
        NSRange r = [field rangeOfString:@","];
        //fields that contain a , must be quoted
        if (r.location != NSNotFound) {
            shouldQuote = YES;
        }
        r = [field rangeOfString:@"\""];
        //fields that contain a " must have them escaped to "" and be quoted
        if (r.location != NSNotFound) {
            field = [field stringByReplacingOccurrencesOfString:@"\"" withString:@"\"\""];
            shouldQuote = YES;
        }
        if (shouldQuote == YES) {
            [formattedLine addObject:[NSString stringWithFormat:@"\"%@\"", field]];
        } else {
            [formattedLine addObject:field];
        }
    }
    NSString * combinedLine = [formattedLine componentsJoinedByString:@","];
    [csv appendFormat:@"%@\n", combinedLine];
}
[csv writeToFile:@"/path/to/file.csv" atomically:NO encoding:NSUTF8StringEncoding error:NULL];

The general solution is to use stringWithFormat: to format each row. Presumably, you're writing this to a file or socket, in which case you would write a data representation of each string (see dataUsingEncoding:) to the file handle as you create it.
If you're formatting a lot of rows, you may want to use initWithFormat: and explicit release messages, in order to avoid running out of memory by piling up too many string objects in the autorelease pool.
And always, always, always remember to escape the values correctly before passing them to the formatting method.
Escaping (along with unescaping) is a really good thing to write unit tests for. Write a function to CSV-format a single row, and have test cases that compare its result to correct output. If you have a CSV parser on hand, or you're going to need one, or you just want to be really sure your escaping is correct, write unit tests for the parsing and unescaping as well as the escaping and formatting.
If you can start with a single record containing any combination of CSV-special and/or SQL-special characters, format it, parse the formatted string, and end up with a record equal to the one you started with, you know your code is good.
(All of the above applies equally to CSV and to HTML. If possible, you might consider using XHTML, so that you can use XML validation tools and parsers, including NSXMLParser.)
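The round-trip test described above (format a record, parse it back, compare) can be sketched as follows. This is an illustration under stated assumptions rather than a complete CSV writer: `format_csv_row` is a hypothetical helper, the quoting rules are the common ones (quote fields containing a comma, a double quote or a line break, and double any embedded quotes), and Python's standard csv module stands in for "a CSV parser on hand".

```python
import csv
import io


def format_csv_row(fields: list[str]) -> str:
    """Format one record as a CSV line, quoting fields only where needed."""
    formatted = []
    for field in fields:
        if any(ch in field for ch in ',"\r\n'):
            # Double embedded quotes, then wrap the whole field in quotes.
            field = '"' + field.replace('"', '""') + '"'
        formatted.append(field)
    return ",".join(formatted) + "\r\n"


record = ["plain", "a,b", 'say "hi"', "two\nlines"]
line = format_csv_row(record)

# Round trip: parsing the formatted line should give back the original record.
parsed = next(csv.reader(io.StringIO(line, newline="")))
assert parsed == record
print(line, end="")
```

A record full of special characters, like the one above, makes a good unit-test fixture because it exercises every quoting rule at once.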

CSV - comma separated values.
I usually just iterate over the data structures in my application and output one set of values per line, with the values in each set separated by commas.
struct person
{
string first_name;
string second_name;
};
person tony = {"tony", "momo"};
person john = {"john", "smith"};
would look like
tony, momo
john, smith
