I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that unicode characters in strings in these JSON files are escaped in a weird way.
Here's an example of such a string:
"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
When I try to parse the string, for example with JavaScript's JSON.parse(), and print it out, I get:
"nejnižšà bod: 0 mnm Benátky\n"
While it should be
"nejnižší bod: 0 mnm Benátky\n"
I can see that \u00c5\u00be should somehow correspond to ž but I can't figure out the general pattern.
I've been able to figure out these characters so far:
'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',
So what is this weird encoding? Is there any known tool that can correctly decode it?
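The underlying bytes are valid UTF-8, but each byte has been written as its own \u00XX escape, so JSON.parse() turns every byte into a separate UTF-16 code unit. JavaScript strings are UTF-16, not UTF-8, so you have to collect those code units back into bytes and decode the bytes as UTF-8: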
function decode(s) {
let d = new TextDecoder;
let a = s.split('').map(r => r.charCodeAt());
return d.decode(new Uint8Array(a));
}
let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);
https://developer.mozilla.org/docs/Web/API/TextDecoder
You can use a regular expression to find runs of characters that are really UTF-8 bytes, encode them as Latin-1 to get the raw bytes back, and then decode those bytes as UTF-8.
The following code should work in python3.x:
import re
re.sub(r'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'), s)
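As a minimal sketch of the same Latin-1 round trip, the snippet below parses the JSON first and then repairs every string in the resulting structure. The function name fix_mojibake and the "title" key are made up for illustration, and the fallback for strings that are not mis-encoded is an assumption about what you would want.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose characters are really UTF-8 bytes."""
    if isinstance(value, str):
        try:
            # Each character stands for one byte, so Latin-1 recovers the bytes.
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mis-encoded (or not consistently so): leave it unchanged.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value


# The example string from the question, wrapped in a hypothetical JSON object.
raw = r'{"title": "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"}'
data = fix_mojibake(json.loads(raw))
print(data["title"])
```

Working on the parsed structure avoids having to get a regular expression right for three-byte and four-byte sequences.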
The JSON file itself is UTF-8, but inside the strings each byte of the UTF-8 encoded text has been written out as a separate \u escape sequence.
This command fixes a file like this in Emacs:
(require 'dash) ; provides the --> threading macro used below
(defun k/format-facebook-backup ()
"Normalize a Facebook backup JSON file."
(interactive)
(save-excursion
(goto-char (point-min))
(let ((inhibit-read-only t)
(size (point-max))
bounds str)
(while (search-forward "\"\\u" nil t)
(message "%.f%%" (* 100 (/ (point) size 1.0)))
(setq bounds (bounds-of-thing-at-point 'string))
(when bounds
(setq str (--> (json-parse-string (buffer-substring (car bounds)
(cdr bounds)))
(string-to-list it)
(apply #'unibyte-string it)
(decode-coding-string it 'utf-8)))
(setf (buffer-substring (car bounds) (cdr bounds))
(json-serialize str))))))
(save-buffer))
Thanks to Jen's excellent question and Shawn's comment.
Basically, Facebook seems to take each individual byte of the string's UTF-8 representation and export it to JSON as if that byte were an individual Unicode code point.
What we need to do is take the last two hex digits of each escape sequence (e.g. c3 from \u00c3), concatenate them, and read the resulting bytes as a UTF-8 string.
This is how I do it in Ruby (see gist):
require 'json'
require 'uri'
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/
txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
$1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end
good_data = JSON.load(txt)
With bytes_re we catch all sequences of bad Unicode characters.
Then for each sequence replace '\u00' with '\x' (e.g. \xc3), put quotes around it " and use Ruby's built-in string parsing so that the \xc3\xbe... strings are converted to actual bytes, that will later remain as Unicode characters in the JSON or properly quoted by the #to_json method.
The [1...-1] is to remove quotes inserted by #to_json
I wanted to explain the code because the question is not Ruby-specific and the reader may be using another language.
I guess somebody could do it with a sufficiently ugly sed command.
Just adding the general rule for getting from something like '\u00c5\u0098' to 'Ř'. Putting together the last two hex digits of each \u escape gives c5 and 98, which are the two bytes of the UTF-8 representation. For code points U+0080 to U+07FF, UTF-8 uses two bytes of the form 110xxxxx 10xxxxxx, where the x bits are the actual bits of the code point. Mask the two bytes with & to keep only the x bits, place them one after the other, read the result as a number, and you get 0x158, which is the code point of 'Ř'.
My javascript implementation:
function fixEncoding(s) {
var reg = /\\u00([a-f0-9]{2})\\u00([a-f0-9]{2})/gi;
return s.replace(reg, function(a, m1, m2){
// Declare with var so b1 and b2 do not leak as implicit globals.
var b1 = parseInt(m1, 16);
var b2 = parseInt(m2, 16);
var maskedb1 = b1 & 0x1F;
var maskedb2 = b2 & 0x3F;
var result = (maskedb1 << 6) | maskedb2;
return String.fromCharCode(result);
})
}
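The bit arithmetic described above can be checked with a short Python sketch. The function name decode_two_byte is made up for illustration, and the sketch only covers the two-byte form (110xxxxx 10xxxxxx); three-byte and four-byte sequences need additional masks.

```python
def decode_two_byte(b1: int, b2: int) -> str:
    """Decode one two-byte UTF-8 sequence by masking and shifting."""
    if b1 & 0xE0 != 0xC0 or b2 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    high = b1 & 0x1F          # the five x bits of 110xxxxx
    low = b2 & 0x3F           # the six x bits of 10xxxxxx
    return chr((high << 6) | low)


# \u00c5\u0098 carries the bytes 0xC5 0x98.
code_point = ((0xC5 & 0x1F) << 6) | (0x98 & 0x3F)
print(hex(code_point))                 # 0x158
print(decode_two_byte(0xC5, 0x98))     # Ř
```

The result agrees with Python's built-in decoder, bytes([0xC5, 0x98]).decode("utf-8").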
I have a table I need to handle various characters. The characters include Ø, ® etc.
I have set my table to utf-8 as the default collation, all columns use table default, however when I try to insert these characters I get error: Incorrect string value: '\xEF\xBF\xBD' for column 'buyerName' at row 1
My connection string is defined as
string mySqlConn = "server="+server+";user="+username+";database="+database+";port="+port+";password="+password+";charset=utf8;";
I am at a loss as to why I am still seeing errors. Have I missed anything with either the .net connector, or with my MySQL setup?
--Edit--
My (new) C# insert statement looks like:
MySqlCommand insert = new MySqlCommand( "INSERT INTO fulfilled_Shipments_Data " +
"(amazonOrderId,merchantOrderId,shipmentId,shipmentItemId,"+
"amazonOrderItemId,merchantOrderItemId,purchaseDate,"+ ...
"VALUES (@amazonOrderId,@merchantOrderId,@shipmentId,@shipmentItemId,"+
"@amazonOrderItemId,@merchantOrderItemId,@purchaseDate,"+
"paymentsDate,shipmentDate,reportingDate,buyerEmail,buyerName,"+ ...
insert.Parameters.AddWithValue("@amazonOrderId",lines[0]);
insert.Parameters.AddWithValue("@merchantOrderId",lines[1]);
insert.Parameters.AddWithValue("@shipmentId",lines[2]);
insert.Parameters.AddWithValue("@shipmentItemId",lines[3]);
insert.Parameters.AddWithValue("@amazonOrderItemId",lines[4]);
insert.Parameters.AddWithValue("@merchantOrderItemId",lines[5]);
insert.Parameters.AddWithValue("@purchaseDate",lines[6]);
insert.Parameters.AddWithValue("@paymentsDate",lines[7]);
insert.ExecuteNonQuery();
Assuming that this is the correct way to use parametrized statements, it is still giving an error
"Incorrect string value: '\xEF\xBF\xBD' for column 'buyerName' at row 1"
Any other ideas?
\xEF\xBF\xBD is the UTF-8 encoding for the unicode character U+FFFD. This is a special character, also known as the "Replacement character". A quote from the wikipedia page about the special unicode characters:
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. It is used to indicate problems when a system is not able to decode a stream of data to a correct symbol. It is most commonly seen when a font does not contain a character, but is also seen when the data is invalid and does not match any character:
So it looks like your data source contains corrupted data. It is also possible that you try to read the data using the wrong encoding. Where do the lines come from?
If you can't fix the data, and your input indeed contains invalid characters, you could just remove the replacement characters:
lines[n] = lines[n].Replace("\uFFFD", "");
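To illustrate how a replacement character can end up in the data, here is a small Python sketch. It assumes, purely for illustration, that the source file is Latin-1 encoded and is being read as UTF-8; the buyer name "JØrgen" is made up.

```python
REPLACEMENT = "\ufffd"

# U+FFFD encoded as UTF-8 is exactly the byte sequence from the error message.
utf8_bytes = REPLACEMENT.encode("utf-8")
print(utf8_bytes)  # b'\xef\xbf\xbd'

# A hypothetical buyer name stored as Latin-1: "Ø" is the single byte 0xD8.
latin1_bytes = "J\u00d8rgen".encode("latin-1")

# Reading those bytes as UTF-8 fails on 0xD8; a lenient decoder substitutes U+FFFD.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Reading them with the encoding they were written in keeps the character.
correct = latin1_bytes.decode("latin-1")

print(misread)
print(correct)
```

If this is what is happening, reading the file with its real encoding is a better fix than stripping the replacement characters afterwards, because stripping loses the original letters.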
Mattmanser is right, never write a sql query by concatenating the parameters directly in the query. An example of parametrized query is:
string lastname = "Doe";
double height = 6.1;
DateTime birthDate = new DateTime(1978,4,18);
var connection = new MySqlConnection(connStr);
try
{
    connection.Open();
    var command = new MySqlCommand(
        "SELECT * FROM tblPerson WHERE LastName = @Name AND Height > @Height AND BirthDate < @BirthDate", connection);
    // Each placeholder in the query gets exactly one parameter with the same name.
    command.Parameters.AddWithValue("@Name", lastname);
    command.Parameters.AddWithValue("@Height", height);
    command.Parameters.AddWithValue("@BirthDate", birthDate);
    MySqlDataReader reader = command.ExecuteReader();
    ...
}
finally
{
    connection.Close();
}
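The same pattern, sketched in Python for comparison. This uses the standard-library sqlite3 module with an in-memory database instead of MySQL so that it runs without a server; the table contents are made up, and the placeholder syntax (:name) is sqlite3's, not Connector/NET's.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblPerson (LastName TEXT, Height REAL, BirthDate TEXT)")

# Values are passed separately from the SQL text, so non-ASCII characters
# and quotes never need manual escaping.
people = [
    ("Doe", 6.2, "1970-01-01"),
    ("\u00d8stergaard\u00ae", 5.9, "1980-01-01"),
]
conn.executemany("INSERT INTO tblPerson VALUES (?, ?, ?)", people)

rows = conn.execute(
    "SELECT LastName FROM tblPerson "
    "WHERE LastName = :name AND Height > :height AND BirthDate < :birth_date",
    {"name": "Doe", "height": 6.1, "birth_date": date(1978, 4, 18).isoformat()},
).fetchall()
stored = conn.execute("SELECT LastName FROM tblPerson ORDER BY Height").fetchall()
conn.close()

print(rows)
```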
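To those who have a similar problem using PHP: try utf8_encode($string), which converts an ISO-8859-1 string to UTF-8. It only helps when the input really is ISO-8859-1, and the function is deprecated as of PHP 8.2.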
I had the same problem when my website's encoding was UTF-8 and I tried to submit a CP-1250 string in a form (for example, one taken from a listdir of directories).
I think you must send the string in the same encoding as the website.
I have an application that allows users to persist strings to a database, and those strings may contain emojis. The problem is that an emoji such as 😊 gets stored in MySQL as ðŸ˜Š
When I retrieve this string using a PHP MySQL client and render it in a web browser, it renders fine, probably because the Content-Type is set to UTF-8. When I try to read the string in node.js, I get back what I think is the ISO-8859-1 encoding, a literal ðŸ˜Š. The charset on the table is set to latin1, and that's where I'm getting ISO-8859-1 from.
What's the right way to encode the string in node.js so that I can see the emoji and not the encoding set by MySQL when I console.log the string?
ðŸ˜Š is Mojibake for 😊. Interpreting the former as latin1, you get hex F09F988A, which is the UTF-8 hex for that Emoji.
(Note: UTF-8 outside MySQL is equivalent to utf8mb4 inside MySQL.)
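The round trip can be reproduced in a few lines of Python. This sketch uses Python's cp1252 codec as a stand-in for MySQL's latin1, which is an assumption that holds for these four bytes but not for every byte value.

```python
emoji = "\U0001F60A"  # the smiling-face emoji from the question

# The emoji's UTF-8 bytes.
utf8_bytes = emoji.encode("utf-8")
print(utf8_bytes.hex().upper())  # F09F988A

# Misreading those bytes one at a time produces the four-character Mojibake.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Undoing the misreading recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == emoji)  # True
```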
In MySQL, you must have the column/table declared with CHARACTER SET utf8mb4. You must also state that the data being stored/fetched is encoded utf8mb4. Note: utf8 will not suffice.
Do a SELECT HEX(col) FROM ... to see if you get that hex for that Emoji. If that is the case and the column is currently latin1, then part of the fix is to carefully convert the column to utf8mb4. That is, you have CHARACTER SET latin1, but have UTF-8 bytes in it; this will leave bytes alone while fixing charset. Assuming the column is already VARCHAR(111) CHARACTER SET latin1 NOT NULL, then do this 2-step ALTER:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Virtually any other conversion mechanism will make a worse mess.
As for establishing the connection correctly, it goes something like this for node.js:
var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
You do not need to convert encodings, and you should not. Just use the right encoding at every step. If you send the HTML page in UTF-8, the browser will send the data back to your server in UTF-8.
Storing that data in a database that is set to latin1 will not work. You must convert your database to UTF-8 as well. That includes the database, the tables, and possibly the columns themselves. Also make sure that your database client is configured to connect in UTF-8, because the client itself has to declare its encoding.
Once the whole data flow is in UTF-8, everything will work:
Server -> GET HTML -> POST -> Server -> SQL Client -> Database -> Table -> Column
It is recommended to use iconv (a simple ISO-8859-1 to UTF-8 conversion).
From this gist
var iconv = require('iconv');
function toUTF8(body) {
// convert from iso-8859-1 to utf-8
var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
var buf = ic.convert(body);
return buf.toString('utf-8');
}
Here, if you pass in anything encoded as ISO-8859-1, it returns the UTF-8 equivalent.
For example,
toUTF8("ðŸ˜Š");
will return 😊
I have found a super dirty way to convert it back:
const isoToUtfTable = {
'ð': 0xf0,
'Ÿ': 0x9f,
'˜': 0x98,
'Š': 0x8a
};
function convertISO8859ToUtf8(s) {
const buf = new Uint8Array([...s].map(c => isoToUtfTable[c]));
return String.fromCharCode(...buf)
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
console.log(decode_utf8(convertISO8859ToUtf8('ðŸ˜Š')))
Now you simply need to complete the isoToUtfTable table (it's small, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
Maybe try to look at node-iconv.
const { Iconv } = require('iconv');
const iconv = new Iconv('ISO-8859-2', 'UTF-8');
const buffer = iconv.convert(something);
console.log(buffer);
console.log(buffer.toString('UTF8'));
This is a full answer based on @zerkms's solution:
const isoToUtfTable = {
'€':0x80,
'na':0x81,
'‚':0x82,
'ƒ':0x83,
'„':0x84,
'…':0x85,
'†':0x86,
'‡':0x87,
'ˆ':0x88,
'‰':0x89,
'Š':0x8a,
'‹':0x8b,
'Œ':0x8c,
'na':0x8d,
'Ž':0x8e,
'na':0x8f,
'na':0x90,
'‘':0x91,
'’':0x92,
'“':0x93,
'”':0x94,
'•':0x95,
'–':0x96,
'—':0x97,
'˜':0x98,
'™':0x99,
'š':0x9a,
'›':0x9b,
'œ':0x9c,
'na':0x9d,
'ž':0x9e,
'Ÿ':0x9f,
'NSBP':0xa0,
'¡':0xa1,
'¢':0xa2,
'£':0xa3,
'¤':0xa4,
'¥':0xa5,
'¦':0xa6,
'§':0xa7,
'¨':0xa8,
'©':0xa9,
'ª':0xaa,
'«':0xab,
'¬':0xac,
'SHY':0xad,
'®':0xae,
'¯':0xaf,
'°':0xb0,
'±':0xb1,
'²':0xb2,
'³':0xb3,
'´':0xb4,
'µ':0xb5,
'¶':0xb6,
'·':0xb7,
'¸':0xb8,
'¹':0xb9,
'º':0xba,
'»':0xbb,
'¼':0xbc,
'½':0xbd,
'¾':0xbe,
'¿':0xbf,
'À':0xc0,
'Á':0xc1,
'Â':0xc2,
'Ã':0xc3,
'Ä':0xc4,
'Å':0xc5,
'Æ':0xc6,
'Ç':0xc7,
'È':0xc8,
'É':0xc9,
'Ê':0xca,
'Ë':0xcb,
'Ì':0xcc,
'Í':0xcd,
'Î':0xce,
'Ï':0xcf,
'Ð':0xd0,
'Ñ':0xd1,
'Ò':0xd2,
'Ó':0xd3,
'Ô':0xd4,
'Õ':0xd5,
'Ö':0xd6,
'×':0xd7,
'Ø':0xd8,
'Ù':0xd9,
'Ú':0xda,
'Û':0xdb,
'Ü':0xdc,
'Ý':0xdd,
'Þ':0xde,
'ß':0xdf,
'à':0xe0,
'á':0xe1,
'â':0xe2,
'ã':0xe3,
'ä':0xe4,
'å':0xe5,
'æ':0xe6,
'ç':0xe7,
'è':0xe8,
'é':0xe9,
'ê':0xea,
'ë':0xeb,
'ì':0xec,
'í':0xed,
'î':0xee,
'ï':0xef,
'ð':0xf0,
'ñ':0xf1,
'ò':0xf2,
'ó':0xf3,
'ô':0xf4,
'õ':0xf5,
'ö':0xf6,
'÷':0xf7,
'ø':0xf8,
'ù':0xf9,
'ú':0xfa,
'û':0xfb,
'ü':0xfc,
'ý':0xfd,
'þ':0xfe,
'ÿ':0xff }
let offsetArray = [];
function convertISO8859ToUtf8Simple(s) {
offsetArray = [];
const buf = new Uint8Array([...s].map((c, index) =>
{
if(isoToUtfTable[c]) {
if(offsetArray.length > 0 && offsetArray[offsetArray.length -1]+3 < index) {
offsetArray.push(index);
}
if(offsetArray.length == 0) {
offsetArray.push(index);
}
}
return isoToUtfTable[c];
}
));
return String.fromCharCode(...buf);
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
function emojiStringToArray(str) {
    // Split on surrogate pairs; the capture group keeps each pair in the result.
    const split = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
    const arr = [];
    for (let i = 0; i < split.length; i++) {
        const char = split[i];
        if (char !== "" && !char.includes('\u0000')) {
            arr.push(char);
        }
    }
    return arr;
}
const string = 'hello 😌😌 with some emojis 😊 🐝';
function finalString(s) {
const emojis = emojiStringToArray(decode_utf8(convertISO8859ToUtf8Simple(s)));
for(let i = 0; i<offsetArray.length; i++){
let position = 0;
if (i == 0) {
position = offsetArray[i];
} else {
position = (i * -3) + offsetArray[i] + (i);
}
s = [s.slice(0, position), emojis[i], s.slice(position+4)].join('');
}
return s;
}
console.log(finalString(string));
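For comparison, a table-free sketch of the same repair in Python. It leans on the built-in cp1252 codec instead of a hand-written lookup table, and assumes that the five byte values cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) arrive as the C1 control characters with the same numbers. The function names are made up for illustration.

```python
def mojibake_to_bytes(text: str) -> bytes:
    """Map each character back to the single byte it was misread from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # not something a one-byte misreading could produce
            out.append(ord(ch))  # undefined in cp1252: use the raw byte value
    return bytes(out)


def repair(text: str) -> str:
    """Reinterpret the recovered bytes as UTF-8."""
    return mojibake_to_bytes(text).decode("utf-8")


# "hello " followed by the Mojibake form of U+1F60A.
print(repair("hello \u00f0\u0178\u02dc\u0160"))
# The bee U+1F41D contains the bytes 0x90 and 0x9D, which cp1252 does not define.
print(repair("\u00f0\u0178\u0090\u009d"))
```

A string that was never mis-encoded but contains characters such as é raises UnicodeDecodeError here, so callers should decide whether to catch that and keep the input unchanged.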
I am inserting arbitrary binary data into a mysql database MEDIUMBLOB using the below code. I am writing the same data to a file, from the same program. I then create a file from the DB contents:
select data from table where tag=95 order by date, time into outfile "dbout";
I then compare the output written directly to the file to the output in dbout. There are escape (0x5c, '\') characters before some bytes in the dbout file (e.g. before 0x00). This garbles the output from the database. My understanding was that by using a MEDIUMBLOB and prepared statements, I could avoid this problem. Initially I was using mysql_real_escape_string with a regular INSERT, and having the problem. Nothing seems to be fixing this.
void
insertdb(int16_t *data, size_t size, size_t nmemb)
{
static int16_t *fwbuf; // static so the buffer survives between calls; I have also tried this as char *fwbuf
unsigned long i;
struct tm *info;
time_t rawtime;
char dbuf[12];
char tbuf[12];
if(fwinitialized==0){
fwbuf = malloc(CHUNK_SZ);
fwinitialized = 1;
}
if(fwindex + (nmemb*size) + 1 >= CHUNK_SZ || do_exit == 1){
MYSQL_STMT *stmt = mysql_stmt_init(con);
MYSQL_BIND param[1];
time(&rawtime);
info = localtime(&rawtime);
snprintf(dbuf, sizeof(dbuf), "%d-%02d-%02d", 1900+info->tm_year, 1+info->tm_mon, info->tm_mday);
snprintf(tbuf, sizeof(tbuf), "%02d:%02d:%02d", info->tm_hour, info->tm_min, info->tm_sec);
char *tmp = "INSERT INTO %s (date, time, tag, data) VALUES ('%s', '%s', %d, ?)";
int len = strlen(tmp)+strlen(db_mon_table)+strlen(dbuf)+strlen(tbuf)+MAX_TAG_LEN+1;
char *sql = (char *) malloc(len);
int sqllen = snprintf(sql, len, tmp, db_mon_table, dbuf, tbuf, tag);
if(mysql_stmt_prepare(stmt, sql, strlen(sql)) != 0){
printf("Unable to create session: mysql_stmt_prepare()\n");
exit(1);
}
memset(param, 0, sizeof(param));
param[0].buffer_type = MYSQL_TYPE_MEDIUM_BLOB;
param[0].buffer = fwbuf;
param[0].is_unsigned = 0;
param[0].is_null = 0;
param[0].length = &fwindex;
if(mysql_stmt_bind_param(stmt, param) != 0){
printf("Unable to create session: mysql_stmt_bind_param()\n");
exit(1);
}
if(mysql_stmt_execute(stmt) != 0){
printf("Unable to execute session: mysql_stmt_execute()\n");
exit(1);
}
printf("closing\n");
mysql_stmt_close(stmt);
free(sql);
fwindex = 0;
} else {
memcpy((void *) fwbuf+fwindex, (void *) data, nmemb*size);
fwindex += (nmemb*size);
}
}
So, why the escape characters in the database? I have tried a couple of combinations of hex/unhex in the program and when creating the file from MySQL. That didn't seem to help either. Isn't inserting arbitrary binary data into a database a common thing with a well-defined solution?
P.S. - Is it ok to have prepared statements that open, insert, and close like this, or are prepared statements generally for looping and inserting a bunch of data before closing?
PPS - Maybe this is important to the problem: When I try to use UNHEX like this:
select unhex(data) from table where tag=95 order by date, time into outfile "dbout";
the output is very short (less than a dozen bytes, truncated for some reason).
As MEDIUMBLOB can contain any character (even an ASCII NUL) MySQL normally escapes the output so you can tell when fields end. You can control this using ESCAPED BY. The documentation is here. Below is an excerpt. According to the last paragraph below (which I've put in bold), you can entirely disable escaping. I have never tried that, for the reason in the last sentence.
FIELDS ESCAPED BY controls how to write special characters. If the FIELDS ESCAPED BY character is not empty, it is used when necessary to avoid ambiguity as a prefix that precedes following characters on output:
The FIELDS ESCAPED BY character
The FIELDS [OPTIONALLY] ENCLOSED BY character
The first character of the FIELDS TERMINATED BY and LINES TERMINATED BY values
ASCII NUL (the zero-valued byte; what is actually written following the escape character is ASCII "0", not a zero-valued byte)
The FIELDS TERMINATED BY, ENCLOSED BY, ESCAPED BY, or LINES TERMINATED BY characters must be escaped so that you can read the file back in reliably. ASCII NUL is escaped to make it easier to view with some pagers.
The resulting file does not have to conform to SQL syntax, so nothing else need be escaped.
If the FIELDS ESCAPED BY character is empty, no characters are escaped and NULL is output as NULL, not \N. It is probably not a good idea to specify an empty escape character, particularly if field values in your data contain any of the characters in the list just given.
A better strategy (if you only need one blob in the output file) is SELECT INTO ... DUMPFILE, documented on the same page, per the below:
If you use INTO DUMPFILE instead of INTO OUTFILE, MySQL writes only one row into the file, without any column or line termination and without performing any escape processing. This is useful if you want to store a BLOB value in a file.
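To make the escaping rule concrete, here is a simplified Python model of what INTO OUTFILE does to a blob under the default options (tab field terminator, newline line terminator, backslash escape character, no enclosing character), together with the inverse. It is only a sketch of the rules quoted above, not MySQL's implementation, and it ignores the \N marker for NULL values. The function names are made up.

```python
ESC, TAB, NEWLINE, NUL = 0x5C, 0x09, 0x0A, 0x00


def outfile_escape(value: bytes) -> bytes:
    """Escape one field the way the quoted rules describe (default options)."""
    out = bytearray()
    for b in value:
        if b == NUL:
            out += b"\\0"             # NUL becomes backslash + ASCII "0"
        elif b in (ESC, TAB, NEWLINE):
            out += bytes([ESC, b])    # escape char and terminators get a prefix
        else:
            out.append(b)
    return bytes(out)


def outfile_unescape(data: bytes) -> bytes:
    """Reverse outfile_escape for a single field."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data):
            nxt = data[i + 1]
            out.append(NUL if nxt == ord("0") else nxt)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


blob = bytes([0x00, 0x5C, 0x41, 0x09, 0x0A, 0xFF])
escaped = outfile_escape(blob)
print(escaped)
print(outfile_unescape(escaped) == blob)  # True
```

This is why the file written directly by the program and the dbout file differ: the stored bytes are intact, and the extra backslashes are added only on export.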
Example based on the 好 Chinese character (utf8:E5A5BD, utf16:597D), MySQL 5.5.35 UTF-8 Unicode
I can get UTF-8 code point from character:
SELECT HEX('好');
=> E5A5BD
I can get UTF-16 encoded character from UTF-16 code point:
SELECT CHAR(0x597D USING utf16);
=> 好
But then how to get to the related UTF-8 code point?
And I can't figure out how to get from the UTF-8 code point back to anywhere, neither to the character nor to the UTF-16 code point.
Any suggestion?
You can use the CONVERT function to encode the string in UTF-8, and then the HEX function to get the hexadecimal representation.
SELECT hex(convert(CHAR(0x597D using utf16) using utf8));
=> E5A5BD
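The same conversions can be cross-checked outside MySQL. A minimal Python sketch, using big-endian UTF-16 so that the hex matches the value used with CHAR(... USING utf16):

```python
ch = "\u597d"  # the character from the question

# Character -> UTF-8 bytes and UTF-16 code unit, as upper-case hex.
utf8_hex = ch.encode("utf-8").hex().upper()
utf16_hex = ch.encode("utf-16-be").hex().upper()
print(utf8_hex, utf16_hex)  # E5A5BD 597D

# UTF-8 bytes -> character -> code point, the direction the question asks about.
decoded = bytes.fromhex("E5A5BD").decode("utf-8")
code_point = ord(decoded)
print(decoded == ch, hex(code_point))  # True 0x597d
```

For characters in the Basic Multilingual Plane such as this one, the UTF-16 code unit and the Unicode code point are the same number.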
If you want to check emoji (U+10000 and later):
-- initialize character set utf8mb4
SET NAMES 'utf8mb4';
-- codepoint: U+1F42C (DOLPHIN)
-- UTF-32: 0x0001F42C
-- UTF-16: 0xD83D 0xDC2C
-- UTF-8 : 0xF0 0x9F 0x90 0xAC
-- UTF-32 -> UTF-16
-- result: D83DDC2C
SELECT HEX(CONVERT(CHAR(0x1F42C using utf32) using utf16));
-- UTF-16 -> UTF-8
-- result: F09F90AC
SELECT HEX(CONVERT(CHAR(0xD83DDC2C USING utf16) USING utf8mb4));
-- UTF-8 -> UTF-32
-- result: 0001F42C
SELECT HEX(CONVERT(CHAR(0xF09F90AC USING utf8mb4) USING utf32));
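For code points above U+FFFF, UTF-16 needs two code units (a surrogate pair), which is why the UTF-16 hex above is eight digits long. A short Python sketch of that arithmetic, checked against the built-in codecs; the function name utf16_surrogates is made up for illustration.

```python
def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


dolphin = 0x1F42C
high, low = utf16_surrogates(dolphin)
print(f"{high:04X} {low:04X}")  # D83D DC2C

# The same three encodings as in the SQL above.
ch = chr(dolphin)
utf16_hex = ch.encode("utf-16-be").hex().upper()
utf8_hex = ch.encode("utf-8").hex().upper()
utf32_hex = ch.encode("utf-32-be").hex().upper()
print(utf16_hex, utf8_hex, utf32_hex)  # D83DDC2C F09F90AC 0001F42C
```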