Pandas DataFrame to MySQL export with Unicode characters - mysql

I have been trying to export a large pandas dataframe using DataFrame.to_sql to a MySQL database, but the dataframe has unicode characters in some columns, some of which cause warnings during export and are converted to ?.
I managed to reproduce the issue with this example (database login removed):
import pandas as pd
import sqlalchemy
import pymysql

engine = sqlalchemy.create_engine(
    'mysql+pymysql://{}:{}@{}/{}?charset=utf8'.format(*login_info),
    encoding='utf-8')

df_test = pd.DataFrame([[u'\u010daj', 2],
                        ['čaj', 2],
                        ['špenát', 4],
                        ['květák', 7],
                        ['kuře', 1]],
                       columns=['a', 'b'])

df_test.to_sql('test', engine, if_exists='replace', index=False,
               dtype={'a': sqlalchemy.types.UnicodeText()})
The first two rows of the dataframe should be the same, just defined differently.
I get the following warning, and the problematic characters (č, ě, ř) are rendered as ?:
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x8Daj' for column 'a' at row 1")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x8Daj' for column 'a' at row 2")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x9Bt\\xC3\\xA1k' for column 'a' at row 4")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC5\\x99e' for column 'a' at row 5")
result = self._query(query)
with the resulting database table test looking like this:
a b
?aj 2
?aj 2
špenát 4
kv?ták 7
ku?e 1
Curiously, the ž, š and á characters (and others in my full dataset) are processed correctly, so it seems to only affect a subset of unicode characters. As you can see above, I also tried setting utf-8 wherever I could (engine, DataFrame.to_sql) with no effect.

pymysql:
import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
                      user='root', passwd='******',
                      charset='utf8mb4')
sqlalchemy:
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
                                   database=db_schema,
                                   query={'read_default_file': foo.db_config,
                                          'charset': 'utf8mb4'})
See "Best practice" in http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored Explanation of ?:
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is CHARACTER SET utf8 (or utf8mb4). Fix this.
Also, check that the connection during reading is UTF-8.
(Note: The CHARACTER SETs utf8 and utf8mb4 are interchangeable for European languages.)
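Applied to the original example, a minimal sketch of the fix (assuming the login_info tuple and df_test DataFrame from the question, and that the table/column charset is corrected as described):
import sqlalchemy

# charset in the URL sets the connection encoding; utf8mb4 covers all of
# Unicode (as noted above, utf8 and utf8mb4 are interchangeable for
# European languages, but utf8mb4 is the safer default).
engine = sqlalchemy.create_engine(
    'mysql+pymysql://{}:{}@{}/{}?charset=utf8mb4'.format(*login_info))

# if_exists='replace' recreates the table with the database's default
# charset, so that default should be utf8mb4 as well.
df_test.to_sql('test', engine, if_exists='replace', index=False,
               dtype={'a': sqlalchemy.types.UnicodeText()})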
These are Czech characters?

I ran into the same problem, also using the pymysql driver.
I changed the MySQL driver to mysql-connector, and the 1366 warning disappeared.
Install the mysql-connector driver:
pip install mysql-connector
Set up the SQLAlchemy engine like this:
create_engine('mysql+mysqlconnector://root:tj1996@localhost:3306/new?charset=utf8mb4')
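To verify the round trip (a quick check, assuming the engine and the test table from the question):
import pandas as pd

# Read the table back; č, ě and ř should come back intact instead of as ?
df_back = pd.read_sql('SELECT * FROM test', engine)
print(df_back)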

Related

Perl issue when encoding mysql data from UTF-8 to UCS-2 for SMPP

I am trying to fetch the UTF-8 accented characters "é" and "ê" from MySQL and convert them to UCS-2 when sending over SMPP. The data is stored as utf8_general_ci, and I perform the following when opening the DB connection:
$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");
If I test the sending part by hard-coding the string value with "é" and "ê" using data_coding=8, it goes through perfectly. However, if I comment out the first line and just use what comes from the DB, it fails. Also, if I try to send the characters from the DB with data_coding=3, it works fine, but then the "ê" does not appear, which is also expected. Here is what I use:
$fred = 'éêcole'; <-- If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);
$resp_pdu = $short_smpp->submit_sm(
    source_addr_ton  => 0x00,
    source_addr_npi  => 0x01,
    source_addr      => $didnb,
    dest_addr_ton    => 0x01,
    dest_addr_npi    => 0x01,
    destination_addr => $number,
    data_coding      => 0x08,
    short_message    => $fred
) or do {
    Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
    $success = 0;
};
The different values for the data_coding fields are the following:
Meaning of "data_coding" field in SMPP
00000000 (0) - usually GSM7
00000011 (3) for standard ISO-8859-1
00001000 (8) for the universal character set -- de facto UTF-16
The SMPP provider's documentation also mentions that special characters should be handled via UCS-2:
https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137
How should I prepare the data that is coming out of the DB to make this SMPP call work?
I am using Perl v5.10.1
Thanks !
$dbh->{'mysql_enable_utf8'} = 1; is used to decode the values returned from the database, causing queries to return decoded text (strings of Unicode Code Points). It makes no sense to decode such a string. Go straight to the encode.
my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65"; # éêcole
# -or-
use utf8; # Script is encoded using UTF-8.
my $s_ucp = "éêcole";
printf "%vX\n", $s_ucp; # E9.EA.63.6F.6C.65
my $s_ucs2be = encode('UCS-2', $s_ucp);
printf "%vX\n", $s_ucs2be; # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
SET NAMES says the encoding you have/want in the client. That is, regardless of the encoding in the table, MySQL will convert it to whatever SET NAMES says during a SELECT.
So, feed what comes from the SELECT directly to SMPP. (It won't be readable by most other clients.)
SET NAMES ucs2
(The collation is irrelevant to the encoding.)
You could ask the SELECT to convert with something like
CONVERT(col_name, CHAR UNICODE)
https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html
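As an aside (not part of the original answer), the same encode step can be sketched in Python; UCS-2 coincides with big-endian UTF-16 for Basic Multilingual Plane characters such as é and ê:
# Decoded text (code points), as mysql_enable_utf8 returns it
s = 'éêcole'

# Encode for SMPP data_coding = 8 (UCS-2, big-endian)
ucs2 = s.encode('utf-16-be')
print(ucs2.hex(' '))  # 00 e9 00 ea 00 63 00 6f 00 6c 00 65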

Warning: (1366, "Incorrect string value: '\\xB4\\xEB\\xC7\\xD1\\xB9\\xCE...' for column 'VARIABLE_VALUE' at row 1") result = self._query(query)

This error occurs whenever I try to insert the table below into a MySQL schema.
I used code like this:
self.collect.allday_list.to_sql(name=code_name, con=self.collect.engine_allday, if_exists='append')
and self.collect.engine_allday is:
self.engine_allday = create_engine("mysql+mysqldb://" + info.db_id + ":" + info.db_passwd + "@"
                                   + info.db_ip + ":" + info.db_port + "/allday_list", encoding='utf-8')
and the warning is: Warning: (1366, "Incorrect string value: '\xB4\xEB\xC7\xD1\xB9\xCE...' for column 'VARIABLE_VALUE' at row 1") result = self._query(query)
The table doesn't contain any emoji or special letters, but the error still occurs.
How do I get this table into a MySQL schema?
Somewhere in the data you are trying to insert there is text encoded in euc-kr:
>>> bytes.decode(b'\xB4\xEB\xC7\xD1\xB9\xCE', encoding='euc-kr')
'대한민'
These bytes do not represent valid text in utf-8. You can try changing the database connection encoding to euc-kr, or converting the data from euc-kr to utf-8 when you read it.
On the other hand, the screenshot of the data has only numbers, no text. This suggests you may be trying to load the wrong file or that perhaps there is a portion of the file that you didn't want to insert in the database.
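If the file really does contain euc-kr text, here is a minimal sketch of the conversion route (the byte string comes from the warning; the file path is hypothetical):
# The offending bytes decode cleanly as euc-kr ...
raw = b'\xB4\xEB\xC7\xD1\xB9\xCE'
text = raw.decode('euc-kr')           # '대한민'

# ... and re-encode as valid utf-8 for a utf-8 connection
print(text.encode('utf-8'))

# For a whole file, read it with its real encoding up front, e.g.:
# df = pd.read_csv('allday_list.csv', encoding='euc-kr')  # hypothetical path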

How to load a json file with strings including double quotes (")

I've been given a load of JSON files which I'm trying to load into python 3.5
I've already had to do some clean-up work, removing double backslashes and extra quotations; however, I've run into an issue I don't know how to solve.
I'm running the following code:
with open(filepath, 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        row = row.replace('\\', '')
        row = row.replace('"{', '{')
        row = row.replace('}"', '}')
        response = json.loads(row)
        for i in response:
            responselist.append(i['ActionName'])
However it's throwing up the error:
JSONDecodeError: Expecting ',' delimiter: line 1 column 388833 (char 388832)
The part of the JSON that's causing the issue is the status text entry below:
"StatusId":8,
"StatusIdString":"UnknownServiceError",
"StatusText":"u003cCompany docTypeu003d"Mobile.Tile" statusIdu003d"421" statusTextu003d"Start time of 11/30/2015 12:15:00 PM is more than 5 minutes in the past relative to the current time of 12/1/2015 12:27:01 AM." copyrightu003d"Copyright Company Inc." versionNumberu003d"7.3" createdDateu003d"2015-12-01T00:27:01Z" responseIdu003d"e74710c0-dc7c-42db-b608-bf905d95d153" /u003e",
"ActionName":"GetTrafficTile"
I added the line breaks to illustrate my point; it looks like Python is unhappy that the string contains double quotes.
I have a feeling this may be down to my replacing '\\' with '' messing with the Unicode escapes in the string. Is there any way to repair these nested strings? I don't mind if the StatusText field is deleted completely; all I'm after is a list of the ActionName fields.
EDIT:
I've hosted an example file here:
https://www.dropbox.com/s/1oanrneg3aqandz/2015-12-01T00%253A00%253A42.527Z_2015-12-01T00%253A01%253A17.478Z?dl=0
This is exactly as I received it, before I replaced the extra backslashes and quotations.
Here is a pared down version of the sample with one bad entry
["{\"apiServerType\":0,\"RequestId\":\"52a65260-1637-4653-a496-7555a2386340\",\"StatusId\":0,\"StatusIdString\":\"Ok\",\"StatusText\":null,\"ActionName\":\"GetCameraImage\",\"Url\":\"http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782\",\"Lat\":0.0,\"Lon\":0.0,\"iVendorId\":12561,\"iConsumerId\":2986897,\"iSliverId\":51846,\"UserId\":\"2986897\",\"HardwareId\":null,\"AuthToken\":\"vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|\",\"RequestTime\":\"2015-12-01T00:00:42.5278699Z\",\"ResponseTime\":\"2015-12-01T00:01:02.5926127Z\",\"AppId\":null,\"HttpMethod\":\"GET\",\"RequestHeaders\":\"{\\\"Connection\\\":[\\\"keep-alive\\\"],\\\"Via\\\":[\\\"HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com\\\"],\\\"Accept\\\":[\\\"application/json\\\"],\\\"Accept-Encoding\\\":[\\\"gzip\\\",\\\"deflate\\\"],\\\"Accept-Language\\\":[\\\"en-us\\\"],\\\"Host\\\":[\\\"mosi-prod.cloudapp.net\\\"],\\\"User-Agent\\\":[\\\"Traffic/5.4.0\\\",\\\"CFNetwork/758.1.6\\\",\\\"Darwin/15.0.0\\\"]}\",\"RequestContentHeaders\":\"{}\",\"RequestContentBody\":\"\",\"ResponseBody\":null,\"ResponseContentHeaders\":\"{\\\"Content-Type\\\":[\\\"image/jpeg\\\"]}\",\"ResponseHeaders\":\"{}\",\"MiniProfilerJson\":null}"]
The problem is a little different than you think. Whatever program built these files used data that was already json-encoded and ended up double and even triple encoding some of the information. I peeled it apart in a shell session and got usable python data. You can (1) go dope-slap whoever wrote the program that built this steaming pile of... um... goodness? and (2) manually scan through and decode inner json strings.
I decoded the data and it was a list of strings, but those strings looked suspiciously like json
>>> data = json.load(open('test.json'))
>>> type(data)
<class 'list'>
>>> d0 = data[0]
>>> type(d0)
<class 'str'>
>>> d0[:70]
'{"apiServerType":0,"RequestId":"52a65260-1637-4653-a496-7555a2386340",'
Sure enough, I can decode it
>>> d0_1 = json.loads(d0)
>>> type(d0_1)
<class 'dict'>
>>> d0_1
{'ResponseBody': None, 'StatusText': None, 'AppId': None, 'ResponseTime': '2015-12-01T00:01:02.5926127Z', 'HardwareId': None, 'RequestTime': '2015-12-01T00:00:42.5278699Z', 'StatusId': 0, 'Lon': 0.0, 'Url': 'http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782', 'RequestContentBody': '', 'RequestId': '52a65260-1637-4653-a496-7555a2386340', 'MiniProfilerJson': None, 'RequestContentHeaders': '{}', 'ActionName': 'GetCameraImage', 'StatusIdString': 'Ok', 'HttpMethod': 'GET', 'iSliverId': 51846, 'ResponseHeaders': '{}', 'ResponseContentHeaders': '{"Content-Type":["image/jpeg"]}', 'apiServerType': 0, 'AuthToken': 'vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|', 'iConsumerId': 2986897, 'RequestHeaders': '{"Connection":["keep-alive"],"Via":["HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com"],"Accept":["application/json"],"Accept-Encoding":["gzip","deflate"],"Accept-Language":["en-us"],"Host":["mosi-prod.cloudapp.net"],"User-Agent":["Traffic/5.4.0","CFNetwork/758.1.6","Darwin/15.0.0"]}', 'iVendorId': 12561, 'Lat': 0.0, 'UserId': '2986897'}
Picking one of the entries, that looks like more json
>>> hdrs = d0_1['RequestHeaders']
>>> type(hdrs)
<class 'str'>
Yep, it decodes to what I want
>>> hdrs_0 = json.loads(hdrs)
>>> type(hdrs_0)
<class 'dict'>
>>>
>>> hdrs_0["Via"]
['HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com']
>>>
>>> type(hdrs_0["Via"])
<class 'list'>
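Putting the session together, here is a sketch that collects every ActionName by peeling one JSON layer at a time (assuming each file is, as above, a JSON list of JSON-encoded strings; 'test.json' stands in for your file):
import json

responselist = []
with open('test.json', 'r') as json_file:
    data = json.load(json_file)        # outer layer: a list of strings
for entry in data:
    record = json.loads(entry)         # inner layer: each string is itself JSON
    responselist.append(record['ActionName'])
print(responselist)                    # e.g. ['GetCameraImage']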
Here you are :) :
responselist = []
with open('dataFile.json', 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        strActNm = 'ActionName":"'; lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"', actionAt + lenActNm + 2)
            responselist.append(row[actionAt-1 : nxtQuotAt+1])
            actionAt = row.find('ActionName":"', nxtQuotAt)
print(responselist)
which gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetTrafficTile"']
>Exit code: 0
where dataFile.json is the file with the line you provided and dataFile.py the code provided above.
It's the hard way, but if the files are in a bad format you have to work around it, and simple pattern matching works in any case. For more complex cases you will need regex (regular expressions), but in this case a simple .find() is enough to do the job.
The code also finds multiple "actions" in a line (if the line contains more than one action).
Here is the result for the file you provided in the link, using the following small modification of the code above:
responselist = []
with open('dataFile1.json', 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        strActNm = '\\"ActionName\\":\\"'
        # strActNm = 'ActionName":"'
        lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"', actionAt + lenActNm + 2)
            responselist.append(row[actionAt : nxtQuotAt+1].replace('\\', ''))
            actionAt = row.find('ActionName":"', nxtQuotAt)
print(responselist)
gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetCameraImage"']
>Exit code: 0
where dataFile1.json is the file you provided in the link.

Encapsulate complex mysql inserts with back quotes

I have exported some SQL statements and now need to replay them. The issue arises when I replay them into MySQL, because actual Java code is included in the VALUES. This code includes backslashes, quotes, and double quotes, sometimes causing MySQL to misinterpret escaping and quoting:
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
');
The above is not the full text, but the quoted bit here causes the insert to fail. A workaround is to enclose the complex text value in back quotes, e.g.
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
`'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
'`);
The question I have is: how can I sed precisely these statements to add back quotes where encapsulation is required? I want to replace 'package with `'package and '); with '`);. The closing bracket seems the most complicated, since many more statements will match it.
The Python script below can help you. It reads the SQL dump file, adds backticks around the matched regex, and prints the result to the output file. The input and output files should be given as command-line arguments.
#!/usr/bin/env python3
import re
import os
import sys

if len(sys.argv) < 3:
    print('Provide input and output file to begin processing')
    sys.exit(1)
else:
    if not os.path.isfile(sys.argv[1]):
        print('Input file does not exist')
        sys.exit(1)

ifile = sys.argv[1]
ofile = sys.argv[2]

# Compiling the regex
rex = re.compile(r"""(INSERT[\s\S]+?)('package[\s\S]+?')(\);)""")
# Reading the file as a string
dataFile = open(ifile, 'r').read()
fp = open(ofile, 'w')
# Substituting it
print(re.sub(rex, r'\1`\2`\3', dataFile), end='', file=fp)
fp.close()
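The script can then be run with the dump as input, e.g. (file names are hypothetical):
python3 add_backticks.py dump.sql dump_fixed.sql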
Output
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
`'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
'`);

R Programming - Japanese Characters Insert into MySQL

I want to insert a tab-delimited file containing both Japanese and English characters, along with special characters. I am using RMySQL to do this. One of the solutions I tried gives the error below:
dbWriteTable(con, "japan_test2", d, append = T, row.names=FALSE);
Error in mysqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '˜¨å¤œã®ã‚³ãƒ³_ text)' at line 3)
In addition: Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
[1] FALSE
Warning message:
In mysqlWriteTable(conn, name, value, ...) :
could not create table: aborting mysqlWriteTable
Current Locale: LC_COLLATE=English_United States.1252;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
Locale Tried: US, Japanese.
Encoding Tried: UTF-8,16,ASCII.
System: Windows7
RStudio Version 0.98.977
MySQL 5.4.27 CE
You probably aren't setting the encoding of the connection properly. You can try this:
con <- dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port)
# With the next line I try to get the right encoding (it works for Spanish keyboards)
encoding <- if(grepl(pattern = 'utf8|utf-8',x = Sys.getlocale(),ignore.case = T)) 'utf8' else 'latin1'
dbGetQuery(con,paste("SET names",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_server=",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_database=",encoding))
dbWriteTable( con, value = dfr, name = table, append = TRUE, row.names = FALSE )
dbDisconnect(con)
Remember that you have to use your local encoding as the encoding of the connection. I try to detect my encoding in the third line of the proposed code, then set the connection encoding according to my locale. Good luck!