Seems like I opened another chapter of the encoding from hell book. I need help with a problem I run into when reading data from and writing data to a MySQL database with R. After a good amount of time I was able to write my data back, but I still don't understand what exactly is going on.
library(RMySQL)
library(dplyr)   # provides tbl() and as_tibble()
library(dbplyr)
con <- dbConnect(MySQL(),
host = "localhost",
user = "root",
dbname="test",
password = rstudioapi::askForPassword("Database password"))
address <- as_tibble(tbl(con, "address"))
The pulled address dataframe looks like
address <- structure(list(address_id = c(1809463, 2213341, 2614879, 4536353
), street = c("5, RUE DU GRAND CORMORAN APPT. C15", "14, PLACE EGLISE",
"1058 TENNESSEE", "38 ALLEE GERARD DE NERVAL"), city = c("31240 L AÂ°NION",
"85140 L AÂIE", "ELK GROVE VILLAGE AÂ¨LLINOIS 60007",
"F-69360 SAINT-SYPHORIEN D AÂZON"
)), .Names = c("address_id", "street", "city"), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
You can see right away that there are some encoding issues in address$city, so I run
address$city <- iconv(address$city, from = "UTF-8", "windows-1252")
This seems to fix it, as everything looks fine now. But as soon as I try to write the table back to MySQL, I run into encoding problems again and get the following error:
dbWriteTable(con, value =address, name = "address_cleaned", overwrite=TRUE ,rownames = FALSE )
Error in .local(conn, statement, ...) :
could not run statement: Invalid utf8 character string: '31240 L A'
What I do now fixes the problem but I don't really understand what is going on.
Encoding(address$city) <- 'UTF-8'
address$city <- iconv(address$city, from = "windows-1252","UTF-8")
address$city <- iconv(address$city, from = "latin1","UTF-8")
While this code works, it seems more like a workaround than a real solution. I'm sure it has to do with the encoding of the MySQL data as well as with Windows being my OS, but I wonder if there is a more elegant solution.
Additional info
dbGetQuery(con, "SHOW VARIABLES LIKE 'character_set_%';")
Variable_name Value
1 character_set_client utf8
2 character_set_connection utf8
3 character_set_database utf8
4 character_set_filesystem binary
5 character_set_results utf8
6 character_set_server utf8
7 character_set_system utf8
8 character_sets_dir C:\\Program Files\\MySQL\\MySQL Server 5.7\\share\\charsets\\
and
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Edit 1. hex
1809463 31240 L A°NION 3331323430204C2041C2B04E494F4E
2213341 85140 L AIE 3835313430204C2041C2904945
2614879 ELK GROVE VILLAGE A¨LLINOIS 60007 454C4B2047524F56452056494C4C4147452041C2A84C4C494E4F4953203630303037
4536353 F-69360 SAINT-SYPHORIEN D AZON 462D3639333630205341494E542D535950484F5249454E20442041C2905A4F4E
Do not use any conversion functions; they will probably make things worse.
Â¨ is Mojibake for ¨, and Â° for °. Since I see an A before each of those, I guess you are trying to enter an accented A by first typing the A and then the accent, but your data entry tool is failing to combine them. What editor are you using?
(Yes, you have 'opened another chapter of the encoding from hell book' -- I have seen a lot of character set problems, but not this one until now.)
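To see why the same stored bytes can show up as A° in one place and AÂ° in another, the hex dump from the question can be decoded under both encodings. This is only a minimal Python sketch for illustration (the variable names are made up, and it is not the asker's R workflow):

```python
# Bytes for address_id 1809463, copied from the hex dump in the question.
raw = bytes.fromhex("3331323430204C2041C2B04E494F4E")

# Read as UTF-8 (the connection character set): C2 B0 is one character, the degree sign.
as_utf8 = raw.decode("utf-8")

# Read as Windows-1252 (the R session's locale): C2 and B0 become two characters.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)    # 31240 L A°NION
print(as_cp1252)  # 31240 L AÂ°NION
```

Nothing in the database changes between the two readings; only the decoding step differs, which is why piling further conversions on top tends to compound the problem.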
Related
I am trying to fetch accented UTF-8 characters ("é", "ê") from MySQL and convert them to UCS-2 when sending over SMPP. The data is stored as utf8_general_ci, and I do the following when opening the DB connection:
$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");
If I test the sending part by hard-coding the string value with "é" "ê" and using data_coding=8, it goes through perfectly. However, if I comment out the first line below and just use what comes from the DB, it fails. If I send the characters from the DB with data_coding=3, it also works, but then the "ê" does not appear, which is expected. Here is what I use:
$fred = 'éêcole'; # <-- If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);
$resp_pdu = $short_smpp->submit_sm(
source_addr_ton => 0x00,
source_addr_npi => 0x01,
source_addr => $didnb,
dest_addr_ton => 0x01,
dest_addr_npi => 0x01,
destination_addr => $number,
data_coding => 0x08,
short_message => $fred
) or do {
Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
$success = 0;
};
The different values for the data_coding fields are the following:
Meaning of "data_coding" field in SMPP
00000000 (0) - usually GSM7
00000011 (3) for standard ISO-8859-1
00001000 (8) for the universal character set -- de facto UTF-16
The SMPP provider's documentation also mentions that special characters should be handled via UCS-2:
https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137
How should I prepare the data that is coming out of the DB to make this SMPP call work?
I am using Perl v5.10.1
Thanks !
$dbh->{'mysql_enable_utf8'} = 1; is used to decode the values returned from the database, causing queries to return decoded text (strings of Unicode Code Points). It makes no sense to decode such a string. Go straight to the encode.
use Encode qw(encode);
my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65"; # éêcole
# -or-
use utf8; # Script is encoded using UTF-8.
my $s_ucp = "éêcole";
printf "%vX\n", $s_ucp; # E9.EA.63.6F.6C.65
my $s_ucs2be = encode('UCS-2', $s_ucp);
printf "%vX\n", $s_ucs2be; # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
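The same "decode once, encode once" rule can be sketched in Python, which is used here only as an illustration. The raw bytes stand in for what a connection without automatic decoding might return (an assumption), and utf-16-be is used because it is byte-for-byte identical to big-endian UCS-2 for characters in the Basic Multilingual Plane:

```python
# Bytes as they might arrive from a UTF-8 connection (assumed for illustration).
raw_from_db = "éêcole".encode("utf-8")

# Decode exactly once: bytes -> text (a sequence of code points).
text = raw_from_db.decode("utf-8")

# Encode once for the wire. For BMP characters, big-endian UTF-16
# produces the same bytes as big-endian UCS-2.
ucs2_be = text.encode("utf-16-be")

print(".".join(f"{b:X}" for b in ucs2_be))  # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
```

Python 3 makes the double-decode mistake hard to write, because a str has no decode method; in Perl the second decode silently operates on already-decoded text.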
SET NAMES says the encoding you have/want in the client. That is, regardless of the encoding in the table, MySQL will convert it to whatever SET NAMES says during a SELECT.
So, feed what comes from the SELECT directly to SMPP. (It won't be readable by most other clients.)
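SET NAMES ucs2
(The collation is irrelevant to the encoding. Be aware that the MySQL manual lists ucs2 among the character sets that cannot be used as a client character set, so this statement is expected to be rejected.)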
You could ask the SELECT to convert with something like
CONVERT(col_name, CHAR UNICODE)
https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html
My data contains special characters like German umlauts.
p=structure(list(ppl_code = c(992621L, 992381L, 992136L, 991989L,
991898L, 991759L, 991681L, 991593L, 991294L, 991036L, 990934L,
990751L, 990535L, 990411L, 990182L, 989507L), proj_name = c("klo",
"Dalbygda", "Oosterhorn", "Hån", "Yatir", "Montigny la Cour",
"Valle Hermoso", "Acciona Honawad - 120 MW", "Apfeltrang", "RiaBlades",
"General Acha", "Lindau-Böhlitz", "Apfeltrang", "Alcazar Round 2",
"Peckelsheim", "Linnich 3")), .Names = c("ppl_code", "proj_name"
), row.names = 15:30, class = "data.frame")
When I try to write it to a MySQL database:
conn <- dbConnect(
drv = RMySQL::MySQL(),
dbname = "mydb",
host = "#####",
username = "#####",
password = "#####")
dbWriteTable(conn, value = p, name = "MyTable",row.names=FALSE)
I get this encoding error:
could not run statement: Invalid utf8 character string: 'Lindau-B'
I have checked several posts about this issue, like here and here, but they are all general explanations without a clear solution.
Can anybody help me with a clear query that solves this issue?
You need to announce that UTF-8 is being used.
In RStudio: Tools -> Global Options -> Code -> Saving, and set the default text encoding to UTF-8. Then tell the MySQL connection the same:
rs <- dbSendQuery(con, 'SET NAMES utf8')
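The error message stops at 'Lindau-B' because the byte that follows is not valid UTF-8. Here is a small Python sketch of that situation, assuming the session handed the string over in a single-byte encoding such as Latin-1 (an assumption; the question does not state the session encoding):

```python
name = "Lindau-Böhlitz"

# Assumed: the client sends the string in a single-byte encoding (Latin-1 here).
sent = name.encode("latin-1")

try:
    sent.decode("utf-8")
    valid_prefix = name
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that is not valid UTF-8.
    valid_prefix = sent[:err.start].decode("utf-8")

print(valid_prefix)  # Lindau-B

# Encoding as UTF-8 before sending yields bytes a utf8 connection can accept.
utf8_bytes = name.encode("utf-8")
```

Either the bytes have to be UTF-8, or the connection has to be told which encoding the bytes really use; the two just have to agree.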
I have been trying to export a large pandas dataframe using DataFrame.to_sql to a MySQL database, but the dataframe has unicode characters in some columns, some of which cause warnings during export and are converted to ?.
I managed to reproduce the issue with this example (database login removed):
import pandas as pd
import sqlalchemy
import pymysql
engine = sqlalchemy.create_engine('mysql+pymysql://{}:{}@{}/{}?charset=utf8'.format(*login_info), encoding='utf-8')
df_test = pd.DataFrame([[u'\u010daj',2], \
['čaj',2], \
['špenát',4], \
['květák',7], \
['kuře',1]], \
columns = ['a','b'])
df_test.to_sql('test', engine, if_exists = 'replace', index = False, dtype={'a': sqlalchemy.types.UnicodeText()})
The first two rows of the dataframe should be the same, just defined differently.
I get the following warning, and the problematic characters (č, ě, ř) are rendered as ?:
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x8Daj' for column 'a' at row 1")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x8Daj' for column 'a' at row 2")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC4\\x9Bt\\xC3\\xA1k' for column 'a' at row 4")
result = self._query(query)
/usr/local/lib/python3.6/site-packages/pymysql/cursors.py:166: Warning: (1366, "Incorrect string value: '\\xC5\\x99e' for column 'a' at row 5")
result = self._query(query)
with the resulting database table test looking like this:
a b
?aj 2
?aj 2
špenát 4
kv?ták 7
ku?e 1
Curiously, the ž, š and á characters (and others in my full dataset) are processed correctly, so it seems to only affect a subset of unicode characters. As you can see above, I also tried setting utf-8 wherever I could (engine, DataFrame.to_sql) with no effect.
pymysql:
import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
user='root', passwd='******',
charset="utf8mb4")
sqlalchemy:
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
database=db_schema,
query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })
See "Best practice" in http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored
Explanation of ?:
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this.
Also, check that the connection during reading is UTF-8.
(Note: The CHARACTER SETs utf8 and utf8mb4 are interchangeable for European languages.)
These are Czech characters?
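The pattern of question marks in the question is what you would get if the text passed through a cp1252-style character set somewhere (MySQL's latin1 is based on cp1252), which has š, ž and á but not č, ě or ř. That is only one possible explanation, not a confirmed diagnosis. A rough Python sketch with a hypothetical helper name:

```python
words = ["čaj", "špenát", "květák", "kuře"]

def through_cp1252(text):
    """Round-trip text through cp1252, substituting '?' for characters it lacks."""
    return text.encode("cp1252", errors="replace").decode("cp1252")

results = {w: through_cp1252(w) for w in words}
for original, stored in results.items():
    print(original, "->", stored)
```

If the output matches what ends up in the table, checking the column's character set with SHOW CREATE TABLE is a reasonable next step.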
I ran into the same problem, also using the pymysql driver.
After I switched the MySQL driver to mysql-connector, the 1366 warning disappeared.
Install the mysql-connector driver:
pip install mysql-connector
Then set up the SQLAlchemy engine like this:
create_engine('mysql+mysqlconnector://root:tj1996@localhost:3306/new?charset=utf8mb4')
I am having an issue running a basic query on a sample dataset (link below)
http://kbcdn.tableausoftware.com/data/Superstore.xls
using R.
I have attached my code below.
#read file with XLConnect
path <- file.path("/Users/petergensler/Desktop/Sample - Superstore Sales.xls")
superstore <- readWorksheetFromFile(path, sheet= "Orders")
#Query
test <- sqldf("SELECT * FROM superstore WHERE 'Product Sub-Category' = 'Appliances'",)
test
The query executes fine, but it returns the following results:
[1] Row.ID Order.ID Order.Date Order.Priority Order.Quantity Sales
[7] Discount Ship.Mode Profit Unit.Price Shipping.Cost Customer.Name
[13] Province Region Customer.Segment Product.Category Product.Sub.Category Product.Name
[19] Product.Container Product.Base.Margin Ship.Date
<0 rows> (or 0-length row.names)
Is there something wrong with my attached packages that would cause the query to run incorrectly, or is it something with my data? The column I am querying seems to be fine, as it is of type character, and specifying a literal string should match the values (unless there is trailing whitespace), correct?
I am running R on Mac OS X 10.11.5 with the following session info:
session_info()
Session info -------------------------------------------------------------------------------------------
setting value
version R version 3.3.0 (2016-05-03)
system x86_64, darwin13.4.0
ui RStudio (0.99.896)
language (EN)
collate en_US.UTF-8
tz America/Chicago
date 2016-06-08
I have also attached the list of packages loaded in the current session.
https://drive.google.com/open?id=0Bxhxg_yftHNubEc4NUZTUVoxa0E
Thanks for your help!
Following up on what G. Grothendieck said (use brackets for column names), I ran this and it worked for me:
#Query
test <- sqldf(x = "SELECT * FROM superstore WHERE [Product.Sub.Category] = 'Appliances'")
test
Most methods of reading data into data frames change spaces and hyphens in column names into dots (.), so the column name in the query has to be updated to match.
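There is a second reason the original query returned nothing: in SQL, single quotes produce a string literal, so the WHERE clause compared two different strings instead of looking at the column. sqldf uses SQLite as its default backend, so the effect can be reproduced with Python's sqlite3 module. The two rows below are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE superstore ("Product.Sub.Category" TEXT, "Sales" REAL)')
con.executemany(
    "INSERT INTO superstore VALUES (?, ?)",
    [("Appliances", 10.0), ("Paper", 2.5)],  # made-up sample rows
)

# Single quotes make a string literal: this compares two different strings,
# so the condition is false for every row.
literal_rows = con.execute(
    "SELECT * FROM superstore WHERE 'Product Sub-Category' = 'Appliances'"
).fetchall()

# Brackets (or double quotes) make an identifier: this compares the column.
column_rows = con.execute(
    "SELECT * FROM superstore WHERE [Product.Sub.Category] = 'Appliances'"
).fetchall()

print(len(literal_rows), len(column_rows))  # 0 1
con.close()
```

That is why the query "executes fine" yet returns zero rows: it is valid SQL, it just never references the column.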
I want to insert a tab-delimited file that contains both Japanese and English text with special characters. I am using RMySQL to do this. One of the solutions I tried gives the error below:
dbWriteTable(con, "japan_test2", d, append = T, row.names=FALSE);
Error in mysqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '˜¨å¤œã®ã‚³ãƒ³_ text)' at line 3)
In addition: Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
[1] FALSE
Warning message:
In mysqlWriteTable(conn, name, value, ...) :
could not create table: aborting mysqlWriteTable
Current Locale: LC_COLLATE=English_United States.1252;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
Locale Tried: US, Japanese.
Encoding Tried: UTF-8,16,ASCII.
System: Windows7
RStudio Version 0.98.977
MySQL 5.4.27 CE
You probably aren't setting the encoding of the connection properly. You can try this:
con <- dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port)
# With the next line I try to get the right encoding (it works for Spanish keyboards)
encoding <- if(grepl(pattern = 'utf8|utf-8',x = Sys.getlocale(),ignore.case = T)) 'utf8' else 'latin1'
dbGetQuery(con,paste("SET names",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_server=",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_database=",encoding))
dbWriteTable( con, value = dfr, name = table, append = TRUE, row.names = FALSE )
dbDisconnect(con)
Remember that the connection encoding has to match your local encoding. The third line of the proposed code detects my local encoding, and the following lines set the connection encoding accordingly. Good luck!
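Note that the detection above only distinguishes UTF-8 from everything else, so it would pick latin1 for the Japanese_Japan.932 locale shown in the question. A slightly broader version of the same idea might look like the Python sketch below. The mapping table and the helper name are hypothetical and cover only a few encodings, and the latin1 fallback simply mirrors the answer's default:

```python
import codecs
import locale

# Illustrative mapping from Python codec names to MySQL character set names.
# This table is an assumption for the sketch; extend it for other locales.
_MYSQL_CHARSETS = {
    "utf-8": "utf8mb4",
    "cp1252": "latin1",      # MySQL's latin1 is based on cp1252
    "iso8859-1": "latin1",
    "cp932": "cp932",        # Japanese Windows code page
}

def mysql_charset_for(encoding_name):
    """Return a MySQL character set name for a locale encoding (hypothetical helper)."""
    try:
        canonical = codecs.lookup(encoding_name).name
    except LookupError:
        return "latin1"  # same default as the R snippet above
    return _MYSQL_CHARSETS.get(canonical, "latin1")

session_charset = mysql_charset_for(locale.getpreferredencoding(False))
set_names_sql = "SET NAMES " + session_charset
print(set_names_sql)
```

The resulting statement would be sent right after connecting, in the same place as the SET names call above.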