RMySQL encoding issue on Windows - Spanish character ñ

While using RMySQL::dbWriteTable function in R to write a table to MySQL on Windows I get an error message concerning the character [ñ].
The simplified example is:
table <- data.frame(a=seq(1:3), b=c("És", "España", "Compañía"))
table
a b
1 1 És
2 2 España
3 3 Compañía
db <- dbConnect(MySQL(), user = "####", password = "####", dbname ="test", host= "localhost")
RMySQL::dbWriteTable(db, name="test1", table, overwrite=T, append=F )
Error in .local(conn, statement, ...) :
could not run statement: Invalid utf8 character string: 'Espa'
As you can see, there is no problem with the accents ("És") but there is with the ñ character ("España").
On the other hand, there is no problem with MySQL since this query works fine:
INSERT INTO test.test1 (a,b)
values (1, "España");
Things I have already tried before writing the table:
Encoding(x) <- "UTF-8" for every column of the table.
iconv(x, "UTF-8", "UTF-8") for every column of the table.
Sent pre-query: dbSendQuery(db, "SET NAMES UTF8;")
Changed the MySQL table collation to utf8-general, latin-1, latin-1-spanish, etc.
Tried "Latin-1" encoding as well; it didn't work either.
I have been looking for an answer to this question for a while with no luck.
Please help!
Versions:
MySQL 5.7.17
R version 3.3.0
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=C"
PS: Works fine in Linux environment but I am stuck with Windows in my current project :(

In the end, it looks like it was a problem with the encoding setup of the connection. By default my connection was set up as utf-8 but my local encoding was set up as latin1. Therefore, my final solution was:
con <- dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port)
# With the next line I try to get the right encoding (it works for Spanish keyboards)
encoding <- if(grepl(pattern = 'utf8|utf-8',x = Sys.getlocale(),ignore.case = T)) 'utf8' else 'latin1'
dbGetQuery(con,paste("SET names",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_server=",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_database=",encoding))
dbWriteTable( con, value = dfr, name = table, append = TRUE, row.names = FALSE )
dbDisconnect(con)

This works for me on Windows:
write.csv(table, file = "tmp.csv", fileEncoding = "utf8", quote = FALSE, row.names = FALSE)
db <- dbConnect(MySQL(), user = "####", password = "####", dbname ="test", host= "localhost")
dbWriteTable( db, value = "tmp.csv", name = "test1", append = TRUE, row.names = FALSE, sep = ",", quote='\"', eol="\r\n")

I ran into this problem with a data table of about 60 columns and 1.5 million rows; there were many computed values and reconciled and corrected dates and times, so I didn't want to reformat anything I didn't have to. Since the utf-8 issue was only coming up in character fields, I used a kludgy-but-quick approach:
1) copy the field list from the dbWriteTable statement into a word processor or text editor
2) on your copy, keep only the fields that are described as VARCHAR or TEXT
3) strip those fields down to just the field names
4) use paste0 to write a character vector of statements that will ensure all the fields are character fields:
dt$x <- as.character(dt$x)
5) then use paste0 again to write a character vector of statements that set the encoding to UTF-8
Encoding(dt$x) <- "UTF-8"
Run the as.character group before the Encoding group; a minimal sketch of the whole approach follows below.
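Here is that sketch, for a data table dt; the names in charfields are hypothetical stand-ins for whatever field names step 3 gives you:
# hypothetical: the stripped-down VARCHAR/TEXT field names from step 3
charfields <- c("name", "description", "comments")
# step 4: statements that force each of those fields to character
as_char_stmts <- paste0("dt$", charfields, " <- as.character(dt$", charfields, ")")
# step 5: statements that set the encoding to UTF-8
enc_stmts <- paste0('Encoding(dt$', charfields, ') <- "UTF-8"')
# run the as.character group before the Encoding group
eval(parse(text = as_char_stmts))
eval(parse(text = enc_stmts))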
It's definitely a kludge and there are more elegant approaches, but if you only have to do this now and then (as I did), then it has three advantages:
1) it only changes what needs changing (important when, as with my project, there is a great deal of work already in the data table that you don't want to risk in a reformat),
2) it doesn't require a lot of space and read/writes in the intermediate stage, and
3) it's fast to write and runs at an acceptable speed, at least for the size of data table I'm working with.
Not elegant, but it will get you over this particular hitch very quickly.

The function dbConnect() has a parameter called encoding that can help you easily setup the connection encoding method.
dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port, encoding="latin1")
This has allowed me to insert "ñ" characters into my tables and also to insert data into columns that have "ñ" in their name. For example, I can insert data into a column named "año".
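For instance, adapting the example from the original question to this approach (a sketch; it assumes the encoding argument is honored as described above):
table <- data.frame(a=1:3, b=c("És", "España", "Compañía"))
con <- dbConnect(MySQL(), user="####", password="####", dbname="test",
                 host="localhost", encoding="latin1")  # set the connection encoding up front
dbWriteTable(con, name="test1", value=table, overwrite=TRUE, row.names=FALSE)
dbDisconnect(con)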

Related

Compromised safeguarding of data due to bad encoding usage?

I am using Jupyter & Python 3.6.4 via Anaconda.
I want to be able to process and store data from Python to a MySQL database.
The libraries I am using to do this are pymysql and sqlalchemy.
For now, I am testing this locally with WAMP (MySQL version: 5.7.21); later I will apply it to a remote server.
Database creation function:
def create_raw_mysql_db(host, user, password, db_name):
    conn = pymysql.connect(host=host, user=user, password=password)
    conn.cursor().execute('DROP DATABASE ' + db_name)
    conn.cursor().execute('CREATE DATABASE ' + db_name + ' CHARACTER SET utf8mb4')
Function to convert a DataFrame to a relational table in MySQL:
def save_raw_to_mysql_db(df, table_name, db_name, if_exists, username, password, host_ip, port):
    engine = create_engine("mysql+pymysql://" + username + ":" + password + "@" + host_ip + ":" + port + "/" + db_name + "?charset=utf8mb4")
    df.to_sql(name=table_name, con=engine, if_exists=if_exists, chunksize=10000)
The execution code:
#DB info & credentials
host = "localhost"
port = "3306"
user= "root"
password= ""
db_name= "raw_data"
exade_light_tb = "exade_light"
#A simple dataframe
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),columns=['a', 'b', 'c', 'd', 'e'])
create_raw_mysql_db(host,user,password,db_name)
save_raw_to_mysql_db(df,exade_light_tb,db_name,"replace",user,password,host,port)
The warning I receive when I run this code:
C:\Users.... : Warning: (1366, "Incorrect string value: '\x92\xE9t\xE9)' for column 'VARIABLE_VALUE' at row 481")
result = self._query(query)
From these threads: /questions/34165523/ questions/47419943 questions/2108824/, I could conclude the problem must be related to the utf8 charset, but I am using utf8mb4 to create my db and I am not using Django (which supposedly also needed to be configured according to questions/2108824/).
My questions :
How is this warning really impacting my data and its integrity?
How come, even though I changed the charset from utf8 to utf8mb4, it doesn't seem to solve the warning? Do I need to configure something further? In that case, what are the parameters I should keep in mind to apply the same configuration to my remote server?
How do I get rid of this warning?

RMySQL - dbWriteTable() - The used command is not allowed with this MySQL version

I am trying to read a few Excel files into a dataframe and then write it to a MySQL database. The following program is able to read the files and create the dataframe, but when it tries to write to the db using the dbWriteTable command, I get an error message -
Error in .local(conn, statement, ...) :
could not run statement: The used command is not allowed with this MySQL version
library(readxl)
library(RMySQL)
library(DBI)
mydb = dbConnect(RMySQL::MySQL(), host='<ip>', user='username', password='password', dbname="db",port=3306)
setwd("<directory path>")
file.list <- list.files(pattern='*.xlsx')
print(file.list)
dat = lapply(file.list, function(i){
  print(i)
  x = read_xlsx(i, sheet=NULL, range=cell_cols("A:D"), col_names=TRUE, skip=1, trim_ws=TRUE, guess_max=1000)
  x$file = i
  x
})
df = do.call("rbind.data.frame", dat)
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
dbDisconnect(mydb)
I checked the definition of the dbWriteTable function and it looks like it is using LOAD DATA LOCAL INFILE to store the data in the database. As per some other answered questions on Stack Overflow, I understand that the word LOCAL could be the cause for concern, but since it is already in the function definition, I don't know what I can do. Also, this statement uses "," as the separator, but my data has "," in some of the values, and that is why I was interested in using data frames, hoping they would preserve the source structure. But now I am not so sure.
Is there any other way/function to write the data frame to the MySQL tables?
I solved this on my system by adding the following line to the my.cnf file on the server (you may need to use root and vi to edit!). In my case this is just below the '[mysqld]' line:
local-infile=1
Then restart the server.
Good luck!
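If you can't edit my.cnf on the server, the client side can also request LOAD DATA LOCAL support. A sketch, assuming your RMySQL version exposes the client.flag argument and the CLIENT_LOCAL_FILES constant:
mydb = dbConnect(RMySQL::MySQL(), host='<ip>', user='username', password='password',
                 dbname="db", port=3306,
                 client.flag=CLIENT_LOCAL_FILES)  # allow LOAD DATA LOCAL INFILE from this client
dbWriteTable(mydb, name="table_name", value=df, append=TRUE, row.names=FALSE)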
You may need to change
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
to
dbWriteTable(mydb, name="table_name", value=df, field.types = c(artist="varchar(50)", song.title="varchar(50)"), row.names=FALSE, append=TRUE)
That way, you specify the field types in R and append data to your MySQL table.
Source: Unknown column in field list error Rmysql

SSIS MSSQL to MySQL with different collation is not copying Finnish letter å

I don't think the title could describe the problem better as a tl;dr, because the problem is a bit deeper.
I've got two databases (finnish language):
MSSQL (collation: SQL_Latin1_General_CP437_CI_AI)
MySQL (collation: utf8_general_ci)
I've created a BI project in VS2017, connected the two databases and transferred tables from one to the other, no problem. Except for one letter: "å" - instead it came out as "?". I cannot change either database's collation, so I am trying to find a way to transfer words with this letter.
What I've tried:
OLD DB Source -> ODBC Destination
Point "1" with "Data Conversion" block in between (with code page 1252)
Script Component, in which I have tried:
Insert with "_latin"
sql= "INSERT INTO db.words(Name) VALUES(_latin1'å')";
byte[] b = Encoding.UTF8.GetBytes(sql);
odbcCmd = new OdbcCommand(Encoding.UTF8.GetString(b), odbcConn);
odbcCmd.ExecuteNonQuery();
Insert without it
sql= "INSERT INTO db.words(Name) VALUES('å')";
byte[] b = Encoding.UTF8.GetBytes(sql);
odbcCmd = new OdbcCommand(Encoding.UTF8.GetString(b), odbcConn);
odbcCmd.ExecuteNonQuery();
Different ways of encoding
byte[] bytes = Encoding.GetEncoding(1252).GetBytes("å");
var myString = Encoding.GetEncoding(1252).GetString(bytes);
byte[] bytes2 = Encoding.Default.GetBytes("å");
var myString2 = Encoding.Default.GetString(bytes2);
Insert with COLLATE, which got me an error
insert into db.words(Name) values ("å" COLLATE latin1_swedish_ci) ;
and error:
System.Data.Odbc.OdbcException: „ERROR [HY000] [MySQL][ODBC 5.3(a) Driver][mysqld-5.7.21-log]COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'cp1250'”
Here is the interesting part:
I can do an insert with this letter in MySQL Workbench without a problem, and it will be inserted, but when I try to pass it from one database to the other it is lost. I've set Data Viewers after the Data Conversion step and the letter was still there; likewise, when debugging the script, the letter was still present after encoding in the string that was inserted into the database.
Maybe someone has an idea what else I can try, because I feel like I have tried everything. I feel the resolution of this problem is really close, but I just don't see it.
CP1250 does not include å; CP437 and utf8 do include it.
COLLATE is irrelevant -- it applies only to comparing and sorting.
Don't use any encode/conversion functions; instead, specify how the data is encoded.
I see 'code' -- but what is the encoding for the source in that language and/or editor?
Show us the hex of any strings in question.
Which direction are you trying to transfer?
What are the connection parameters for each database?

PostgreSQL multiple CSV import and add filename to each column

I've got 200k CSV files and I need to import them all into a single PostgreSQL table. It's a list of parameters from various devices, and each CSV's file name contains the device's serial number, which I need to be in one of the columns for each row.
So to simplify, I've got a few columns of data (no headers); let's say the columns in each CSV file are: Date, Variable, Value, and the file name contains SERIALNUMBER_and_someOtherStuffIDontNeed.csv
I'm trying to use Cygwin to write a bash script to iterate over the files and do it for me, however for some reason it won't work, showing 'syntax error at or near "as"'.
Here's my code:
#!/bin/bash
FILELIST=/cygdrive/c/devices/files/*
for INPUT_FILE in $FILELIST
do
  psql -U postgres -d devices -c "copy devicelist
  (
    Date,
    Variable,
    Value,
    SN as CURRENT_LOAD_SOURCE(),
  )
  from '$INPUT_FILE'
  delimiter ',' ;"
done
I'm learning SQL so it might be an obvious mistake, but I can't see it.
Also, I know that in that form I will get the full file name, not just the serial number bit I want, but I can probably handle that somehow later.
Please advise.
Thanks.
I don't think there is a CURRENT_LOAD_SOURCE() function in Postgres. A work-around is to leave the name column NULL on copy, and patch it to the desired value just after the copy. I prefer a shell here-document because that makes quoting inside the SQL body easier. (BTW: for 10K files, the globbing needed to obtain FILELIST might exceed argmax for the shell ...)
#!/bin/bash
FILELIST="`ls /tmp/*.c`"
for INPUT_FILE in $FILELIST
do
  echo "File:" $INPUT_FILE
  psql -U postgres -d devices <<OMG
  -- I have a schema "tmp" for testing purposes
  CREATE TABLE IF NOT EXISTS tmp.filelist(name text, content text);
  COPY tmp.filelist ( content )
  from '$INPUT_FILE' delimiter ',' ;
  UPDATE tmp.filelist SET name = '$INPUT_FILE'
  WHERE name IS NULL;
OMG
done
For anyone interested in an answer: I've used a Python script to change the file names and then another script using psycopg2 to connect to the database, doing everything in one connection. It took 10 minutes instead of 10 hours.
Here's the code:
Renaming files (also, apparently, to import from CSV you need all the rows to be filled, and the information I needed was in the first 4 columns anyway, so I put together a solution that generates whole new CSVs instead of just renaming them):
import os
import csv
path = 'C:/devices/files'
os.chdir(path)
i = 0
for file in os.listdir(path):
    try:
        i += 1
        if i % 10000 == 0:
            # just to see the progress
            print(i)
        serial_number = file[:8]
        creader = csv.reader(open(file))
        cwriter = csv.writer(open('processed_' + file, 'w'))
        for cline in creader:
            new_line = [val for col, val in enumerate(cline) if col not in (4, 5, 6, 7)]
            new_line.insert(0, serial_number)
            # print(new_line)
            cwriter.writerow(new_line)
    except:
        print('problem with file: ' + file)
        pass
Updating database:
import os
import psycopg2
path = "C:\\devices\\files"
directory_listing = os.listdir(path)
conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()
print(len(directory_listing))
i = 100001
while i < 218792:
    current_file = directory_listing[i]
    i += 1
    full_path = "C:/devices/files/" + current_file
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")
        conn.commit()
conn.close()
Don't mind the while loop and the weird numbers; that's just because I was doing it in portions for testing purposes. It can easily be replaced with a for loop.

Multibyte string error when writing in MySQL from R with RMySQL dbWriteTable

Each time I try to use the dbWriteTable function from the RMySQL package in R 3.0 (but also in R 2.15 before) I get this error: "Erreur dans tolower(avail) : chaîne de caractères multioctets incorrecte 45", which means something like "Error in tolower(avail) : invalid multibyte string 45". I can't find any solution, and I don't even understand where this error is generated.
Here are the facts: I work on Mac OS X 10.9.1 but had this error on 10.8.x, and I have it as well on Debian. MySQL and R are on the same machine (or not; it doesn't make any difference). For testing purposes I have created a table with only numerical values; I read its content with RMySQL (no problem), then try to reinject it into MySQL with dbWriteTable, and boom. Here's the R script:
#!/usr/bin/Rscript
library(DBI)
library(RMySQL)
conn <- dbConnect("MySQL", user="userr", password="passworrd", dbname="dbtest")
res <- dbSendQuery(conn, statement = paste("SELECT * FROM testable"))
input <- fetch(res, n = -1)
dbWriteTable(conn, "testable2", input, row.names = T, overwrite = FALSE, append = T)
dbDisconnect(conn)
The table content being :
id testval
1 1 76
2 2 47417
The user owns the DB. The fetch works fine but not the dbWriteTable. The error is probably related to some character coding but I can't figure out what.
I'm using R version 3.0.2 (2013-09-25), RMySQL 0.9-3, DBI 0.2-7 and MySQL 5.6.14, on Mac OS X 10.9.1.
I have the same issue on Rstudio server hosted on a Debian machine.
The MySQL log on Debian (I don't know where to find it on Mac) says:
- 140211 14:04:15 24 Connect userr#xxx.xxx.xxx.xxx on dbtest
- 140211 14:04:32 24 Query show tables
- 140211 14:04:52 24 Quit
So, my humble wish is that someone could put me on the right track!
Didier
Solution: it was a collation problem in the database. I created another DB with UTF-8 encoding (like the first one) but this time with collation utf8_general_ci instead of utf8_bin, and there it worked perfectly. Long question, short answer.
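For reference, a minimal sketch of how such a replacement database could be created from R, reusing the connection details from the question (the name dbtest2 is hypothetical):
library(DBI)
library(RMySQL)
conn <- dbConnect(MySQL(), user="userr", password="passworrd")
# create the new database with an explicit character set and the
# utf8_general_ci collation instead of utf8_bin
dbGetQuery(conn, "CREATE DATABASE dbtest2 CHARACTER SET utf8 COLLATE utf8_general_ci")
dbDisconnect(conn)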