SSIS MSSQL to MySQL with different collation is not copying Finnish letter å - mysql

I don't think the title can describe this any better as a tl;dr, because the problem goes a bit deeper.
I've got two databases (Finnish language):
MSSQL (collation: SQL_Latin1_General_CP437_CI_AI)
MySQL (collation: utf8_general_ci)
I've created a BI project in VS2017, connected the two databases, and transferred tables from one to the other without a problem, except for one letter: "å" arrived as "?". I cannot change the collation of either database, so I am trying to find a way to transfer words that contain this letter.
What I've tried:
OLE DB Source -> ODBC Destination
The same as item 1, with a "Data Conversion" block in between (with code page 1252)
Script Component, in which I have tried:
Insert with the "_latin1" introducer
sql= "INSERT INTO db.words(Name) VALUES(_latin1'å')";
byte[] b = Encoding.UTF8.GetBytes(sql);
odbcCmd = new OdbcCommand(Encoding.UTF8.GetString(b), odbcConn);
odbcCmd.ExecuteNonQuery();
Insert without it
sql= "INSERT INTO db.words(Name) VALUES('å')";
byte[] b = Encoding.UTF8.GetBytes(sql);
odbcCmd = new OdbcCommand(Encoding.UTF8.GetString(b), odbcConn);
odbcCmd.ExecuteNonQuery();
Different ways of encoding
byte[] bytes = Encoding.GetEncoding(1252).GetBytes("å");
var myString = Encoding.GetEncoding(1252).GetString(bytes);
byte[] bytes2 = Encoding.Default.GetBytes("å");
var myString2 = Encoding.Default.GetString(bytes2);
Insert with COLLATE, which gave me an error
insert into db.words(Name) values ("å" COLLATE latin1_swedish_ci) ;
and error:
System.Data.Odbc.OdbcException: „ERROR [HY000] [MySQL][ODBC 5.3(a) Driver][mysqld-5.7.21-log]COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'cp1250'”
Here is the interesting part:
I can insert this letter in MySQL Workbench without a problem, and it is stored correctly, but when I try to pass it from one database to the other it is lost. I set Data Viewers after the Data Conversion block and the letter was still there; when debugging the script, it was also still present, after encoding, in the string that was inserted into the database.
Maybe someone has an idea what else I can try. I feel like I have tried everything and that the solution is really close, but I just don't see it.

CP1250 does not include å; CP437 and utf8 do include it.
COLLATE is irrelevant -- it applies only to comparing and sorting.
Don't use any encode/conversion functions; instead, specify how the data is encoded.
I see 'code' -- but what is the encoding for the source in that language and/or editor?
Show us the hex of any strings in question.
Which direction are you trying to transfer?
What are the connection parameters for each database?
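The first point, and the request for hex, can be checked without touching either database. A minimal sketch in Python (not the C# of the question; the function name encode_report is made up for illustration) that reports the hex of å under each code page mentioned in this thread:

```python
def encode_report(ch, codecs=("cp437", "cp1252", "cp1250", "utf-8")):
    """Return {codec: hex string, or None if the codec cannot encode ch}."""
    report = {}
    for name in codecs:
        try:
            report[name] = ch.encode(name).hex()
        except UnicodeEncodeError:
            # The code page has no slot for this character.
            report[name] = None
    return report


report = encode_report("\u00e5")  # U+00E5 is å
for name, hex_value in report.items():
    print(name, hex_value)
```

A None entry for cp1250 means that a connection negotiating cp1250, as the error message in the question suggests, has no lossless way to carry the letter.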

Related

Incorrect string value - MySql

I have a problem with MySQL.
My MySQL version is 5.7.33 - MySQL Community Server (GPL).
I have created a Discord bot in node.js, and I get an error when a new user has a username like this: legoshi🌌🌧
So I tried to follow this topic: How to fix "Incorrect string value" errors?
So I converted my database to utf8mb4_unicode_ci.
And the error is still there.
In the beginning my database was in utf8, and I had the error then too.
code: 'ER_TRUNCATED_WRONG_VALUE_FOR_FIELD',
errno: 1366,
sqlMessage: "Incorrect string value: '\\xF0\\x9F\\x8C\\x8C\\xF0\\x9F...' for column 'user' at row 1",
sqlState: 'HY000',
index: 0,
sql: 'INSERT INTO registre (id, user, autohit, ultimate, platinium, `Date Inscription`) VALUES (210490816542670849, "legoshi🌌🌧", 0, 0, 0, CURRENT_TIMESTAMP())'
}
So I don't know how to fix this. I have seen a lot of topics, and all of them seem to be fixed with utf8mb4_unicode_ci, but not in my case.
Thanks for your help.
In MySQL, there are several places where you can set up a character set:
On the server level
On the database level
On the table level (for each table)
On the field level for all character-based fields
On your connection (telling the server what charset will be used in packets you send to the server)
Basically, server-level, database-level and table-level are just defaults for newly created items: New databases are generated with the server's default. New tables are created with the database's default, new fields are created with the table's default. However, only the field-level charset is what actually counts.
So first, you should make sure that the fields you want to store the data in actually are set up to utf8mb4_unicode_ci. Then, you need to connect to the server using exactly the same charset. Be aware that also the collation should match.
You can find out what character set is in use by issuing the following query:
SHOW VARIABLES LIKE 'character_set_%'
You'll see several variables indicating which default is set for each scope. Pay particular attention to character_set_client and character_set_connection. If the connection does not have the correct character set specified, you need to set it when connecting.
It's a good practice to have all character sets match identically. Mixed values will sooner or later cause trouble.
To check the character set which is set up for the field, have it displayed with the command
SHOW CREATE TABLE registre
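The bytes quoted in the error message are worth decoding, since they show why the field-level charset matters. A small sketch in plain Python, with no database involved (needs_utf8mb4 is a hypothetical helper name), that checks whether a string contains characters above U+FFFF. Those take 4 bytes in UTF-8, which MySQL's 3-byte utf8 charset cannot store:

```python
def needs_utf8mb4(text):
    """True if any character needs 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# The first four bytes quoted in the error message above.
first_char = b"\xF0\x9F\x8C\x8C".decode("utf-8")

# The username from the question, written with escapes.
username = "legoshi\U0001F30C\U0001F327"
print(hex(ord(first_char)), needs_utf8mb4(username))
```

If this returns True for the value being inserted, both the column and the connection have to be utf8mb4 for the insert to go through.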

Have Python's string formatting changes in recent versions broken the MySQL connector?

I'm writing a simple - or it should be simple - script to acquire tweets from Twitter's API (I have developer/app keys and am using the Tweepy interface, not scraping or anything of that sort - I may ditch Tweepy for something closer to the modern API but that is almost certainly not what's causing this issue here).
I have a MySQL instance which I connect to and can query just fine, until it comes time to insert the tweet - which has a lot of special characters, almost inevitably. To be clear, I am using the official Python driver/connector for MySQL.
import mysql.connector
from mysql.connector import errorcode
Now, I'm aware StackOverflow is LITTERED with threads where people get my exact error - simply stating to check the MySQL syntax manual. These threads, which aren't all that old (and I'm not using the latest Python, I use 3.7.9 for compatibility with some NLP libraries) insist the answer is to place the string that has the special characters into an old-style format string WITHIN the cursor.execute method, to enclose string variable placeholders in quotes, and to pass a tuple with an empty second value if, as in my case, only one variable is to be inserted. This is also a solution posted as part of a bug report response on the MySQL website - and yet, I have no success.
Here's what I've got - following the directions on dozens of pages here and the official database website:
for tweet in tweepy.Cursor(twilek.search, q=keyword, tweet_mode='extended').items():
    twi_tweet = tweet.full_text
    print(twi_tweet)
    twi_tweet = twi_tweet.encode('utf8')
    requests_total+=1
    os.environ['TWITTER_REQUESTS'] = str(requests_total)
    requests_total = int(os.environ.get('TWITTER_REQUESTS'))
    # insert the archived tweet text into the database table
    sql = 'USE hate_tweets'
    ms_cur.execute(sql)
    twi_tweet = str(twi_tweet)
    insert_tweet = re.sub(r'[^A-Za-z0-9 ]+', '', twi_tweet)
    ms_cur.execute("INSERT INTO tweets_lgbt (text) VALUES %s" % (insert_tweet,))
    cnx.commit()
    print(ms_cur.rowcount, "record inserted.")
(twilek is my cursor object because I'm a dork)
expected result: string formatter passes MySQL a modified tweet string that it can process and add as a row to the tweets_lgbt table
actual result: insertion fails on a syntax error for any tweet
I've tried going so far as to use regex to strip everything but alphanumeric and spaces - same issue. I'm wondering if the new string format features of current Python versions have broken compatibility with this connector? I prefer to use the official driver but I'll switch to an ORM if I must. (I did try the newer features like F strings, and found they caused the same result.)
I have these observations:
the VALUES clause requires parentheses VALUES (%s)
the quoting / escaping of values should be delegated to the cursor's execute method, by using unquoted placeholders in the SQL and passing the values as the second argument: cursor.execute(sql, (tweet_text,)) or cursor.executemany(sql, [(tweet_text1,), (tweet_text2,)])
once these steps are applied there's no need for encoding/stringifying/regex-ifying: assuming twi_tweet is a str and the database's charset/collation supports the full UTF-8 range (for example utf8mb4) then the insert should succeed.
in particular, encoding a str and then calling str on the result is to be avoided: you end up with "b'my original string'"
This modified version of the code in the question works for me:
import mysql.connector
DDL1 = """DROP TABLE IF EXISTS tweets_lgbt"""
DDL2 = """\
CREATE TABLE tweets_lgbt (
`text` VARCHAR (256))
"""
# From https://twitter.com/AlisonMitchell/status/1332567013701500928?s=20
insert_tweet = """\
Particularly pleased to see #SarahStylesAU
quoted in this piece for the work she did
👌
Thrive like a girl: Why women's cricket in Australia is setting the standard
"""
# Older connector releases don't support with...
with mysql.connector.connect(database='test') as cnx:
    with cnx.cursor() as ms_cur:
        ms_cur.execute(DDL1)
        ms_cur.execute(DDL2)
        ms_cur.execute("INSERT INTO tweets_lgbt (`text`) VALUES (%s)", (insert_tweet,))
        cnx.commit()
        print(ms_cur.rowcount, "record inserted.")
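To see why the approach in the question fails for every tweet, it helps to look at the statement it actually builds. A sketch with no database involved, using a shortened sample text chosen for illustration:

```python
import re

tweet = "Thrive like a girl \U0001F44C"

# What the question's code does: encode, stringify the bytes, strip symbols.
stringified = str(tweet.encode("utf8"))  # the repr of a bytes object
cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", stringified)
statement = "INSERT INTO tweets_lgbt (text) VALUES %s" % (cleaned,)

print(stringified)
print(statement)
```

The value ends up as bare words after VALUES, with a stray leading "b" and the emoji reduced to leftover hex digits, so the server reports a syntax error regardless of the Python version.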
This is how you should insert a row into your table:
insert_tweet = "ABCEFg 9 XYZ"
"INSERT INTO tweets_lgbt (text) VALUES ('%s');"%(insert_tweet)
"INSERT INTO tweets_lgbt (text) VALUES ('ABCEFg 9 XYZ');"
Things to note
The arguments to a string formatter are just like the arguments to a
function. A trailing comma, as in (insert_tweet,), only makes a one-element
tuple; it does not add the parentheses or quotes that the SQL statement itself needs.
If you are trying to insert multiple values at once, you can use cursor.executemany or this answer.
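One thing to keep in mind with quoted string formatting is that it breaks as soon as the text itself contains a quote, which tweets often do. The sketch below uses Python's built-in sqlite3 purely as a stand-in so that it runs without a MySQL server; that swap is an assumption for illustration, and sqlite3 writes placeholders as ? where mysql.connector uses %s:

```python
import sqlite3

tweet = "Why women's cricket is setting the standard \U0001F44C"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets_lgbt (text TEXT)")

# Placeholder form: the driver handles quoting and escaping.
conn.execute("INSERT INTO tweets_lgbt (text) VALUES (?)", (tweet,))
stored = conn.execute("SELECT text FROM tweets_lgbt").fetchone()[0]

# Formatted form: the apostrophe in the tweet ends the SQL string early.
formatted = "INSERT INTO tweets_lgbt (text) VALUES ('%s')" % tweet
try:
    conn.execute(formatted)
    formatting_failed = False
except sqlite3.OperationalError:
    formatting_failed = True

print(stored == tweet, formatting_failed)
conn.close()
```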

RMySQL encoding issue on Windows - Spanish Character ñ

While using RMySQL::dbWriteTable function in R to write a table to MySQL on Windows I get an error message concerning the character [ñ].
The simplified example is:
table <- data.frame(a=seq(1:3), b=c("És", "España", "Compañía"))
table
  a        b
1 1       És
2 2   España
3 3 Compañía
db <- dbConnect(MySQL(), user = "####", password = "####", dbname ="test", host= "localhost")
RMySQL::dbWriteTable(db, name="test1", table, overwrite=T, append=F )
Error in .local(conn, statement, ...) :
could not run statement: Invalid utf8 character string: 'Espa'
As you can see, there is no problem with the accents ("És") but there is with the ñ character ("España").
On the other hand, there is no problem with MySQL since this query works fine:
INSERT INTO test.test1 (a,b)
values (1, "España");
Things I have already tried before writing the table:
Encoding(x) <- "UTF-8" for the whole table.
iconv(x, "UTF-8", "UTF-8") for the whole table.
Sent a pre-query: dbSendQuery(db, "SET NAMES UTF8;")
Changed the MySQL table collation to utf-8-general, latin-1, latin-1-spanish...
*Tried "Latin-1" encoding and it didn't work either.
I have been looking for an answer to this question for a while with no luck.
Please help!
Versions:
MySQL 5.7.17
R version 3.3.0
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=C"
PS: Works fine in Linux environment but I am stuck with Windows in my current project :(
In the end, it looks like the problem was the encoding setup of the connection. By default my connection was set to utf-8, but my local encoding was set to latin1. Therefore, my final solution was:
con <- dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port)
# With the next line I try to get the right encoding (it works for Spanish keyboards)
encoding <- if(grepl(pattern = 'utf8|utf-8',x = Sys.getlocale(),ignore.case = T)) 'utf8' else 'latin1'
dbGetQuery(con,paste("SET names",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_server=",encoding))
dbGetQuery(con,paste0("SET SESSION character_set_database=",encoding))
dbWriteTable( con, value = dfr, name = table, append = TRUE, row.names = FALSE )
dbDisconnect(con)
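The mismatch described above, latin1 bytes sent over a connection declared as utf-8, can be reproduced without a database. A minimal sketch in Python rather than R; that the server validates roughly the way Python's UTF-8 decoder does is an assumption:

```python
raw = "Espa\u00f1a".encode("latin-1")  # bytes a latin1 client would send

try:
    raw.decode("utf-8")
    valid_prefix = None
except UnicodeDecodeError as exc:
    # Everything before exc.start decoded cleanly.
    valid_prefix = raw[:exc.start].decode("ascii")

print(raw.hex(), valid_prefix)
```

The readable prefix stops right before the ñ byte, which lines up with the 'Espa' shown in the error message in the question.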
This works for me in Windows:
write.csv(table, file = "tmp.csv", fileEncoding = "utf8", quote = FALSE, row.names = FALSE)
db <- dbConnect(MySQL(), user = "####", password = "####", dbname ="test", host= "localhost")
dbWriteTable( db, value = "tmp.csv", name = "test1", append = TRUE, row.names = FALSE, sep = ",", quote='\"', eol="\r\n")
I ran into this problem with a data table of about 60 columns and 1.5 million rows; there were many computed values and reconciled and corrected dates and times so I didn't want to reformat anything I didn't have to reformat. Since the utf-8 issue was only coming up in character fields, I used a kludgy-but-quick approach:
1) copy the field list from the dbWriteTable statement into a word processor or text editor
2) on your copy, keep only the fields that have descriptions as VARCHAR and TEXT
3) strip those fields down to just field names
4) use paste0 to write a character vector of statements that will ensure all the fields are character fields:
dt$x <- as.character(dt$x)
5) then use paste0 again to write a character vector of statements that set the encoding to UTF-8
Encoding(dt$x) <- "UTF-8"
Run the as.character group before the Encoding group.
It's definitely a kludge and there are more elegant approaches, but if you only have to do this now and then (as I did), then it has three advantages:
1) it only changes what needs changing (important when, as with my project, there is a great deal of work already in the data table that you don't want to risk in a reformat),
2) it doesn't require a lot of space and read/writes in the intermediate stage, and
3) it's fast to write and runs at an acceptable speed for at least the size of data table I'm working with.
Not elegant, but it will get you over this particular hitch very quickly.
The function dbConnect() has a parameter called encoding that lets you set up the connection encoding easily.
dbConnect(MySQL(), user=user, password=password,dbname=dbname, host=host, port=port, encoding="latin1")
This has allowed me to insert "ñ" characters into my tables and also to insert data into columns that have "ñ" in their name. For example, I can insert data into a column named "año".

Using UTF and Hindi in CakePHP and MySQL

I've created a form that contains Hindi (UTF-8) data, which I want to store in a MySQL table. The columns holding the UTF data have their collation set to utf8_general_ci.
I've successfully stored the data in the table, but when I execute a select-where query, it doesn't return any data. Here is my query:
SELECT Birth.sno, Birth.bookingnumber, Birth.birth_date, Birth.baby_gender, Birth.baby_name, Birth.baby_father_name, Birth.baby_father_address, Birth.baby_mother_name, Birth.birth_place, Birth.place_type, Birth.applicant_name, Birth.applicant_address, Birth.registration_number, Birth.registration_date, Birth.registration_ward, Birth.registration_city_village, Birth.registration_district, Birth.remark, Birth.mother_place_name, Birth.mother_place_type, Birth.mother_place_district, Birth.mother_place_state, Birth.person_religion, Birth.father_education, Birth.mother_education, Birth.father_occupation, Birth.mother_occupation, Birth.mother_age_at_marriage, Birth.mother_age_at_birth, Birth.count_of_mother_child, Birth.birth_by, Birth.birth_method, Birth.mother_weight_at_birth, Birth.pregnancy_duration, Birth.date_of_issue FROM np.births AS Birth WHERE Birth.baby_name = 'd' AND Birth.baby_father_name = 'e' AND Birth.baby_mother_name = 'f' AND Birth.baby_father_address = 'g' AND Birth.person_religion = 'हिंदू' AND Birth.baby_gender = 'पुरुष'
The name of the database is np and name of the table is births
The above query was printed in the log file. I tried to copy and paste the same query into HeidiSQL (a front end for MySQL), but it doesn't run there either. However, if I remove the following part: AND Birth.person_religion = 'हिंदू' AND Birth.baby_gender = 'पुरुष', the query works fine.
How can I resolve this issue?
This looks like a case when your MySQL client and your MySQL server do not "talk" the same encoding.
There are 3 places where you need to take care of your encoding.
The Web Form (what the users sees) -> Your Web Application (CakePHP) -> Your Database Server (MySQL)
One of those three is NOT using the same encoding as the others. So by the time "'हिंदू'" and "'पुरुष'" get to your database, they will be something totally different that will not be found in the database.
So, make sure that in your default.ctp file you have set your encoding:
echo $this->Html->charset(); //this will result in a UTF-8 encoding of the page.
Look at the source code of your web page (where I guess you have a search/filter form).
At the top you should see:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Then look for the code generated for your search/filter form. You should see:
<form id="your_form_id" accept-charset="utf-8" method="post" action="/your/action/">
The important part is that "utf-8" MUST show up in those places.
Next, look into your database.php file and make sure this line:
'encoding' => 'utf8', is NOT commented out!
Finally, with a client that you are sure supports UTF-8 (probably HeidiSQL) have a look at your data table np.births and make sure that what data you have there actually makes sense! It's possible it got mangled because of the discrepancies in encoding before.
Once the data makes sense in the database you should be good to go!
If this does not do it for you, you'll have to read and thoroughly understand this article. Only then will you be able to locate where the problem is and get your encodings in sync.
(Obviously your PHP source files should be UTF-8 encoded as well...)
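What "something totally different" means can be shown in a few lines. A sketch in Python rather than PHP, assuming for illustration that one link in the chain reads the UTF-8 bytes as latin1:

```python
religion = "\u0939\u093f\u0902\u0926\u0942"  # the search term 'हिंदू' as escapes

sent = religion.encode("utf-8")    # what a UTF-8 form submits
misread = sent.decode("latin-1")   # what a latin1 link believes it received

print(len(religion), len(sent), len(misread))
print(misread == religion)

# The damage is reversible only while the raw bytes are still intact.
restored = misread.encode("latin-1").decode("utf-8")
```

The misread value has one character per byte and no longer equals the stored text, so a WHERE comparison against it finds nothing.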

How to display unicode in MySQL result?

http://www.sqlfiddle.com/#!2/82f65/1
I tried this:
create table x(y varchar(100) character set utf8);
insert into x(y) values('爱');
But the Chinese character doesn't appear:
select y from x;
Output:
Y
?
I'm the author of sqlfiddle.com. The problem was that I didn't have my connection string and default database encoding for mysql setup to properly handle UTF8. I have fixed this now, but because the fiddle you posted is still using the obsolete settings, you'll have to see it working here on my slightly-modified version of your fiddle:
http://www.sqlfiddle.com/#!2/e79e8/1
Your link might start working eventually, it just needs to clear out of the running memory and be reset. After no one hits it for a while it should be harvested and then ready to be built back up cleanly. Thanks!
FYI, the changes I had to make to get it to work were found here: http://www.compoundtheory.com/?action=displayPost&ID=421
The relevant bits were adding this to my connection string from Java:
useUnicode=true&characterEncoding=UTF-8
And adding this to my create database statement:
create database my_new_database default CHARACTER SET = utf8 default COLLATE = utf8_general_ci;
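A "?" in place of a character usually means it passed through a charset that cannot represent it. The sketch below imitates that in Python; using errors="replace" with latin-1 as a stand-in for what a misconfigured connection does is an assumption, not a description of the fiddle's actual internals:

```python
original = "\u7231"  # 爱

# errors="replace" substitutes '?' for anything the target charset lacks.
lossy = original.encode("latin-1", errors="replace")
print(lossy)

# UTF-8 carries the character intact.
intact = original.encode("utf-8").decode("utf-8")
print(intact == original)
```

Once the substitution has happened the original character is gone, which is why the fix has to be applied to the connection and database settings rather than to the query result.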
It works fine in MySQL on my localhost, so it may be due to the MySQL charset or some other setting. Please check it.
If you have to run this query from a program such as PHP, run this query before the SELECT query:
"SET NAMES utf8"
It will then return the result properly.
Thanks
The Chinese character is not displayed in the fiddle, but in an actual MySQL database it works fine. Please check your MySQL version.