Fixing Incorrect String Encoding From MySQL

I'm reading strings from a MySQL database which isn't set up for Unicode.
Ruby gets the string as ä¸ƒå¤§æ´‹ but I know the correct version should be 七大洋. The "wrong" string is tagged as UTF-8 because Ruby doesn't know it has it wrong. I've tried forcing every encoding on the mangled string, but nothing works. I have a feeling that I might be able to fix it by fiddling with the bits, but I don't even know where to start.
I don't think any information has been lost, because the incorrect string actually has more bytes than the correct one. I don't think Ruby is the culprit here, because the strings also look mangled when I view the table outside Ruby - so I'm hoping to undo the damage that MySQL has already done.
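A minimal Python sketch of the suspected damage, for illustration (the CP1252 detour it assumes is confirmed in the answers below): the original UTF-8 bytes were decoded as CP1252 somewhere and then re-encoded as UTF-8, which is exactly why the mangled string has more bytes than the correct one.
good = u'七大洋'
mangled = good.encode('utf-8').decode('cp1252')   # the mojibake from the question
print(mangled)                       # ä¸ƒå¤§æ´‹
print(len(good.encode('utf-8')))     # 9 bytes
print(len(mangled.encode('utf-8')))  # 19 bytes -- inflated, but nothing lost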

You can use the following construction to revert the encoding:
"wrong_string".encode(Encoding::SOME_ENCODING).force_encoding('utf-8')
I tried all possible encodings to detect the right one:
Encoding.constants.each_with_object({}) do |encoding_name, result|
  value = "ä¸ƒå¤§æ´‹".encode(Encoding.const_get encoding_name).force_encoding('utf-8') rescue nil
  result[encoding_name] = value if value == "七大洋"
end.keys
#=> [:Windows_1252, :WINDOWS_1252, :CP1252, :Windows_1254, :WINDOWS_1254, :CP1254]
Thus, to convert your string to 七大洋 you can use any of the encodings above.

Alexander pointed out my main mistake (you need to encode and then force_encoding to find the right encoding). The string had indeed gone through CP1252!
The best solution is to read binary from MySQL and then force encoding:
client = Mysql2::Client.new(opts.merge encoding: 'binary')
# ...
text.force_encoding('UTF-8')
Or, if you can't change how you're getting the data, you'll be stuck with an Encoding::UndefinedConversionError when you try to encode. As detailed in this blog post, the solution is to supply fallback replacements for the five bytes that have no CP1252 mapping:
fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}
text.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
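The same fallback idea can be sketched in Python for comparison (the error-handler name and function here are mine, not from any library): the five code points U+0081, U+008D, U+008F, U+0090 and U+009D have no CP1252 mapping, so they are passed through as their raw byte values.
import codecs

def cp1252_fallback(err):
    # Pass the unmappable code points through as their single-byte values
    if isinstance(err, UnicodeEncodeError):
        bad = err.object[err.start:err.end]
        return bytes(ord(ch) for ch in bad), err.end
    raise err

codecs.register_error('cp1252-fallback', cp1252_fallback)

mangled = u'ä¸ƒå¤§æ´‹'   # the string from the question
print(mangled.encode('cp1252', errors='cp1252-fallback').decode('utf-8'))   # 七大洋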

Related

JSON escape quotes on value before deserializing

I have a server written in Rust. This server gets a request in JSON; the JSON the server receives is a string, and sometimes users write quotes inside a value, for example when making a new forum thread.
The only thing I really need to do is to escape the quotes inside the value.
So this:
"{"name":""test"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}"
Needs to be turned into this:
"{"name":"\"test\"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}"
I tried to replace:
let data = "{"name":""test"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}".to_string().replace("\"", "\\\"");
let res: serde_json::Value = serde_json::from_str(&data).unwrap();
But it results in the following error:
thread '' panicked at 'called Result::unwrap() on an Err value: Error("key must be a string", line: 1, column: 2)
I suspect this is because it transforms the string to the following:
let data = "{\"name\":\"\"test\"\", \"username\":\"tomdrc1\", \"date_created\":\"07/12/2019\", \"category\":\"Developer\", \"content\":\"awdawdasdwd\"}"
If I understand your question right, the issue is that you are receiving strings which should be JSON but are in fact malformed (perhaps generated by concatenating strings).
If you are unable to fix the source of those non-JSON strings, the only solution I can think of involves a lot of heavy lifting, with caveats:
Writing a custom "malformed-JSON" parser
Careful inspection/testing/analysis of how the broken client is broken
Using the brokenness information to fix the "malformed-JSON"
Using the fixed JSON to do normal request processing
I would recommend not doing that, except maybe as a training exercise. Fixing the client will be done in minutes, but implementing this perfectly on the server will take days or weeks. And the next time this one problematic client changes, you'll have to redo all the hard work.
The real answer:
Return "400 Bad Request" with some additional "malformed json" hint
Fix the client if you have access to it
Additional notes:
Avoid unwrapping in a server
Look for ways to propagate the Result::Err to the caller and use it to trigger a "400 Bad Request" response (see the sketch after this list)
Check out the error handling chapter in the Rust book for more
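As a sketch of that flow, here is the reject-malformed-JSON-with-400 pattern in Python using Flask (the framework, route, and field names are illustrative, not from the question); the same shape carries over to a Rust handler that maps a serde_json error to a 400 response.
from flask import Flask, jsonify, request
import json

app = Flask(__name__)

@app.route('/threads', methods=['POST'])
def create_thread():
    try:
        payload = json.loads(request.get_data(as_text=True))
    except json.JSONDecodeError as err:
        # Malformed JSON from the client: report it, don't try to repair it
        return jsonify(error='malformed json', detail=str(err)), 400
    # ... normal request processing using payload ...
    return jsonify(ok=True), 201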

how to use `charset` and `encoding` in `create_engine` of SQLAlchemy (to create pandas dataframe)?

I am very confused by the way charset and encoding work in SQLAlchemy. I understand (and have read about) the difference between charsets and encodings, and I have a good picture of the history of encodings.
I have a table in MySQL in latin1_swedish_ci (Why? Possibly because of this). I need to create a pandas dataframe in which I get the proper characters (and not weird symbols). Initially, this was in the code:
connect_engine = create_engine('mysql://user:password@1.1.1.1/db')
sql_query = "select * from table1"
df = pandas.read_sql(sql_query, connect_engine)
We started having trouble with the Š character (it corresponds to the unicode u'\u0160', but instead we get '\x8a'). I expected this to work:
connect_engine = create_engine('mysql://user:password@1.1.1.1/db', encoding='utf8')
but I continued getting '\x8a', which, I realized, makes sense given that utf8 is the default of the encoding parameter anyway. So, then, I tried encoding='latin1' to tackle the problem:
connect_engine = create_engine('mysql://user:password@1.1.1.1/db', encoding='latin1')
but I still got the same '\x8a'. To be clear, in both cases (encoding='utf8' and encoding='latin1'), I can do mystring.decode('latin1') but not mystring.decode('utf8').
And then I rediscovered the charset parameter in the connection string, i.e. 'mysql://user:password@1.1.1.1/db?charset=latin1'. And after trying all possible combinations of charset and encoding, I found that this one works:
connect_engine = create_engine('mysql://user:password@1.1.1.1/db?charset=utf8')
I would appreciate it if somebody could explain how to correctly use charset in the connection string and the encoding parameter of create_engine.
The encoding parameter does not work correctly for this.
So, as @doru said in this link, you should add ?charset=utf8mb4 at the end of the connection string, like this:
connect_string = 'mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8mb4'.format(DB_USER, DB_PASS, DB_HOST, DB_PORT, DATABASE)
I had the same problem. I just added ?charset=utf8mb4 at the end of the url.
Here is mine:
Before
SQL_ENGINE = sqlalchemy.create_engine('mysql+pymysql://'+MySQL.USER+':'+MySQL.PASSWORD+'@'+MySQL.HOST+':'+str(MySQL.PORT)+'/'+MySQL.DB_NAME)
After
SQL_ENGINE = sqlalchemy.create_engine('mysql+pymysql://'+MySQL.USER+':'+MySQL.PASSWORD+'@'+MySQL.HOST+':'+str(MySQL.PORT)+'/'+MySQL.DB_NAME + "?charset=utf8mb4")
encoding is the codec used for encoding/decoding within SQLAlchemy. From the documentation:
For those scenarios where the DBAPI is detected as not supporting a
Python unicode object, this encoding is used to determine the
source/destination encoding. It is not used for those cases where the
DBAPI handles unicode directly.
[...]
To properly configure a system to accommodate Python unicode objects,
the DBAPI should be configured to handle unicode to the greatest
degree as is appropriate [...]
mysql-python handles unicode directly, so there's no need to use this setting.
charset is a setting specific to the mysql-python driver. From the documentation:
This charset is the client character set for the connection.
This setting controls three variables on the server (character_set_client, character_set_connection, and character_set_results); the last one is what you are interested in. When set, strings are returned as unicode objects.
Note that this applies only if you have latin1 encoded data in the database. If you've stored utf-8 bytes as latin1, you may have better luck using encoding instead.
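To see concretely why Š comes back as '\x8a': MySQL's "latin1" is actually cp1252, where Š is the single byte 0x8a. A quick check in Python (illustrative, not part of the original answer):
raw = b'\x8a'                      # the byte the question keeps getting back
print(raw.decode('cp1252'))        # Š -- MySQL's "latin1" is really cp1252
print(u'\u0160'.encode('cp1252'))  # b'\x8a' -- and the round trip back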
This works for me.
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

db_url = {
    'database': "dbname",
    'drivername': 'mysql',
    'username': 'myname',
    'password': 'mypassword',
    'host': '127.0.0.1',
    'query': {'charset': 'utf8'},  # the key-point setting
}
engine = create_engine(URL(**db_url), encoding="utf8")
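With the charset on the URL, the original pandas call from the question should then come back with the proper characters. For completeness, a minimal sketch (hypothetical credentials, same shape as the question's code):
import pandas
from sqlalchemy import create_engine

connect_engine = create_engine('mysql://user:password@1.1.1.1/db?charset=utf8')
df = pandas.read_sql("select * from table1", connect_engine)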

AS3: Conversion to GBK charset

Using Flex (and HTTPService), I am loading data from a URL, data that is encoded with the GBK charset. A good example of such a URL is this one.
A browser gets that the data is in the GBK charset, and correctly displays the text using Chinese characters where they appear. However, Flex will hold the data in a different charset, and it happens to look like this:
({"q":"tes","p":false,"bs":"","s":["ÌØ˹À­","ÌØÊâ·ûºÅ","test","ÌØÊâÉí·Ý","tesco","ÌØ˹À­Æû³µ","ÌØÊÓÍø","ÌØÊâ·ûºÅͼ°¸´óȫ","testin","ÌØ˹À­Æ󳵼۸ñ"]});
I need to correctly change the text into the same character string that browsers display.
What I am already doing is using a ByteArray, with the best result so far coming from "iso-8859-1":
var convert:String;
var byte:ByteArray = new ByteArray();
byte.writeMultiByte(event.result as String, "iso-8859-1");
byte.position = 0;
convert = byte.readMultiByte(byte.bytesAvailable, "gbk");
This creates the following string, which is very close to the browser result, but not quite:
({"q":"tes","p":false,"bs":"","s":["特?拉","特殊符号","test","特殊身份","tesco","特?拉汽车","特视网","特殊符号?案大?","testin","特?拉????]});
Some characters are still replaced by "?" marks. And when I copy the browser result into Flex and print it, it gets displayed correctly, so it is not a matter of unsupported characters in Flash trace or anything like that.
Interesting fact: Notepad++ gives the same close-but-not-quite result as the ByteArray approach in Flex. Also in NP++, when converting the correct/expected string from gbk to iso-8859-1, I get a slightly different string than the one Flex is getting from the URL:
({"q":"tes","p":false,"bs":"","s":["ÌØ˹À­","ÌØÊâ·ûºÅ","test","ÌØÊâÉí·Ý","tesco","ÌØ˹À­Æû³µ","ÌØÊÓÍø","ÌØÊâ·ûºÅͼ°¸´óÈ«","testin","ÌØ˹À­Æû³µ¼Û¸ñ"]});
It seems to me that this string is the one Flex should be getting, for the ByteArray approach to produce the correct result (visible in browsers). So I see 3 possible causes for this:
Something is happening to the data coming from the URL to Flex, causing it to be slightly different (unlikely)
The received charset is not actually iso-8859-1, but another similar charset
I don't have a complete grasp of the difference between encoding and charset, so maybe this keeps me from understanding the problem.
Any help/idea would be greatly appreciated.
Thank you.
Managed to find the problem and solution, hope this will help anyone else in the future.
It turns out that HTTPService automatically converts the result into a String, which may collapse some pairs of bytes into single characters. That is why I was getting the first result (see above) instead of the third one. What I needed to do was get the result in binary form; HTTPService does not have this type of resultFormat, but URLLoader does.
Replace HTTPService with URLLoader
Set the dataFormat property of the URLLoader to URLLoaderDataFormat.BINARY
After loading, the data property will return a ByteArray. Tracing this byte array (or converting it into a String) will display the same result as the HTTPService was getting, which is still wrong; however, the byte array actually holds the correct data, byte for byte (its length property will be a bit larger than the size of the converted string).
So you can read the string from this ByteArray, using the "gbk" charset:
byteArray.readMultiByte(byteArray.length, "gbk");
This returns the correct string, which the browser is also displaying.
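Why this works can be sketched in Python for illustration ("特斯拉" stands in for one of the strings in the suggestion payload): iso-8859-1 maps every byte value 0x00-0xFF to a code point, so decoding GBK bytes through it is lossless and reversible, whereas HTTPService's String conversion dropped some bytes, producing the unrecoverable "?" characters.
raw = u'特斯拉'.encode('gbk')        # the bytes as sent by the server
mojibake = raw.decode('iso-8859-1')  # what a naive text conversion yields
assert mojibake.encode('iso-8859-1') == raw          # nothing was lost
print(mojibake.encode('iso-8859-1').decode('gbk'))   # 特斯拉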

How to submit blob data into MySQL from ruby without base64 encoding

I have searched, without much success, for how best to submit binary data into a MySQL field of type BLOB without doing a base64 encoding, which increases the size of the data.
So far my Ruby code looks something like this:
require 'zlib'
require 'base64'
require 'mysql'
#Initialization of connection function
db = Mysql.init
db.options(Mysql::OPT_COMPRESS, true)
db.options(Mysql::SET_CHARSET_NAME, 'utf8')
dbh = db.real_connect('hostname','username','password','database') #replace with appropriate connection details
#Saving the data function
values=someVeryBigHash
values=JSON.dump(values)
values=Zlib::Deflate.deflate(values, Zlib::BEST_COMPRESSION)
values=Base64.encode64(values)
dbh.query("update `SomeTable` set Data='#{values}' where id=1")
dbh.close if dbh
#Retrieving the data function
res=dbh.query("select * from `SomeTable` where id=1")
data=res.fetch_hash['Data']
data=Base64.decode64(data)
data=Zlib::inflate(data)
data=JSON.parse(data)
The issue is that using Base64 encoding/decoding is not very efficient and I was hoping for something a bit cleaner.
I also tried an alternative using Marshal (which does not allow me to send the data without a base64 encoding, but is a bit more compact):
#In the saving function use Marshal.dump() instead of JSON.dump()
values=Marshal.dump(values)
#In Retrieve function use Marshal.load() (or Marshal.restore()) instead of JSON.parse(data)
data=Marshal.load(data)
However, I get some errors (perhaps someone can spot what I'm doing wrong, or has some idea why this occurs):
incompatible marshal file format (can't be read) version 4.8 required; 34.92 given
I tried different flavors of this with/without Base64 encoding/decoding and with/without Zlib compression, but I seem to consistently get an error.
How would it be possible to send binary data using Ruby and the mysql gem, without base64 encoding? Or is base64 encoding simply a requirement for sending the data?
The issue is that using Base64 encoding/decoding is not very efficient and I was hoping for something a bit cleaner.
You're using JSON to convert a large hash to a string, then compressing that with ZLib into binary data, then Base64 encoding the resulting binary data, and you're worried about efficiency... I'm going to assume you mean spatial efficiency rather than temporal efficiency.
I guess I'm most curious about why you're Base64 encoding in the first place - a BLOB is a binary data format, and provided you pass the same bytes back to ZLib it should inflate them correctly regardless.
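The compress/decompress round trip itself never needs base64, as a quick Python illustration shows (the hash literal is just a stand-in):
import json
import zlib

blob = zlib.compress(json.dumps({'some': 'very big hash'}).encode('utf-8'), 9)
assert json.loads(zlib.decompress(blob).decode('utf-8')) == {'some': 'very big hash'}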
Have you tried writing binary data directly to the database? What issues did you experience?
Edit:
update SomeTable set Data='xڍ�]o�0��K$�k;�H��Z�*XATb�U,where id=1' resulted in an error... Obviously this has to do with the binary nature of the data. This captures the essence of my question. Hope you can shed some light on this issue.
You can't just pass the binary string as a query value as you have here - I think you need to use a query with a bind variable.
I'm unsure whether the mysql gem you're using supports query bind parameters, but the format of query you'd use is something along the lines of:
db.execute('update SomeTable set Data=? where id = 1', <binary data value>)
This will permit MySQL to properly escape or encapsulate the binary data that is to be inserted into the database table.
To summarize mcfinningan's answer: transmitting the binary data is done via binding a parameter. In Ruby this can be done with the 'mysql' gem using prepared statements (cf. MySQL Ruby tutorial).
The code now looks like:
require 'zlib'
require 'mysql'

# Initialization of connection (replace with appropriate connection details)
db = Mysql.init
db.options(Mysql::OPT_COMPRESS, true)
db.options(Mysql::SET_CHARSET_NAME, 'utf8')
dbh = db.real_connect('hostname', 'username', 'password', 'database')

# Saving the data (can skip the Zlib compression if not needed)
values = someVeryBigHash
values = Marshal.dump(values)
values = Zlib::Deflate.deflate(values, Zlib::BEST_COMPRESSION)
# Load the binary data into MySQL via a bound parameter
# (assumes your schema has some table with a column Data of type BLOB)
sth = dbh.prepare("update `SomeTable` set Data=? where id=1")
sth.execute(values)
# End of data loading
dbh.close if dbh

# Retrieving the data (can skip the Zlib decompression if the data is not compressed)
res = dbh.query("select * from `SomeTable` where id=1")
data = res.fetch_hash['Data']
data = Zlib::Inflate.inflate(data)
data = Marshal.restore(data)

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the others, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and such (é, à, è, ...)
I get the written text with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it into my database, I do:
text= str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It is working for one of my text areas, but for the other ones it puts weird characters in my db instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently I had some fields in my database that weren't up to date, and this was somehow blocking the encoding process.
You are doing a lot of encoding, decoding, and reencoding which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like Sqlalchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQT is a unicode value. I don't know PyQT. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string, probably encoded in UTF-8, so you can decode it with something like text.decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQT to MySQL.
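A minimal end-to-end sketch of this advice using MySQLdb directly (Python 2 style to match the unicode discussion; the credentials, table, and column are hypothetical):
import MySQLdb

# charset='utf8' implies use_unicode=True: values go in and come back
# as unicode objects, with no manual encoding or decoding anywhere.
db = MySQLdb.connect(host='localhost', user='user', passwd='password',
                     db='mydb', charset='utf8')

text = u'caf\xe9 \u0160'   # stand-in for unicode(self.ui.text_area.toPlainText())

cur = db.cursor()
cur.execute("insert into table1 (body) values (%s)", (text,))
db.commit()

cur.execute("select body from table1")
print(cur.fetchone()[0])   # a unicode object, accents intact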