Writing "latin1" encoded byte array with escape characters - mysql

I'm working on a database import/export process in VB.NET which writes data from a MySQL (5.5) database to a plain text file. The application reads the data into a DataTable, then goes through the rows/columns to write the data to the OutputFile (a System.IO.StreamWriter object). The encoding on the tables in this database is Latin1. There is a MediumBlob field in one of the tables I've been using for testing, which contains image files stored as byte arrays.
To validate the output from my application, I've exported the data directly from the database using MySQL Workbench, then compared that with the results I get when I write the same data from my application. In the direct export from MySQL Workbench, I see that some of these bytes are exported with a backslash in front of them. When I write the data through my application, however, this escape character does not appear. Viewed in Notepad++, there are clearly some distinct differences between the two outputs (see screenshot).
Obviously, while very similar, the two are not completely identical. My application is not including the backslashes for escaped characters, and some characters, such as NULL, are coming out differently altogether. My code for writing this field to the file is:
OutputFile.Write("'" & System.Text.Encoding.GetEncoding(28591).GetString(CType(COPYRow(ColumnIndex), Byte())) & "'")
There doesn't appear to be an overload of the GetString method that lets me specify an escape character, so I'm wondering whether there's another way, using this method, to ensure the characters are written out correctly, escape characters included.
I'm "assuming" this approach should also work in general when I start working with my PostgreSQL database, possibly with a different encoding. I'm trying to build things as "generic" as possible; specifying encodings at run-time instead of hard-coding them is something I'll worry about later.
EDIT
I just ran across another SO question, which might point me in the right direction: Convert a Unicode string to an escaped ASCII string. Obviously, it might take a bit more work to get it right, but this looks like the closest thing to what I'm trying to accomplish.
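In the meantime, one option is to do the escaping myself after decoding the bytes, rather than looking for it in GetString. Below is a minimal sketch of that idea, written in Python rather than VB.NET, assuming the dump uses mysqldump-style backslash escapes for backslash, single quote, NUL, newline and carriage return (the file name and sample blob bytes are made up; check which characters your import side actually expects):

# Characters that get a backslash escape in the output (assumed set)
ESCAPES = {
    "\\": "\\\\",
    "'": "\\'",
    "\x00": "\\0",
    "\n": "\\n",
    "\r": "\\r",
}

def escape_field(raw_bytes):
    # latin-1 is the same mapping as Encoding.GetEncoding(28591).GetString(...)
    text = raw_bytes.decode("latin-1")
    return "".join(ESCAPES.get(ch, ch) for ch in text)

blob_value = b"\x89PNG\r\n\x1a\n\x00'\\"   # hypothetical first bytes of an image blob
with open("export.txt", "w", encoding="latin-1", newline="") as output_file:
    output_file.write("'" + escape_field(blob_value) + "'")

The same character-by-character replacement translates directly to VB.NET (for example, chained String.Replace calls on the decoded string) once the exact escape set is known.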

Related

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "CSV" but as a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that; I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using a UDT...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert it into Parquet?
For anybody else who might be looking for an answer, the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map(this::conversionFunction);
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
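The conversionFunction itself isn't shown above; its essential job is turning a German-formatted string into a number. A hypothetical sketch of just that parsing step, shown in Python only to illustrate the logic (in the Java function the equivalent could be done with NumberFormat.getInstance(Locale.GERMANY) or with plain string replacement):

# German notation uses "." for thousands grouping and "," for decimals
def parse_german_number(text):
    return float(text.replace(".", "").replace(",", "."))

parse_german_number("123,01")      # 123.01
parse_german_number("1.234,56")    # 1234.56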
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)

MS-Access VBA magically converting unicode strings?

First, I admit I'm not a VB expert, but I was asked to check the part of our database system that takes care of handling the languages of our application. The issue is that some accented characters seem to be magically converted to versions without the accents.
For example, the Polish word "przesunąć" will be stored as "przesunac" in the record field at the time of the call to Recordset.MoveNext. "Unicode Compression" is set to True on that column, but I doubt it's related. I'm trying to find out what causes this magic conversion, because I don't want it.
Someone stated at http://www.pcreview.co.uk/forums/no-unicode-dao-recordset-t1102041.html that "the Recordset contains correct data but that the Debugger window and Tooltips can't display Unicode strings". Interesting. Dumb, but interesting.
Fine, but why are the strings in ANSI in the file? Well, the next post in the same thread reads "If you want to write in Unicode with VBA, my feeling would be that you must write in binary mode; not in Text mode." This led me to http://accessblog.net/2007/06/how-to-write-out-unicode-text-files-in.html where I got my final answer.
Case solved.

Erlang emysql iPhone Emoji Encoding Issue

I'm trying to store text (with emoji) from an iPhone client app in a MySQL database with Erlang (into a varchar column).
I used to do it with a socket connection server written in C++ and mysqlpp, and it was working great. (It is the exact same database, so I can assume that the issue is not coming from the database.)
However, I decided to move everything to Erlang for scalability reasons, and since then I have been unable to store and retrieve emoji correctly.
I'm using emysql to communicate with my database.
When I'm storing, I'm sending this list to the database:
[240,159,152,130]
When I'm retrieving, here is what I get:
<<195,176,194,159,194,152,194,130>>
There are some obvious similarities: we can see 159, 152 and 130 on both lines, but no 240. I do not know where 195, 176 and 194 come from.
I thought about changing the emysql encoding when creating the connection pool.
emysql:add_pool(my_db, 3, "login", "password", "db.mydomain.com", 3306, "MyTable", utf8)
But I can't seem to find the proper atom for utf32 encoding. (The interesting thing is that I did not set any encoding with C++ and mysqlpp; it worked out of the box.)
I have made some tests...
storing from C++, retrieving from C++ (Works fine)
storing from Erlang, retrieving from Erlang (Does not work)
storing from Erlang, retrieving from C++ (Does not work)
storing from C++, retrieving from Erlang (Does not work)
One more piece of information: I'm using prepared statements in Erlang, while I'm not in C++.
Any help would be appreciated.
As requested, here is the query for storing data:
UPDATE Table SET c=? WHERE id=?
Quite simple really...
It is all about UTF-8 encoding. In Erlang a list of characters, in your case [240,159,152,130], isn't normally encoded; the elements are the Unicode code points. When you retrieved the data you got a binary containing the UTF-8 encoded bytes of your characters. Exactly where this encoding occurred I don't know. From the Erlang shell:
10> Bin = <<195,176,194,159,194,152,194,130>>.
<<195,176,194,159,194,152,194,130>>
11> <<M/utf8,N/utf8,O/utf8,P/utf8,R/binary>> = Bin.
<<195,176,194,159,194,152,194,130>>
12> [M,N,O,P].
[240,159,152,130]
Handling Unicode in Erlang is pretty simple: characters in lists are usually the Unicode code points and are very rarely encoded, while storing them in binaries means you have to encode them in some way, since binaries are just arrays of bytes. The default encoding is UTF-8. The unicode module contains functions for converting between Unicode lists and binaries.
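Note that [240,159,152,130] happens to be exactly the UTF-8 byte sequence for that emoji (U+1F602), so the binary that came back is that byte sequence UTF-8 encoded a second time. The same relationship can be reproduced outside Erlang; here is a short illustration in Python:

# 240,159,152,130 are the UTF-8 bytes of U+1F602
utf8_bytes = u"\U0001F602".encode("utf-8")
print(list(bytearray(utf8_bytes)))           # [240, 159, 152, 130]
# treating each of those bytes as a code point and UTF-8 encoding again
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")
print(list(bytearray(double_encoded)))       # [195, 176, 194, 159, 194, 152, 194, 130]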

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the text written with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then to insert it in my database I do:
text = str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It works for one of my text areas, but for the other ones it puts weird characters in my db instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently some fields in my database weren't up to date, and this was somehow blocking the encoding process.
You are doing a lot of encoding, decoding, and re-encoding, which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings), and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string, probably encoded in UTF-8, so you can decode it with .decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQT to MySQL.
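As a minimal sketch of what that Unicode-clean round trip can look like with MySQLdb (the table, column and credentials here are hypothetical, and the text value stands in for whatever toPlainText() returns once it is a Unicode string):

import MySQLdb

# charset="utf8" implies use_unicode, so results come back as Unicode strings
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="mydb", charset="utf8")
cur = conn.cursor()
text = u"caf\u00e9 \u00e0 d\u00e9guster"     # a Unicode string, no manual encoding
cur.execute("INSERT INTO notes (body) VALUES (%s)", (text,))
conn.commit()
cur.execute("SELECT body FROM notes")
print(cur.fetchone()[0])                     # accents come back intact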

html special characters into outfile.txt

I've got a database where, for efficiency, I've put the data into the db in HTML-encoded format.
I do maintenance on the data, and then move it into production via an 'into outfile', so it ends up in a text file.
The special characters don't make it across cleanly, and they come out as messed-up code.
Is there a way to maintain the format for the txt file?
Or should I be using another format?
I find the 'outfile' and 'import' very efficient for doing a bulk transfer.
If I can't use that, any suggestions on the best way to find special characters in MySQL?
The only thing I've found seems to find fields that contain ONLY non-ASCII characters:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9]';
Is there a reason you're storing HTML-encoded text in the database? As discussed in episode 58 of the Stack Overflow podcast, you should always try to store raw data at the highest level of precision possible.
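If you do decide to convert the existing HTML-encoded columns back to raw text, the decoding itself is a one-liner in most languages; for example, in Python (purely illustrative, the sample value is made up):

import html

encoded = "caf&eacute; &amp; cr&egrave;me"   # as currently stored in the column
print(html.unescape(encoded))                # café & crème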