I have a CSV file that I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would let me convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
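For reference, plain Java parses such numbers fine when given a German locale, which is essentially what I'd hope the CSV parser could do. A minimal standalone example:

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class GermanNumberDemo {
    public static void main(String[] args) throws ParseException {
        // German locale: comma is the decimal separator, dot the grouping separator
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.GERMANY);
        double value = nf.parse("123,01").doubleValue();
        System.out.println(value); // prints 123.01
    }
}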
I thought of using a UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (and couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert the result to Parquet?
For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map(this::conversionFunction);
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
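As an illustration, here is a minimal sketch of such a conversion function, assuming a hypothetical two-column schema (a String name in column 0 and a German-formatted number in column 1); adjust indices and types to your actual schema:

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Sketch only: a method of the class running the job above; assumes column 0
// is a String and column 1 holds a German-formatted number such as "123,01".
private Row conversionFunction(Row row) {
    NumberFormat german = NumberFormat.getNumberInstance(Locale.GERMANY);
    Double amount;
    try {
        amount = german.parse(row.getString(1)).doubleValue();
    } catch (ParseException e) {
        amount = null; // or log/collect bad records, as appropriate
    }
    return RowFactory.create(row.getString(0), amount);
}

For this to line up, schemaWithNumbers would declare the second field as a nullable DoubleType.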
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)
First, I admit I'm not a VB expert, but I was asked to check the part of our database system that takes care of handling the languages of our application. The issue is that some accented characters seem to be magically converted to versions without the accents.
For example, the Polish word "przesunąć" will be stored as "przesunac" in the record field at the time of the call to Recordset.MoveNext. "Unicode Compression" is set to true on that column, but I doubt it's related. I'm trying to find out what causes this magic conversion, because I don't want it.
Someone stated at http://www.pcreview.co.uk/forums/no-unicode-dao-recordset-t1102041.html that "the Recordset contains correct data but that the Debugger window and Tooltips can't display Unicode strings". Interesting. Dumb, but interesting.
Fine, but why are the strings in ANSI in the file? Well, the next post in the same thread reads "If you want to write in Unicode with VBA, my feeling would be that you must write in binary mode; not in Text mode." This led me to http://accessblog.net/2007/06/how-to-write-out-unicode-text-files-in.html where I got my final answer.
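For completeness, a minimal VBA sketch of that binary-mode approach (the file path and text are just examples; the String-to-Byte-array assignment copies the string's internal UTF-16 LE bytes):

Sub WriteUnicodeFile()
    Dim s As String
    s = "przesun" & ChrW(261) & ChrW(263)  ' "przesunąć" built with ChrW

    Dim b() As Byte
    b = ChrW(&HFEFF) & s                   ' prepend a BOM, then copy the UTF-16 LE bytes

    Dim f As Integer
    f = FreeFile
    Open "C:\temp\unicode.txt" For Binary Access Write As #f  ' example path
    Put #f, , b
    Close #f
End Sub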
Case solved.
I'm trying to store text (with emoji) from an iPhone client app in a MySQL database with Erlang (into a VARCHAR column).
I used to do it with a socket connection server written in C++ and mysqlpp, and it was working great. (It is the exact same database, so I can assume that the issue is not coming from the database.)
However, I decided to move everything to Erlang for scalability reasons, and since then I have been unable to store and retrieve emojis correctly.
I'm using emysql to communicate with my database.
When storing, I'm sending this list to the database:
[240,159,152,130]
When retrieving, here is what I get:
<<195,176,194,159,194,152,194,130>>
There are some obvious similarities - we can see 159, 152 and 130 on both lines - but no 240, and I do not know where 195, 176 and 194 come from.
I thought about changing the emysql encoding when creating the connection pool:
emysql:add_pool(my_db, 3, "login", "password", "db.mydomain.com", 3306, "MyTable", utf8)
But I can't seem to find the proper atom for UTF-32 encoding. (The interesting thing is that I did not set any encoding in C++ and mysqlpp; it worked out of the box.)
I have run some tests:
storing from C++, retrieving from C++ (Works fine)
storing from Erlang, retrieving from Erlang (Does not work)
storing from Erlang, retrieving from C++ (Does not work)
storing from C++, retrieving from Erlang (Does not work)
One more piece of information: I'm using prepared statements in Erlang, while I'm not in C++.
Any help would be appreciated.
As requested, here is the query for storing data:
UPDATE Table SET c=? WHERE id=?
Quite simple really...
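In case it matters, the statement is prepared and executed roughly like this (the statement name is made up; a sketch, not my exact code, with 42 standing in for a real id):

emysql:prepare(update_c_stmt, <<"UPDATE Table SET c=? WHERE id=?">>),
emysql:execute(my_db, update_c_stmt, [[240,159,152,130], 42]).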
It is all about UTF-8 encoding. In Erlang, a list of characters - in your case [240,159,152,130] - is normally not encoded; the elements are just the Unicode code points. When you retrieved the data, you got a binary containing the UTF-8 encoded bytes of those characters. Exactly where this encoding occurred I don't know. From the Erlang shell:
10> Bin = <<195,176,194,159,194,152,194,130>>.
<<195,176,194,159,194,152,194,130>>
11> <<M/utf8,N/utf8,O/utf8,P/utf8,R/binary>> = Bin.
<<195,176,194,159,194,152,194,130>>
12> [M,N,O,P].
[240,159,152,130]
Handling Unicode in Erlang is pretty simple: characters in lists are usually plain Unicode code points and are very rarely encoded, while storing them in binaries means you have to encode them in some way, as binaries are just arrays of bytes. The default encoding is UTF-8. The unicode module has functions for converting between Unicode lists and binaries.
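It's worth noting that your list [240,159,152,130] is itself already the UTF-8 byte sequence of the emoji U+1F602 (code point 128514), not a list of code points, so something on the way to the database appears to have encoded it a second time. You can see this from the shell:

1> unicode:characters_to_list(<<240,159,152,130>>, utf8).
[128514]
2> unicode:characters_to_binary([128514]).
<<240,159,152,130>>
3> unicode:characters_to_binary([240,159,152,130]).
<<195,176,194,159,194,152,194,130>>

The third call treats each byte as a code point and encodes it again, reproducing exactly the binary you got back from the database.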
I have a weird encoding problem going from my PyQt app to my MySQL database.
I mean weird in the sense that it works in one case and not in the others, even though I seem to be doing the exact same thing for all of them.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text that may contain accented characters and the like (é, à, è, ...).
I get the written text with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it into my database, I do:
text = str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It works for one of my text areas, but for the others it puts weird characters in my db instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently I had some fields in my database that weren't up to date, and this was somehow blocking the encoding process.
You are doing a lot of encoding, decoding, and re-encoding, which is hard to follow even if you know what all of it means. You should try to simplify this down to working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt; check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string, probably encoded in UTF-8, so you can decode it with .decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQt to MySQL.
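A minimal sketch of the whole path (Python 2 here, since your .decode/.encode calls suggest it; the table name and credentials are placeholders):

# -*- coding: utf-8 -*-
import MySQLdb

# In the app this would be: text = unicode(self.ui.text_area.toPlainText())
text = u"ann\xe9e"  # stand-in value containing an accented character

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="mydb", charset="utf8")  # charset implies use_unicode=True
cur = conn.cursor()
cur.execute("INSERT INTO notes (body) VALUES (%s)", (text,))  # the driver encodes for you
conn.commit()
conn.close()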
I've got a database where, for efficiency, I've put the data into the db in HTML-encoded form.
I do maintenance on the data and then move it into production via an 'into outfile', so it ends up in a text file.
The special characters don't make it across cleanly, and they come out garbled.
Is there a way to maintain the format for the txt file?
Or should I be using another format?
I find the 'outfile' and 'import' combination very efficient for doing a bulk transfer.
If I can't use that, any suggestions on the best way to find special characters in MySQL?
The only thing I've found seems to find fields that ONLY contain non-ASCII characters:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9]';
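Edit: for the record, two variations I've come across that should match rows containing any non-ASCII character, rather than only fields made up entirely of them:

-- matches rows where the column contains at least one non-ASCII character
SELECT * FROM tableName WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII);

-- matches rows containing multi-byte characters (assumes a utf8 column)
SELECT * FROM tableName WHERE LENGTH(columnToCheck) <> CHAR_LENGTH(columnToCheck);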
Is there a reason you're storing HTML-encoded text in the database? As discussed in episode 58 of the Stack Overflow podcast, you should always try to store raw data at the highest level of precision possible.