Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0) - the parser enforces US formatting rather than, for example, relying on the Locale set for the JVM or allowing this to be configured.
I thought of using a UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then write the result to Parquet?

For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
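For illustration, a minimal sketch of such a conversion function, assuming a two-column schema with one String column and one German-formatted decimal column; the class name, column indices and target types are hypothetical and need to be adapted to the real schema:

import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class GermanNumberConverter {

    // Converts "123,01" (German notation) into a double, leaving column 0 as a String.
    public Row conversionFunction(Row row) {
        // DecimalFormat is not thread-safe, so create one per call (or per partition).
        DecimalFormat format = new DecimalFormat("#,##0.##",
                DecimalFormatSymbols.getInstance(Locale.GERMANY));
        format.setParseBigDecimal(true);
        try {
            BigDecimal value = (BigDecimal) format.parse(row.getString(1));
            return RowFactory.create(row.getString(0), value.doubleValue());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse number: " + row.getString(1), e);
        }
    }
}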
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)

Related

Cannot identify proper format for a json request body stored and used in csv file for use in a karate scenario

I am having trouble identifying the proper format to store a JSON request body in CSV format and then use the CSV file value in a scenario.
This works properly within a scenario:
And request '{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}'
However, when stored in the CSV file as follows (I've tried quite a number of other formatting variations)
'{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}',
and used in the scenario as:
And request requestBody
my test fails with "javascript evaluation failed: " followed by the JSON above and :1:63 Missing close quote ^ in at line number 1 at column number 63.
Can you please identify the correct formatting or the usage errors I am missing? Thanks
We just use a basic CSV library behind the scenes. I suggest you roll your own Java helper class that does whatever processing / pre-processing you need.
Do read this answer as well: https://stackoverflow.com/a/54593057/143475
I can't make sense of your JSON but if you are trying to fit JSON into CSV, sorry - that's not a good idea. See this answer: https://stackoverflow.com/a/62449166/143475
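If you do go the Java helper route, a minimal sketch of the idea (the class name and file layout are purely illustrative) is a class that loads each request body from its own .json file instead of squeezing it into a CSV cell:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PayloadHelper {

    // Reads a JSON payload from disk and returns it as a String that a
    // scenario can pass straight to "request".
    public static String read(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}

From a feature file such a class can be called via Java.type, although in many cases karate.read('classpath:some-payload.json') already does the job without any custom Java.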

Solr CSV responses writer: How to get no encapsulator

I need to get back CSV output from my Solr queries, so I am using Solr's CSV response writer.
All works fine using wt=csv without changing the default values for the CSV output, but I have one requirement: I need tab-separated output with no text value quoting at all.
The tab separation is easy, as I can specify a tab as csv.separator in the Solr CSV response writer.
The problem is how to get rid of encapsulation:
The default value for encapsulation of CSV fields is ".
But setting encapsulator='' or encapsulator=None returns the error Invalid encapsulator.
There seems to be no documentation for this in the Solr Wiki.
How can I suppress encapsulation altogether?
You are not going to be able to - the Java source expects a one-character encapsulator:
String encapsulator = params.get(CSV_ENCAPSULATOR);
String escape = params.get(CSV_ESCAPE);
if (encapsulator != null) {
  if (encapsulator.length() != 1) throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid encapsulator:'" + encapsulator + "'");
  strat.setEncapsulator(encapsulator.charAt(0));
}
What you can do:
Write your own custom NoEncapsulatorCSVResponseWriter, by inheriting from CSVResponseWriter probably, and modify the code so it does not use the encapsulator. Not difficult, but mostly a hassle.
Use some unique encapsulator (for example ø) and then add a post-processing step on your client side that just removes it (a sketch of such a step is below). Easier, but you need that extra step...
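For illustration, a rough sketch of that second option in Java; the file names and the sentinel character are assumptions, and the response is assumed to have been requested with csv.separator set to a tab and csv.encapsulator set to the sentinel:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripEncapsulator {

    public static void main(String[] args) throws IOException {
        // The Solr response was requested with a tab separator and a sentinel
        // encapsulator that never occurs in the data, e.g. ø.
        String body = new String(
                Files.readAllBytes(Paths.get("solr-response.csv")), StandardCharsets.UTF_8);
        // Post-processing step: simply drop every occurrence of the sentinel.
        Files.write(Paths.get("solr-response.tsv"),
                body.replace("ø", "").getBytes(StandardCharsets.UTF_8));
    }
}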

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's data challenge data set; however, I cannot open it since it is in JSON format and almost 2 GB. On its website it is said that the dataset can be opened in Python using mrjob, but I am also not very good with programming. I searched online and looked at some of the code Yelp provided on GitHub, however I couldn't find an article or anything that clearly explains how to open the dataset.
Can you please tell me step by step how to open this file and maybe how to convert it to csv?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format. When you extract it, it contains another file; rename that to .tar and extract it again. You will get all the JSON files.
Yes, you can use pandas. Take a look:
import pandas as pd

# read the entire file into a python list, one JSON document per line
with open('yelp_academic_dataset_review.json', 'r', encoding='utf-8') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]

# wrap the documents in a JSON array so the whole string is a single valid JSON document
data_json_str = "[" + ",".join(data) + "]"

# now, load it into pandas
# (newer pandas versions can also read the file directly with pd.read_json(path, lines=True))
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it can help you
To process huge json files, use a streaming parser.
Many of these files aren't a single JSON document but a stream of JSON documents (known as "jsons" format). A regular JSON parser will then consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts of it, write them to the desired output, and then continue reading.
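As a concrete illustration (assuming Jackson is on the classpath and reusing the review file name from the pandas answer above; the "stars" field is just an example), a streaming read in Java could look like this:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class YelpStreamingReader {

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // readValues() pulls one JSON document at a time from the stream of
        // line-delimited documents, so the ~2 GB file is never held in memory as a whole.
        try (MappingIterator<JsonNode> it = mapper.readerFor(JsonNode.class)
                .readValues(new File("yelp_academic_dataset_review.json"))) {
            while (it.hasNextValue()) {
                JsonNode review = it.nextValue();
                // Process one record at a time, e.g. emit one CSV line per record.
                System.out.println(review.path("stars").asText());
            }
        }
    }
}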
There is no single json-to-csv conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that a JSON document is a tree, but a CSV is not. There is no universal, efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding - to succeed with such amounts of data, you need to become a better programmer.

CSV parsing nested quotes

I am trying to parse a fairly complex CSV with Apache Spark's CSV reader, which internally relies on the Apache Commons library (https://github.com/databricks/spark-csv).
I tried different combinations of quoteMode and escape but could not get it to work, e.g. prevent the exceptions. Do you have any hints as to which parameters would support such a nested structure?
ERROR CsvRelation$: Exception while parsing line: "Gabriella's Song" From The Motion Picture "The Mission";
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
I know that sed could be used to pre-process the data. However, it would be great if this were handled within Spark, i.e. if no further pre-processing were needed. I did not find a way to specify a regex or anything similar.
The CSV file looks like:
"Gabriella's Song" From The Motion Picture "The Mission";
This is related to https://github.com/databricks/spark-csv/issues/295
Some more special fields like
&
or "Eccoli; attenti ben (Don Pasquale)"
cause these problems. We will write our own CSV pre-processor for Apache Camel.
Try this, it worked very well for me -
HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(s"hdfs://{my-hdfs-file-path}")
Non-HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(my-hdfs-file-path)
The above approach works for any delimited file; just change the delimiter value.
You can also use a regex, but that will be very inefficient for large files.
Hope this is helpful.
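As a further sketch (not from the answers above): Spark's built-in CSV reader documents that quotation handling can be turned off by setting the quote option to an empty string, which lets stray quotes such as the ones in the "Gabriella's Song" line pass through as ordinary characters. It will not help for quoted fields that contain the delimiter itself (the "Eccoli; attenti ben" example), so some pre-processing may still be required. The path below is a placeholder:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QuoteFreeCsvRead {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("quote-free-csv").getOrCreate();
        // An empty quote option turns quotation handling off, so embedded quotes
        // are read as ordinary characters. Quoted fields that contain the
        // delimiter itself will still be split incorrectly.
        Dataset<Row> df = spark.read()
                .option("delimiter", ";")
                .option("quote", "")
                .csv("/path/to/songs.csv");
        df.show(false);
    }
}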

Delimiter for multiple json strings

I'd like to save multiple json strings to a file and separate them by a delimiter, such that it will be easy to read this list in, split on the delimiter and work with each json doc separately.
Serializing using a json array is not an option due to external reasons.
I would like to use a delimiter that is illegal in JSON (e.g. delimiting using a comma would be a bad idea since there are commas within the json strings).
Are there any characters that are not considered legal in JSON serialized strings?
I know it's not exactly what you needed, but you can use this SO answer to write the json string to a CSV, then read it on the other side by using a good streaming CSV reader such as this one
NDJSON
Have a look at NDJSON (Newline delimited JSON).
http://ndjson.org/
It seems to me to be exactly how you should do things, though it's not exactly what you asked for. (If you can't flatten your JSON objects into single lines, then it's not for you!) You asked for a delimiter that is not allowed in JSON. Newline is allowed in JSON, but it is not necessary for a JSON document to contain newlines.
The format is used for log files amongst other things. I discovered it when looking at the Lichess API documentation.
You can start listening in to a broadcast stream of NDJSON data part way through, wait for the next newline character and then start processing objects as and when they arrive.
If you go for NDJSON, you are at least following a standard and I think you'd be hard pressed to find an alternative standard to follow.
Example NDJSON
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
An old question, but hopefully this answer will be useful.
Most JSON readers crash on the control character U+001E, which is "information separator two" (the ASCII record separator). They declare it an "unexpected token", so I guess it has to be wrapped or escaped to pass.