I am trying to parse a fairly complex CSV with Apache Spark's CSV reader, which internally relies on the Apache Commons CSV library (https://github.com/databricks/spark-csv).
I tried different combinations of
quoteMode and escape but could not get it to work, i.e. prevent the exceptions. Do you have any hints as to which parameters would support such a nested structure?
ERROR CsvRelation$: Exception while parsing line: "Gabriella's Song" From The Motion Picture "The Mission";
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
I know that sed could be used to pre-process the data. However, it would be great if this were handled within Spark, i.e. if no further pre-processing were needed. I did not find any way to specify a regex or anything similar.
The CSV file looks like:
"Gabriella's Song" From The Motion Picture "The Mission";
This is related to https://github.com/databricks/spark-csv/issues/295
Some more special fields like
&
or "Eccoli; attenti ben (Don Pasquale)"
cause these problems. We will write our own CSV pre-processor for Apache Camel.
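For illustration, a minimal sketch of the kind of pre-processing step meant here, assuming the goal is to produce RFC 4180-compliant fields before Spark reads the file (the class name is illustrative, not an actual Camel processor):

public final class CsvFieldEscaper {
    // Double every embedded quote and wrap the whole field in quotes,
    // which makes fields like the song titles above safe for a CSV parser.
    public static String escapeField(String raw) {
        return "\"" + raw.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String field = "\"Gabriella's Song\" From The Motion Picture \"The Mission\"";
        // Prints: """Gabriella's Song"" From The Motion Picture ""The Mission"""
        System.out.println(escapeField(field));
    }
}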
Try this, it worked very well for me:
HDFS file:
spark.read.option("multiLine", true).option("delimiter", ",").csv(s"hdfs://{my-hdfs-file-path}")
Non-HDFS file:
spark.read.option("multiLine", true).option("delimiter", ",").csv(myLocalFilePath)
The above approach works for any delimited file; just change the delimiter value.
You could also use a regex, but that would be very inefficient for large files.
Hope this is helpful.
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --autodetect ERIC_KOLOTYLUK_BQ_POC_DATASET.Test2 small_data_clean.jsonl
seems to work well if the JSON is very clean, but it is very fragile, with abstruse diagnostics, when the JSON is not. Well, the feature is still experimental, so there is no point in complaining. For example, if JSON property names contain - characters, these are valid JSON, but they are not valid BigQuery column names.
My question is: are there existing tools/utilities for ingesting generic JSON into BigQuery that work better than --autodetect?
Presumably Google will improve --autodetect over time, but for now I am looking for any advice/experience people may have. I have already written some code to replace - with _ in property names, so I was wondering if other people have similarly created tools/utilities...
As you've mentioned in your sample scenario, you have a JSON property that has "-" in its name; however, per the Specifying a Schema documentation,
A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore.
Any characters not compliant with the above will result in an error when the columns are created during schema definition.
On the other hand, when using --autodetect, BigQuery's official documentation on schema auto-detection already carries a disclaimer saying:
When BigQuery detects schemas, it might, on rare occasions, change a
field name to make it compatible with BigQuery SQL syntax.
Since there are no other tools yet available to auto-correct/format JSON data to fit BigQuery's schema requirements, the best approach for this kind of scenario is to write code that replaces the unwanted characters in the JSON property names, which you have already done.
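A minimal sketch of such a sanitizer, assuming Jackson is on the classpath (the class and method names are illustrative, not an official Google utility); it recursively renames properties to satisfy the column-name rules quoted above:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.ArrayList;
import java.util.List;

public final class BigQueryKeySanitizer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Recursively rewrite property names: keep letters, digits and
    // underscores, and prefix a leading digit with an underscore.
    static void sanitize(JsonNode node) {
        if (node.isObject()) {
            ObjectNode obj = (ObjectNode) node;
            List<String> names = new ArrayList<>();
            obj.fieldNames().forEachRemaining(names::add);
            for (String name : names) {
                JsonNode child = obj.get(name);
                String clean = name.replaceAll("[^A-Za-z0-9_]", "_");
                if (!clean.isEmpty() && Character.isDigit(clean.charAt(0))) {
                    clean = "_" + clean;
                }
                if (!clean.equals(name)) {
                    obj.remove(name);
                    obj.set(clean, child);
                }
                sanitize(child);
            }
        } else if (node.isArray()) {
            for (JsonNode child : node) {
                sanitize(child);
            }
        }
    }

    // Sanitize one line of a NEWLINE_DELIMITED_JSON file.
    public static String sanitizeLine(String jsonLine) throws Exception {
        JsonNode root = MAPPER.readTree(jsonLine);
        sanitize(root);
        return MAPPER.writeValueAsString(root);
    }
}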
I am having trouble identifying the proper format for storing a JSON request body in a CSV file and then using the CSV value in a scenario.
This works properly within a scenario:
And request '{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}'
However, when it is stored in a CSV file as follows (I've tried quite a number of other formatting variations),
'{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}',
and used in scenario as:
And request requestBody
my test returns "javascript evaluation failed: " followed by the JSON above and :1:63 Missing close quote ^ in <eval> at line number 1 at column number 63.
Can you please identify the correct formatting, or the usage error I am missing? Thanks.
We just use a basic CSV library behind the scenes. I suggest you roll your own Java helper class that does whatever processing / pre-processing you need.
Do read this answer as well: https://stackoverflow.com/a/54593057/143475
I can't make sense of your JSON but if you are trying to fit JSON into CSV, sorry - that's not a good idea. See this answer: https://stackoverflow.com/a/62449166/143475
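For illustration, a minimal sketch of such a helper, assuming the request bodies are kept in a plain text file with one JSON document per line (all names here are illustrative); Karate can call it through Java interop, which sidesteps the CSV quoting problem entirely:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class JsonLinesReader {
    // Return the JSON document on the given zero-based line of the file.
    public static String readLine(String path, int index) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(path));
        return lines.get(index);
    }
}

Usage in a feature file could then look something like:

* def body = Java.type('JsonLinesReader').readLine('requests.jsonl', 0)
And request body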
I'm using the Spring Cloud Data Flow server, and I am polling CSV files with the Time source and the HTTP Client processor.
Now I want to split the polled CSV file and pipeline individual line-by-line messages. Since the HttpClientProcessor only polls entire files, I'm using a FileSplitter processor to achieve that, but I'm stuck. The relevant FileSplitter options are delimiters and expression.
The delimiters option's hint says:
When expression is null, delimiters to use when tokenizing {@link String} payloads.
The expression option's hint says:
A SpEL expression for splitting payloads.
I've tried lots of possibilities, well beyond just a simple \n, for both options, without success.
The expression option always fires the following error:
Failed to bind properties under 'splitter.expression' to org.springframework.expression.Expression:
Property: splitter.expression
Value: \n
Origin: System Environment Property "SPRING_APPLICATION_JSON"
Reason: failed to convert java.lang.String to org.springframework.expression.Expression
Approaches using the delimiters option lead to a successful start of the applications, but my log sink does not receive any input. My delimiter options apparently differ from what is recognized as the real newline character in the CSV.
Does anyone have an idea what option(s) I have to provide for delimiters or expression in order to split the CSV message line by line?
Implementing my own FileSplitter processor app seems like overkill, but I will do it if need be...
Here is an example of working syntax, with the expression argument providing \\n:
stream create --name name --definition "http --port=9090 | splitter --expression=payload.split('\\n') | custom | log"
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "CSV" but as a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, for example, relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that; I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert it into Parquet?
For anybody else who might be looking for an answer, the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
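A minimal sketch of what such a conversion function might look like, assuming column 0 is a String label and column 1 holds a German-formatted number (the schema and names are illustrative):

import java.text.NumberFormat;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class GermanNumberConverter {
    public static Row convert(Row row) throws Exception {
        // NumberFormat is not thread-safe, so create one per call
        // (or keep one per partition via mapPartitions).
        NumberFormat german = NumberFormat.getInstance(Locale.GERMANY);
        String label = row.getString(0);
        // Parses "123,01" as 123.01
        double value = german.parse(row.getString(1)).doubleValue();
        return RowFactory.create(label, value);
    }
}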
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
I'd like to save multiple json strings to a file and separate them by a delimiter, such that it will be easy to read this list in, split on the delimiter and work with each json doc separately.
Serializing using a json array is not an option due to external reasons.
I would like to use a delimiter that is illegal in JSON (e.g. delimiting using a comma would be a bad idea since there are commas within the json strings).
Are there any characters that are not considered legal in JSON serialized strings?
I know it's not exactly what you needed, but you can use this SO answer to write the JSON string to a CSV, then read it on the other side using a good streaming CSV reader such as this one.
NDJSON
Have a look at NDJSON (Newline delimited JSON).
http://ndjson.org/
It seems to me to be exactly how you should do things, though it's not exactly what you asked for. (If you can't flatten your JSON objects into single lines, then it's not for you!) You asked for a delimiter that is not allowed in JSON. A literal newline is allowed in JSON as whitespace between tokens (inside strings it must be escaped as \n), but no JSON document ever needs to contain one.
The format is used for log files amongst other things. I discovered it when looking at the Lichess API documentation.
You can start listening in to a broadcast stream of NDJSON data part way through, wait for the next newline character and then start processing objects as and when they arrive.
If you go for NDJSON, you are at least following a standard and I think you'd be hard pressed to find an alternative standard to follow.
Example NDJSON
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
An old question, but hopefully this answer will be useful.
Most JSON readers crash on the character U+001E, which is information separator two (also known as the record separator). They declare it an "unexpected token", so I guess it has to be stripped or handled separately before parsing, which is exactly what makes it usable as a delimiter between documents.
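A minimal sketch of using U+001E as the delimiter, loosely in the spirit of RFC 7464 (JSON Text Sequences), which uses the record separator for exactly this purpose:

public class RecordSeparatorDemo {
    private static final String RS = "\u001E"; // U+001E, information separator two

    public static void main(String[] args) {
        // Join standalone JSON documents with the separator...
        String joined = String.join(RS,
                "{\"some\":\"thing\"}",
                "{\"foo\":17,\"bar\":false}");
        // ...and split on it to get each document back intact.
        for (String doc : joined.split(RS)) {
            System.out.println(doc);
        }
    }
}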