I need to process a CSV file obtained from a government site. The file has two different format issues that cannot both be handled by Camel's CsvDataFormat unmarshal. Minimal test file:
Registration No,Trade Name
"A009928","Rotagen "Combo""
"A010343","Vet Direct Abamectin Wormer, Bot + Tape"
Using this code to unmarshal:
CsvDataFormat csv = new CsvDataFormat();
csv.setDelimiter(",");
csv.setQuoteDisabled(true);
csv.setUseMaps(false);

from("file://c:/temp?fileName=test.csv&noop=true")
    .unmarshal(csv)
    .process(new Processor() {
        public void process(Exchange exchange) throws Exception {
            List<List<String>> rows = (List<List<String>>) exchange.getIn().getBody();
            for (int j = 0; j < rows.size(); j++) {
                List<String> row = rows.get(j);
                for (int i = 0; i < row.size(); i++) {
                    log.info("ITEM[" + row.get(i) + "]");
                }
            }
        }
    });
When setQuoteDisabled(false) I get:
java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
When setQuoteDisabled(true) the file is unmarshalled, but the third line ends up being split once more at the extra ','.
Here's the output:
13:10| INFO | MainRoute.java 54 | ITEM[Registration No]
13:10| INFO | MainRoute.java 54 | ITEM[Trade Name]
13:10| INFO | MainRoute.java 54 | ITEM["A009928"]
13:10| INFO | MainRoute.java 54 | ITEM["Rotagen "Combo""]
13:10| INFO | MainRoute.java 54 | ITEM["A010343"]
13:10| INFO | MainRoute.java 54 | ITEM["Vet Direct Abamectin Wormer]
13:10| INFO | MainRoute.java 54 | ITEM[ Bot + Tape"]
How do I configure CsvDataFormat to unmarshal both rows correctly?
Well, this is a problem of CSV as a "soft standard". Rows and delimiters are more or less standardized, but when it comes to quotes, it gets complicated.
Since your data is quoted (i.e. every field value is in quotes), the correct configuration would be
setQuoteDisabled(false)
The second record works fine with this configuration.
"A010343","Vet Direct Abamectin Wormer, Bot + Tape"
Because the fields are enclosed in quotes, the comma inside the data is no problem.
However, the first record contains quotes inside the data.
"A009928","Rotagen "Combo""
According to RFC 4180, Section 2, Rule 7, such quotes must be escaped with an additional double quote:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
"A009928","Rotagen ""Combo"""
You could try to fix this manually in one record to see if it works like this.
Generally, you have multiple options:
Inform the data provider that their data is not RFC-4180 compliant and ask them to fix it
Fix the data upfront, before you read it with Camel
Parse or pre-process the data yourself and compensate for the quote problem (see the sketch below)
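For the last two options, here is a rough, hand-rolled sketch (this is not a built-in Camel or commons-csv feature): a processor that runs before the unmarshal step and doubles every quote that is not directly at a field boundary, turning the file into RFC-4180 compliant input. It assumes the whole file fits in memory as a String and will guess wrong if a stray inner quote happens to sit right next to a comma.

from("file://c:/temp?fileName=test.csv&noop=true")
    .process(exchange -> {
        String text = exchange.getIn().getBody(String.class);
        StringBuilder fixed = new StringBuilder();
        for (String line : text.split("\r?\n")) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                // a quote is legitimate if it opens a field (line start / after ',')
                // or closes one (line end / before ',')
                boolean opening = c == '"' && (i == 0 || line.charAt(i - 1) == ',');
                boolean closing = c == '"' && (i == line.length() - 1 || line.charAt(i + 1) == ',');
                if (c == '"' && !opening && !closing) {
                    fixed.append("\"\"");   // escape the stray inner quote per RFC-4180
                } else {
                    fixed.append(c);
                }
            }
            fixed.append('\n');
        }
        exchange.getIn().setBody(fixed.toString());
    })
    .unmarshal(csv)      // csv configured with setQuoteDisabled(false)
    .to("log:rows");     // or your existing processor

With the quotes repaired this way, both sample records should unmarshal into two fields each.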
The second line of your .csv file violates the rules of quoting in CSV, or at least as they are understood by the default options of commons-csv (the library Camel uses under the hood for this).
The default way of dealing with quotes inside quoted fields is to escape the inner quote by doubling it. Keep setQuoteDisabled(false) and correct the second line of your .csv file to:
"A009928","Rotagen ""Combo"""
let mut file = Writer::from_path(output_path)?;
file.write_record([5.34534536546, 34556.456456467567567, 345.56465456])?;
Produces the following error:
error[E0277]: the trait bound `{float}: AsRef<[u8]>` is not satisfied
--> src/main.rs:313:27
|
313 | file.write_record([5.34534536546, 34556.456456467567567, 345.56465456])?;
| ------------ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `AsRef<[u8]>` is not implemented for `{float}`
| |
| required by a bound introduced by this call
|
= help: the following implementations were found:
<&T as AsRef<U>>
<&mut T as AsRef<U>>
<Arc<T> as AsRef<T>>
<Box<T, A> as AsRef<T>>
and 44 others
note: required by a bound in `Writer::<W>::write_record`
--> /home/mlueder/.cargo/registry/src/github.com-1ecc6299db9ec823/csv-1.1.6/src/writer.rs:896:12
|
896 | T: AsRef<[u8]>,
| ^^^^^^^^^^^ required by this bound in `Writer::<W>::write_record`
Is there any way to use the csv crate with numbers instead of structs or characters?
Only strings or raw bytes can be written to a file; if you try to give the writer something else, it doesn't know how to handle the data (as @SilvioMayolo mentioned). You can map your float array to one of strings, and then you can write the string array to the file.
let float_arr = [5.34534536546, 34556.456456467567567, 345.56465456];
let string_arr = float_arr.map(|e| e.to_string()); // convert each float to a String
file.write_record(&string_arr)?;                   // &[String; 3] satisfies the AsRef<[u8]> bound
This can obviously be combined into one line without the extra variable, but splitting it apart makes the extra step easier to see.
I have a CSV similar to this (the original file is proprietary, so I cannot share it). The separator is a tab.
It contains a description column whose text is wrapped in double quotes and can contain quoted strings where, wait for it, the escape sequence is also a double quote.
id description other_field
12 "Some Description" 34
56 "Some
Multiline
""With Escaped Stuff""
Description" 78
I am parsing the file with this code
let mut reader = csv::ReaderBuilder::new()
    .from_reader(file);
for record in reader.deserialize() {
    let record: Record = record?; // `Record` being the row struct (id, description, other_field)
    // ...
}
I'm consistently getting this CSV deserialize error:
CSV deserialize error: record 43747 (line: 43748, byte: 21082563): missing field 'id'
I tried using flexible(true) and double_quote(true) with no luck.
Is it possible to parse this type of field, and if so, how?
Actually the issue was unrelated; the csv crate (with serde) parses this perfectly. I had just forgotten to define the delimiter (a tab in this case). This code works:
let mut reader = csv::ReaderBuilder::new()
    .delimiter(b'\t')
    .from_reader(file);
for record in reader.deserialize() {
    let record: Record = record?; // same row struct as before
    // ...
}
I have a small Spark 3.x cluster setup. I have read some data, and after transformations I have to save it as JSON. The problem I am facing is that, for array-type columns, Spark adds extra double quotes when the data is written as a JSON file.
Sample data-frame data
I am saving this data frame as JSON with the following command:
df.write.json("Documents/abc")
The saved output is as follows
Finally, the schema info is as follows
The elements of the string array contain double quotes within the data, e.g. the first element is "Saddar Cantt, Lahore Punjab Pakistan" instead of Saddar Cantt, Lahore Punjab Pakistan. You can remove the extra double quotes from the strings before writing the json with transform and replace:
df.withColumn("ADDRESS", F.expr("""transform(ADDRESS, a -> replace(a, '"'))""")) \
.write.json("Documents/abc")
If we enforce a schema before writing the DataFrame as JSON, I believe we can work around such issues without having to replace or change any characters.
Without schema ->
df.show()
id | address
1 | ['Saddar Cantt, Lahore Punjab Pakistan', 'Shahpur']
df.write.json("path")
{"id":"1","address":["'Saddar Cantt, Lahore Punjab Pakistan'","'Shahpur'"]}
With Schema ->
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

df = df.withColumn('address', F.col('address').cast(StringType()))
df = df.withColumn('address', F.from_json(F.col('address'), ArrayType(StringType())))
df.write.json("path")
{"id":"1","address":["Saddar Cantt, Lahore Punjab Pakistan","Shahpur"]}
from_json only takes a string as input, hence we first need to cast the array to a string.
I am using grades.csv data from the link below,
https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html
I noticed that all the strings in the CSV file were wrapped in double quotes (""), and that caused error messages:
Neo.ClientError.Statement.SemanticError: Cannot merge node using null property value for Test1
so I removed the quotes ("") from the headers.
the code I was trying to run:
LOAD CSV WITH HEADERS FROM 'file:///grades.csv' AS row
MERGE (t:Test1 {Test1: row.Test1})
RETURN count(t);
error message:
Neo.ClientError.Statement.SyntaxError: Type mismatch: expected Any, Map, Node, Relationship, Point, Duration, Date, Time, LocalTime, LocalDateTime or DateTime but was List<String> (line 2, column 24 (offset: 65))
"MERGE (t:Test1 {Test1: row.Test1})
Basically, you cannot merge a node using a null property value. In your case, Test1 must be null for one or more lines in your file. If you don't see blank values for Test1, check whether there is a blank line at the end of the file.
You can also handle the null check before the MERGE using WITH ... WHERE, like:
LOAD CSV ...
WITH row
WHERE row.Test1 IS NOT NULL
MERGE (t:Test1 {Test1: row.Test1})
RETURN count(t);
The issues are:
The file is missing a comma after the Test1 value in the row for "Airpump".
The file has white spaces between the values in each row. (Search for the regexp ", +" and replace with ",".)
Your query should work after fixing the above issues.
In Flink, parsing a CSV file using readCsvFile raises an exception when encountering a field containing quotes, like "Fazenda São José ""OB"" Airport":
org.apache.flink.api.common.io.ParseException: Line could not be parsed: '191,"SDOB","small_airport","Fazenda São José ""OB"" Airport",-21.425199508666992,-46.75429916381836,2585,"SA","BR","BR-SP","Tapiratiba","no","SDOB",,"SDOB",,,'
I've found in this mailing list thread and this JIRA issue that quotes inside a field should be escaped with the \ character, but I don't have control over the data and cannot modify it. Is there a way to work around this?
I've also tried using ignoreInvalidLines() (which is the less preferable solution) but it gave me the following error:
08:49:05,737 INFO org.apache.flink.api.common.io.LocatableInputSplitAssigner - Assigning remote split to host localhost
08:49:05,765 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: CHAIN DataSource (at main(Job.java:53) (org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at main(Job.java:54)) -> Combine(SUM(1), at main(Job.java:56) (2/8)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.flink.api.common.io.GenericCsvInputFormat.skipFields(GenericCsvInputFormat.java:443)
at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:412)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:111)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:79)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Here is my code:
DataSet<Tuple2<String, Integer>> csvInput = env.readCsvFile("resources/airports.csv")
    .ignoreFirstLine()
    .ignoreInvalidLines()
    .parseQuotedStrings('"')
    .includeFields("100000001")
    .types(String.class, String.class)
    .map((Tuple2<String, String> value) -> new Tuple2<>(value.f1, 1))
    .groupBy(0)
    .sum(1);
If you cannot change the input data, then you should turn off parseQuotedStrings(). This will simply look for the next field delimiter and return everything in between as a string (including the quotation marks). You can then remove the leading and trailing quotation mark in a subsequent map operation.
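A rough sketch of that workaround, adapted from the snippet above (the quote-trimming regex and the returns() type hint are my additions, not something Flink requires for this particular file; env and the other imports are assumed to be the same as in the question, with TypeHint coming from org.apache.flink.api.common.typeinfo):

DataSet<Tuple2<String, Integer>> csvInput = env.readCsvFile("resources/airports.csv")
    .ignoreFirstLine()
    .includeFields("100000001")
    .types(String.class, String.class)                     // no parseQuotedStrings(): fields keep their surrounding quotes
    .map((Tuple2<String, String> value) -> {
        // strip one leading and one trailing quotation mark, if present
        // (doubled inner quotes such as ""OB"" would still need collapsing if they matter downstream)
        String airport = value.f1.replaceAll("^\"|\"$", "");
        return new Tuple2<>(airport, 1);
    })
    .returns(new TypeHint<Tuple2<String, Integer>>() {})   // result type hint for the lambda
    .groupBy(0)
    .sum(1);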