How to avoid quoted commas being treated as delimiters when parsing a CSV in JRuby

I'm trying to use the following line to parse a CSV into a table in JRuby.
require 'csv'

# Parse the CSV file into a table, treating the first row as headers
table = CSV.parse(File.read(tempFileName), headers: true)
The CSV file I'm using as input may have text columns that include commas. For those cases, I've wrapped the columns in double quotation marks to indicate that the internal commas should not be treated as delimiters.
For example, I could have the following CSV:
Address No, Alpha Name
1, Marcelo
2, "Surname, Name"
However, when I execute the code, I obtain the following error:
"Exception": "Message": "org.jruby.embed.InvokeFailedException:
(MalformedCSVError) Illegal quoting in line 2.: tat RUBY.main
Is there any way to avoid this error by indicating the correct quoting character, and will that also prevent the internal commas from being treated as column separators?
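For reference, the quoting trap in the example is likely the space between the delimiter and the opening quote in 2, "Surname, Name": under strict RFC 4180 rules a quote must immediately follow the delimiter, so the field is treated as unquoted and the quote is flagged as illegal. A minimal JVM-side sketch with OpenCSV (an assumption for illustration, not the asker's csv library) showing a parser that tolerates that whitespace:

import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import java.util.Arrays;

public class LenientQuoteDemo {
    public static void main(String[] args) throws Exception {
        // Space after the comma: a strict parser sees the quote mid-field.
        String line = "2, \"Surname, Name\"";
        CSVParser lenient = new CSVParserBuilder()
                .withIgnoreLeadingWhiteSpace(true) // skip blanks before an opening quote
                .build();
        // Prints two fields, with the quoted comma preserved: [2, Surname, Name]
        System.out.println(Arrays.toString(lenient.parseLine(line)));
    }
}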

Related

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!

Snowflake: how to escape all special characters in the strings of an array of objects before parsing it as JSON?

We are loading data into Snowflake using a JavaScript procedure.
The script loops over an array of objects to load some data. These objects contain strings that may include special characters.
e.g.:
"Description": "This file contain "sensitive" information."
The double quotes around the word "sensitive" become:
"Description": "This file contain \"sensitive\" information."
which breaks the loading script.
The same issue happened when we used HTML tags within the description key:
"Description": "Please use <b>specific fonts</b> to update the file".
This is another example on the Snowflake community site.
This post also recommended setting FIELD_OPTIONALLY_ENCLOSED_BY to the special character, but I am handling a large data set that might contain all of the special characters.
How can we escape special characters automatically, without updating the script to loop over the whole array in JavaScript and replace each special character with something else?
EDIT
I tried using JSON_EXTRACT_PATH_TEXT:
select JSON_EXTRACT_PATH_TEXT(parse_json('{
"description": "Please use \"Custom\" fonts"
}'), 'description');
and got the following error:
Error parsing JSON: missing comma, line 2, pos 33.
I think the escape characters generated by the JS procedure are consumed (stripped) when the string is passed to SQL functions.
'{"description": "Please use \"Custom\" fonts"}'
becomes
'{"description": "Please use "Custom" fonts"}'
Therefore parsing it as JSON, or fetching a field from it, fails. To avoid the error, the JavaScript procedure should generate a double backslash instead of a single backslash:
'{"description": "Please use \\"Custom\\" fonts"}'
I do not think there is a way to prevent this error without modifying the JavaScript procedure.
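A minimal sketch of what that modification amounts to (shown in Java for illustration; the real procedure is JavaScript, and escapeForSqlLiteral is a hypothetical name): double every backslash before the JSON string is embedded in a single-quoted SQL literal, because the SQL parser consumes one level of escaping before PARSE_JSON sees the text.

public class SqlLiteralEscape {
    // Hypothetical helper: prepare a JSON snippet for embedding in a
    // single-quoted SQL literal. Backslashes are doubled and single
    // quotes escaped so the literal survives one round of SQL parsing.
    static String escapeForSqlLiteral(String json) {
        return json.replace("\\", "\\\\")
                   .replace("'", "\\'");
    }

    public static void main(String[] args) {
        String json = "{\"description\": \"Please use \\\"Custom\\\" fonts\"}";
        // Prints {"description": "Please use \\"Custom\\" fonts"}
        System.out.println(escapeForSqlLiteral(json));
    }
}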
I came across this today; Gokhan is right, you need the double backslashes to properly escape the quote.
Here are a couple links that explain it a little more:
https://community.snowflake.com/s/article/Escaping-new-line-character-in-JSON-to-avoid-data-loading-errors
https://community.snowflake.com/s/article/Unable-to-Insert-Data-Containing-Back-Slash-from-Stored-Procedure
For my case I found that I could address this challenge by disabling the escaping and then manually replacing the special characters using the REPLACE function.
For your example the replace is not necessary.
select parse_json($${"description": "Please use \"Custom\" fonts"}$$);
select parse_json($${"description": "Please use \"Custom\" fonts"}$$):description;

Spark reading multiple files : double quotes replaced by %22

I have a requirement to read arbitrary JSON files from different folders where the data has changed, so I can't apply a regex pattern to read them. I know which files they are and can list them. But when I build one string containing all the file paths and try reading the JSON in Spark, the double quotes are replaced by %22 and the read fails. Could anyone please help?
val FilePath = "\"/path/2019/02/01/*\"" + "," + "\"/path/2019/02/05/*\"" + "," + "\"/path/2019/02/24/*\""
FilePath: String = "/path/2019/02/01/*","/path/2019/02/05/*","/path/2019/02/24/*"
Now when I use this variable to read the JSON files, it fails with an error and the quotes are replaced by %22.
spark.read.json(FilePath)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: "/path/2019/02/01/*%22,%22/path/2019/02/05/*%22,%22/path/2019/02/24/*%22
I've just tried this with an older version of Spark (1.6.0) and it works fine if you supply separate paths or wildcard patterns as varargs to the json method, i.e.:
sqlContext.read.json("foo/*", "bar/*")
When you pass multiple patterns in a single string, Spark tries to construct a single URI from them, which is incorrect, and it URL-encodes the quote characters as %22.
As an aside, trying to create a URI is failing because your string starts with a double-quote, which is an illegal character in that position (RFC 3986):
Scheme names consist of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus ("+"), period ("."), or hyphen ("-").
Add the paths to a list (e.g. pathList) and use it as below:
spark.read.option("basePath", basePath).json(pathList: _*)
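The same idea in the Java API, using the Spark 2.x SparkSession (a sketch; paths borrowed from the question): every path or glob goes in as its own argument, never as one comma-joined, quote-wrapped string.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadManyJsonPaths {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-many-json-paths")
                .getOrCreate();
        // Each path is a separate vararg, so Spark resolves each glob
        // individually instead of URL-encoding one malformed URI.
        Dataset<Row> df = spark.read().json(
                "/path/2019/02/01/*",
                "/path/2019/02/05/*",
                "/path/2019/02/24/*");
        df.printSchema();
        spark.stop();
    }
}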

Is CSV data with missing leading quotations considered malformed?

I am using OpenCSV to read CSV files. Looking over the docs, I don't see guidelines on how to handle malformed data.
I have a CSV file. It comes with all the expected features: each field is separated by a comma, and each field is surrounded by quotes in case one of the values contains a comma. However, every line (except the header) is missing a leading quote. Here is an example:
"Header 1","Header2"
value1","value2"
value1","value2"
The CSV parser ended up skipping every other line because of the way the quotes lined up, which obviously causes problems.
Since I know what the data should look like, I would consider this an error: the first column is missing its quotation mark. But as far as the CSV spec is concerned, might this be considered valid? If so, I suppose I would have to build extra checks myself to make sure that I am not missing any lines, despite the file containing 'valid' CSV data.
According to the RFC for CSV files:
While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.
So simply put, malformed? No. Informal? No. Even this article (linked in the RFC) mentions that lines can be mismatched with quotes and no quotes.
For the data you show:
"Header 1","Header2"
value1","value2"
value1","value2"
we could argue the data is not malformed if the fields are considered unquoted, never contain a separator, and there are no multiline fields, which would give the values:
"Header 1" "Header2"
value1" "value2"
value1" "value2"
Of course it's obvious this data was meant to have quoted fields. In that case the data is certainly malformed, and could be parsed differently with different parsers (maybe even as multiline fields).
Valid options would be:
value1,value2 // no quotes at all
"value1","value2" // all quoted
value1,"value2,more data" // only quoted when there is a separator inside

Weka and CSV files

I'm currently trying to import some data into Weka. The data is currently in a CSV file and consists of a numerical ID followed by some string data (tweets). I'm getting an error reading "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using quotes as my enclosure characters for the string data. I understand that something (presumably an EOL character?) is causing Weka to incorrectly split some of the string data into multiple entries, but I'm not sure how to fix the EOL token problem.
My data set can be viewed here. The current data set is on Sheet 2:
https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing
The text file itself may be found here:
https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing
The error now occurs on the 3rd line, with the same message. The only newline character there is the one at the end of the line denoting a new entry, so I'm not sure why it's having issues.
In its datasets, Weka treats a newline character as the end of an instance. Your line 17 is actually a multi-line tweet, which confuses Weka. You can either
use a regex to strip the newline characters from every single tweet (see the sketch below), or
clean the tweets while downloading them, removing any newline characters.
Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
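A minimal sketch of the regex route (stripNewlines is a hypothetical helper name): collapse any carriage returns or line feeds inside a tweet to a single space before the tweet is written to the CSV.

public class TweetCleaner {
    // Collapse embedded newlines so each tweet stays on one CSV line
    // and Weka sees exactly one instance per row.
    static String stripNewlines(String tweet) {
        return tweet.replaceAll("[\\r\\n]+", " ").trim();
    }

    public static void main(String[] args) {
        // Prints: tweet line one line two
        System.out.println(stripNewlines("tweet line one\nline two"));
    }
}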
EDIT
Okay, here are some other things that need to be fixed (according to the edits in your question):
Replace ' with \'
Replace the grave accent (`) with \`
Many tweets contain quotes inside quotes. The inside double quotes (") should be replaced by \"
If you put your tweets inside double quotes, then your header should be id, "text"
Some tweets contain two consecutive double quotes; get rid of them or replace them with \".
I cannot say exactly where, because I lost track, but I think some tweets still contain newlines in them (or at least one tweet does).
These are just a few things that I noticed. There might be more. Time will tell.
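A sketch collecting the replacements from the list above into one hypothetical helper (escapeForWeka is an assumed name; the order matters, since doubled quotes must be collapsed before the remaining quotes are escaped):

public class WekaEscaper {
    static String escapeForWeka(String tweet) {
        return tweet
                .replace("\"\"", "\"")   // collapse consecutive double quotes first
                .replace("'", "\\'")     // escape single quotes
                .replace("`", "\\`")     // escape grave accents
                .replace("\"", "\\\"");  // escape remaining double quotes
    }

    public static void main(String[] args) {
        // Prints: He said \"hi\" y\'all
        System.out.println(escapeForWeka("He said \"\"hi\"\" y'all"));
    }
}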