Spark reading multiple files : double quotes replaced by %22 - json

I have a requirement to read random JSON files in different folders where data has changed, so I can't apply a regex pattern to the paths. I do know which files they are and can list them. But when I build a single string with all the file paths and try to read the JSON in Spark, the double quotes are replaced by %22 and the read fails. Could anyone please help?
val FilePath = "\"/path/2019/02/01/*\"" + "," + "\"/path/2019/02/05/*\"" + "," + "\"/path/2019/02/24/*\""
FilePath: String = "/path/2019/02/01/*","/path/2019/02/05/*","/path/2019/02/24/*"
Now when I use this variable to read JSON files, it fails with an error and the quotes are replaced by %22.
spark.read.json(FilePath)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: "/path/2019/02/01/*%22,%22/path/2019/02/05/*%22,%22/path/2019/02/24/*%22

I've just tried this with an older version of Spark (1.6.0) and it works fine if you supply separate paths or wildcard patterns as varargs to the json method, i.e.:
sqlContext.read.json("foo/*", "bar/*")
When you pass multiple patterns in a single string, Spark tries to construct a single URI from it, which is incorrect, and it URL-encodes the quote characters as %22.
As an aside, trying to create a URI is failing because your string starts with a double-quote, which is an illegal character in that position (RFC 3986):
Scheme names consist of a sequence of characters beginning with a
letter and followed by any combination of letters, digits, plus
("+"), period ("."), or hyphen ("-").
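The %22 in the error message is simply the percent-encoding of the double-quote character. A minimal Python sketch using the standard library (purely to illustrate the encoding; Spark's actual path handling happens in Hadoop's Path/URI classes):

```python
from urllib.parse import quote

# '"' is an illegal character in a URI, so URI builders percent-encode it as %22
encoded = quote('"/path/2019/02/01/*"', safe="/*")
print(encoded)  # %22/path/2019/02/01/*%22
```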

Add the paths to a list (e.g. pathList) and use it as below:
spark.read.option("basePath", basePath).json(pathList: _*)
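In Scala the idea is to build a Seq of path strings and expand it with `: _*` as above. Here is a minimal Python sketch (the variable name `bad` is made up for illustration) of turning the single comma-joined, quote-wrapped string into a clean list of patterns:

```python
# Hypothetical: the original comma-joined string with embedded quotes
bad = '"/path/2019/02/01/*","/path/2019/02/05/*","/path/2019/02/24/*"'

# Strip the embedded quotes and split into separate path patterns
paths = [p.strip('"') for p in bad.split(",")]
print(paths)  # ['/path/2019/02/01/*', '/path/2019/02/05/*', '/path/2019/02/24/*']
```

Each element of the resulting list is then a valid path pattern on its own, with no quote characters left to be percent-encoded.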

Related

Double quote handling when exporting JSON field with BigQuery

I am making use of the JSON datatype in BigQuery and I have a table that looks like this:
myStringField | myJSONField
----------------|----------------------------------
someStringValue | {"key1":"value1", "key1":"value2"}
In SQL, everything works fine. But, when it comes to exporting data, it gets messy. For instance, if I click the "Save results" button and if I choose the "CSV (local file)" option, I obtain the following content in my CSV:
myStringField,myJSONField
someStringValue,"{""key1"":""value1"", ""key1"":""value2""}"
As you can see, I get "double double quotes" inside my JSON and it makes things complicated to parse for the downstream system that receives the file.
I tried to fix it by using different combinations of JSON functions such as PARSE_JSON(), TO_JSON_STRING(), STRING() but nothing worked and, in some cases, it even made things worse ("triple double quotes").
Ideally, the expected output of my CSV should resemble this:
myStringField,myJSONField
someStringValue,{"key1":"value1", "key1":"value2"}
Any workaround?
According to the docs, exporting JSON fields to a CSV format has some limitations:
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
When you export a table in JSON format, the symbols <, >, and & are converted by using the unicode notation \uNNNN, where N is a hexadecimal digit. For example, profit&loss becomes profit\u0026loss. This unicode conversion is done to avoid security vulnerabilities.
Check out the export limitations here: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Regarding the export format you mentioned, that is the standard way to escape double-quote characters in CSV, so this is the expected output.
The outer quotes are there because of the CSV encoding of strings, and every double quote inside that string is escaped with another double quote:
"{""key1"":""value1""}"
If you parse this CSV with any standard parser, this format is supported with the right setup.
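A minimal Python sketch with the standard csv module (any RFC-4180-compliant parser behaves the same way) showing that the doubled quotes round-trip cleanly back to the original JSON text:

```python
import csv
import io

# The CSV line as exported by BigQuery, with doubled quotes inside the field
line = 'someStringValue,"{""key1"":""value1"", ""key1"":""value2""}"'

# A standard CSV reader collapses each doubled quote back into a single quote
row = next(csv.reader(io.StringIO(line)))
print(row[1])  # {"key1":"value1", "key1":"value2"}
```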

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!
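The RFC-4180 behavior described above can be checked with Python's standard csv module, which (unlike Splunk's parser) treats a backslash inside a quoted field as an ordinary character:

```python
import csv
import io

# The problematic row: a literal backslash just before a closing quote
line = '"field1","field2","field3\\","field4"'

# Python's csv module follows RFC 4180: backslash is not special,
# and quotes are escaped by doubling them, so this parses as 4 fields
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['field1', 'field2', 'field3\\', 'field4']
```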

spark csv writer - escape string without using quotes

I am trying to escape a delimiter character that appears inside data. Is there a way to do it by passing option parameters? I can do it with a UDF, but I am hoping it is possible using options.
val df = Seq((8, "test,me\nand your", "other")).toDF("number", "test", "t")
df.coalesce(1).write.mode("overwrite").format("csv").option("quote", "\u0000").option("delimiter", ",").option("escape", "\\").save("testcsv1")
But the escape is not working. The output file is written as:
8,test,me
and your,other
I want the output file to be written as:
8,test\,me\\nand your,other
I'm not certain, but I think if you had your sequence as
Seq((8, "test\\,me\\\\nand your", "other"))
and did not specify a custom escape character, it would behave as you expect and give you 8,test\,me\\nand your,other as the output. This is because \\ acts simply as the literal character '\' rather than as an escape, so the backslashes are printed where you want them and the n immediately after is not interpreted as part of a newline.
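The desired behavior (escape the delimiter with a backslash instead of wrapping the field in quotes) can be sketched with Python's csv module; this illustrates the escaping mechanism itself, not Spark's writer:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely; escapechar is then used
# to escape any delimiter that appears inside a field
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\")
writer.writerow([8, "test,me", "other"])
print(buf.getvalue())  # 8,test\,me,other
```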

Interpolating a JSON string removes JSON quotations

I have the following two lines of code:
json_str = _cases.to_json
path += " #{USER} #{PASS} #{json_str}"
When I use the debugger, I noticed that json_str appears to be formatted as JSON:
"[["FMCE","Wiltone","Wiltone","04/10/2018","Marriage + - DOM"]]"
However, when I interpolate it into another string, the quotes are removed:
"node superuser 123456 [["FMCE","Wiltone","Wiltone","04/10/2018","Marriage + - DOM"]]"
Why does string interpolation remove the quotes from JSON string and how can I resolve this?
I did find one solution to the problem, which was manually escaping the string:
json_str = _cases.to_json.gsub('"','\"')
path += " #{USER} #{PASS} \"#{json_str}\""
So basically I escape the double quotes generated in the to_json call. Then I manually add two escaped quotes around the interpolated variable. This will produce a desired result:
node superuser 123456 "[[\"FMCE\",\"Wiltone\",\"Wiltone\",\"04/10/2018\",\"Marriage + - DOM\"]]"
Notice how the outer quotes around the collection are not escaped, but the strings inside the collection are escaped. That will enable JavaScript to parse it with JSON.parse.
It is important to note that in this part:
json_str = _cases.to_json.gsub('"','\"')
it is adding a LITERAL backslash. Not an escape sequence.
But in this part:
path += " #{USER} #{PASS} \"#{json_str}\""
The \" wrapping the interpolated variable is an escape sequence and NOT a literal backslash.
Why do you think the first and last quote marks are part of the string? They do not belong to the JSON format. Your program’s behavior looks correct to me.
(Or more precisely, your program seems to be doing exactly what you told it to. Whether your instructions are any good is a question I can’t answer without more context.)
It's hard to tell with the small sample, but it looks like you might be getting quotes from your debugger output. Assuming the output of .to_json is a string (it usually is), then "#{json_str}" should be exactly equal to json_str. If it isn't, that's a bug in Ruby somehow (doubtful).
If you need the quotes, you need to either add them manually or escape the string using whatever escape function is appropriate for your use case. You could even use .to_json as your escape function ("#{json_str.to_json}", for example).
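The same point can be demonstrated in Python (the snippets above are Ruby; this is only an analogous sketch): a debugger typically shows a string's repr, and the surrounding quote marks in that display are not characters of the string itself:

```python
import json

cases = [["FMCE", "Wiltone", "Wiltone", "04/10/2018", "Marriage + - DOM"]]
json_str = json.dumps(cases)

# The repr (what a debugger typically shows) adds surrounding quotes...
print(repr(json_str))
# ...but the string itself starts with '[': the quotes are display artifacts
print(json_str[0])  # [
```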

Import huge data from CSV into Neo4j

When I try to import huge data into neo4j it gives following error:
there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Hello! I am trying to combine 2 variables to one variable. The variables are Public Folder Names and the ParentPath. Both can be found using Get-PublicFolder
Basically I want an array of Public Folders Path and Name so I will have an array like /Engineering/NewUsers
Below is my code
$parentpath = Get-PublicFolder -ResultSize Unlimited -Identity """ "'
It seems that some information may be lacking from your question, especially about the data being parsed, the stack trace, and so on.
Anyway, I think you can get around this by changing which character is treated as the quote character. How are you calling the import tool, and which version of Neo4j are you on?
Try including the argument --quote %; I'm picking % arbitrarily here as an alternative quote character. Would that help you?
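The effect of switching the quote character can be sketched with Python's standard csv module (analogous in spirit to neo4j-import's --quote option, not Neo4j's actual parser): once % is the quote character, embedded double quotes are just ordinary data and no longer confuse the parser.

```python
import csv
import io

# With '%' as the quote character, embedded double quotes are just data
line = '%Hello, "world"%,second'
row = next(csv.reader(io.StringIO(line), quotechar="%"))
print(row)  # ['Hello, "world"', 'second']
```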