spark csv writer - escape string without using quotes - csv

I am trying to escape delimiter character that appears inside data. Is there a way to do it by passing option parameters? I can do it from udf, but I am hoping it is possible using options.
val df = Seq((8, "test,me\nand your", "other")).toDF("number", "test", "t")
df.coalesce(1).write.mode("overwrite").format("csv").option("quote", "\u0000").option("delimiter", ",").option("escape", "\\").save("testcsv1")
But the escape is not working. The output file is written as
8,test,me
and your,other
I want the output file to be written as.
8,test\,me\\nand your,other

I'm not certain, but I think if you had your sequence as
Seq((8, "test\\,me\\\\nand your", "other"))
and did not specify a custom escape character, it would behave as you are expecting and give you 8,test\,me\\nand your,other as the output. This is because \\ acts simply as the character '\' rather than an escape, so they are printed where you want and the n immediately after is not interpreted as part of a newline character.

Related

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!

How to get PostgreSQL to escape text from jsonb_array_element?

I'm loading some JSON from Postgres 13 into Elasticsearch using Logstash and ran into some errors caused by text not being escaped with reverse solidus. I tracked my problem down to this behavior:
SELECT
json_build_object(
'literal_text', 'abc\ndef'::text,
'literal_text_type', pg_typeof('abc\ndef'::text),
'text_from_jsonb_array_element', a->>0,
'jsonb_array_element_type', pg_typeof(a->>0)
)
FROM jsonb_array_elements('["abc\ndef"]') jae (a);
{
"literal_text": "abc\\ndef",
"literal_text_type": "text",
"text_from_jsonb_array_element": "abc\ndef",
"jsonb_array_element_type":"text"
}
db-fiddle
json_build_object encodes the literal text as expected (turning \n into \\n); however, it doesn't encode the text retrieved via jsonb_array_element even though both are text.
Why is the text extracted from jsonb_array_element being treated differently (not getting escaped by jsonb_build_object)? I've tried casting, using jsonb_array_elements_text (though my actual use case involves an array of arrays, so I need to split to a set of jsonb), and various escaping/encoding/formatting functions, but haven't found a solution yet.
Is there a trick to cast text pulled from jsonb_array_element so it will get properly encoded by jsonb_build_object?
Thanks for any hints or solutions.
Those strings look awfully similar, but they're actually different. When you create a string literal like '\n', that's a backslash character followed by an "n" character. So when you put that into json_build_object, it needs to add a backslash to escape the backslash you're giving it.
On the other hand, when you call jsonb_array_elements('["abc\ndef"]'), you're saying that the JSON has precisely a \n encoded in it with no second backslash, and therefore when it's converted to text, that \n is interpreted as a newline character, not two separate characters. You can see this easily by running the following:
SELECT a->>0 FROM jsonb_array_elements('["abc\ndef"]') a;
?column?
----------
abc +
def
(1 row)
On encoding that back into a JSON, you get a single backslash again, because it's once again encoding a newline character.
If you want to escape it with an extra backslash, I suggest a simple replace:
SELECT
json_build_object(
'text_from_jsonb_with_replace', replace(a->>0, E'\n', '\n')
)
FROM jsonb_array_elements('["abc\ndef"]') jae (a);
json_build_object
------------------------------------------------
{"text_from_jsonb_with_replace" : "abc\\ndef"}

Remove backslash from nested json [duplicate]

When I create a string containing backslashes, they get duplicated:
>>> my_string = "why\does\it\happen?"
>>> my_string
'why\\does\\it\\happen?'
Why?
What you are seeing is the representation of my_string created by its __repr__() method. If you print it, you can see that you've actually got single backslashes, just as you intended:
>>> print(my_string)
why\does\it\happen?
The string below has three characters in it, not four:
>>> 'a\\b'
'a\\b'
>>> len('a\\b')
3
You can get the standard representation of a string (or any other object) with the repr() built-in function:
>>> print(repr(my_string))
'why\\does\\it\\happen?'
Python represents backslashes in strings as \\ because the backslash is an escape character - for instance, \n represents a newline, and \t represents a tab.
This can sometimes get you into trouble:
>>> print("this\text\is\not\what\it\seems")
this ext\is
ot\what\it\seems
Because of this, there needs to be a way to tell Python you really want the two characters \n rather than a newline, and you do that by escaping the backslash itself, with another one:
>>> print("this\\text\is\what\you\\need")
this\text\is\what\you\need
When Python returns the representation of a string, it plays safe, escaping all backslashes (even if they wouldn't otherwise be part of an escape sequence), and that's what you're seeing. However, the string itself contains only single backslashes.
More information about Python's string literals can be found at: String and Bytes literals in the Python documentation.
As Zero Piraeus's answer explains, using single backslashes like this (outside of raw string literals) is a bad idea.
But there's an additional problem: in the future, it will be an error to use an undefined escape sequence like \d, instead of meaning a literal backslash followed by a d. So, instead of just getting lucky that your string happened to use \d instead of \t so it did what you probably wanted, it will definitely not do what you want.
As of 3.6, it already raises a DeprecationWarning, although most people don't see those. It will become a SyntaxError in some future version.
In many other languages, including C, using a backslash that doesn't start an escape sequence means the backslash is ignored.
In a few languages, including Python, a backslash that doesn't start an escape sequence is a literal backslash.
In some languages, to avoid confusion about whether the language is C-like or Python-like, and to avoid the problem with \Foo working but \foo not working, a backslash that doesn't start an escape sequence is illegal.

Interpolating a JSON string removes JSON quotations

I have the following two lines of code:
json_str = _cases.to_json
path += " #{USER} #{PASS} #{json_str}"
When I use the debugger, I noticed that json_str appears to be formatted as JSON:
"[["FMCE","Wiltone","Wiltone","04/10/2018","Marriage + - DOM"]]"
However, when I interpolate it into another string, the quotes are removed:
"node superuser 123456 [["FMCE","Wiltone","Wiltone","04/10/2018","Marriage + - DOM"]]"
Why does string interpolation remove the quotes from JSON string and how can I resolve this?
I did find one solution to the problem, which was manually escaping the string:
json_str = _cases.to_json.gsub('"','\"')
path += " #{USER} #{PASS} \"#{json_str}\""
So basically I escape the double quotes generated in the to_json call. Then I manually add two escaped quotes around the interpolated variable. This will produce a desired result:
node superuser 123456 "[[\"FMCE\",\"Wiltone\",\"Wiltone\",\"04/10/2018\",\"Marriage + - DOM\"]]"
Notice how the outer quotes around the collection are not escaped, but the strings inside the collection are escaped. That will enable JavaScript to parse it with JSON.parse.
It is important to note that in this part:
json_str = _cases.to_json.gsub('"','\"')
it is adding a LITERAL backslash. Not an escape sequence.
But in this part:
path += " #{USER} #{PASS} \"#{json_str}\""
The \" wrapping the interpolated variable is an escape sequence and NOT a literal backslash.
Why do you think the first and last quote marks are part of the string? They do not belong to the JSON format. Your program’s behavior looks correct to me.
(Or more precisely, your program seems to be doing exactly what you told it to. Whether your instructions are any good is a question I can’t answer without more context.)
It's hard to tell with the small sample, but it looks like you might be getting quotes from your debugger output. assuming the output of .to_json is a string (usually is), then "#{json_str}" should be exactly equal to json_str. If it isn't, that's a bug in ruby somehow (doubtful).
If you need the quotes, you need to either add them manually or escape the string using whatever escape function is appropriate for your use case. You could use .to_json as your escape function even ("#{json_str.to_json}", for example).

Should I encode curly or square brackets inside a JSON?

In fact, the title says it all. But, I'll go more into details:
I am sending a JSON string from my JS script to the server and vice versa. The JSON contains things as some content the user wrote into a textfield, but I know that some user will manage to break the JSON array this way sooner or later, so I decided to encode it with encodeURIComponent().
But I see, that when I try to encode curly brackets, that they aren't encoded at all. Is this going to be a problem?
More precisely, I'm afraid that if someone writes: } , {, the JSON will break. This shouldn't happen, since all of it is inside doublequotes like this: "} , {", and if a user write doublequotes or singlequotes they are going to be encoded, and from what I know, JSON should handle all of that just fine, but I am not entirely sure.
So, should I encode those brackets?
(Another thing is that the data is inserted into MySQL inside prepared statements, so that shouldn't be a problem, or I am wrong with that?)
A quick quote from the JSON specifications:
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.
As you can see in the image that follows the paragraph quoted above, any Unicode character except for ", \ and control characters is represented as-is; no escape is required.
why will brake? Your JSON string will be inside ""
So will be something like
{"postcontent": "Shouldnt { } be escaped"}