Double quote handling when exporting JSON field with BigQuery

I am making use of the JSON datatype in BigQuery and I have a table that looks like this:
myStringField   | myJSONField
----------------|-----------------------------------
someStringValue | {"key1":"value1", "key2":"value2"}
In SQL, everything works fine. But, when it comes to exporting data, it gets messy. For instance, if I click the "Save results" button and if I choose the "CSV (local file)" option, I obtain the following content in my CSV:
myStringField,myJSONField
someStringValue,"{""key1"":""value1"", ""key1"":""value2""}"
As you can see, I get "double double quotes" inside my JSON and it makes things complicated to parse for the downstream system that receives the file.
I tried to fix it by using different combinations of JSON functions such as PARSE_JSON(), TO_JSON_STRING(), STRING() but nothing worked and, in some cases, it even made things worse ("triple double quotes").
Ideally, the expected output of my CSV should resemble this:
myStringField,myJSONField
someStringValue,{"key1":"value1", "key1":"value2"}
Any workaround?

According to the docs, exporting JSON fields to a CSV format has some limitations:
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
When you export a table in JSON format, the symbols <, >, and & are converted by using the unicode notation \uNNNN, where N is a hexadecimal digit. For example, profit&loss becomes profit\u0026loss. This unicode conversion is done to avoid security vulnerabilities.
Check out the export limitations here: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Regarding the export format you mentioned, that is the expected way to escape double quote characters in CSV, so this is the expected output.
The outer quotes are there because of the CSV encoding rules for strings, and every double quote inside that string is escaped with another double quote:
"{""key1"":""value1""}"
Any CSV parser out there should handle this format with the right settings.
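For example, Python's standard csv module (used here purely as an illustration; any RFC 4180-aware reader behaves the same way) undoes the doubled quotes on read, after which the JSON column parses cleanly:
import csv
import io
import json

# The exported file content, as produced by the "Save results" CSV option.
exported = (
    'myStringField,myJSONField\n'
    'someStringValue,"{""key1"":""value1"", ""key2"":""value2""}"\n'
)

for row in csv.DictReader(io.StringIO(exported)):
    print(row["myJSONField"])             # {"key1":"value1", "key2":"value2"}
    parsed = json.loads(row["myJSONField"])
    print(parsed["key2"])                 # value2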

Related

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!

SPSS Syntax to import RFC 4180 CSV file with escaped double quotes

How do I read an RFC4180-standard CSV file into SPSS? Specifically, how to handle string values that have embedded double quotes which are (properly) escaped with a second double quote?
Here's one instance of a record with a problematic value:
2985909844,,3,3,3,3,3,3,1,2,2,"I recall an ad for ""RackSpace"", but I don't recall if this was here or in another page.",200,1,1,1,0,1,0,Often
The SPSS syntax I used is as follows:
GET DATA
/TYPE=TXT
/FILE="/Users/pieter/Work/Stackoverflow/2013_StackOverflowRecoded.csv"
/IMPORTCASE=ALL
/ARRANGEMENT=DELIMITED
/DELCASE=LINE
/FIRSTCASE=2
/DELIMITERS=","
/QUALIFIER='"'
/VARIABLES= ... list of column names...
The import succeeds, but gets off track and throws warnings after encountering such values.
I'm afraid this is a bug in SPSS and therefore not possible to solve.
You might want to ask the IBM Support team about this issue and post their answer here, if you find it helpful.
One workaround would be to change the escaped double quotes in your *.csv file(s) to some other quote type. This should be only a little work if you use an advanced text editor such as Notepad++ or the "sed" command line tool on UNIX-like operating systems.
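If you script that workaround, it is safer to let a CSV parser undo the "" escaping and then swap the embedded quotes, rather than doing a blind find-and-replace; a minimal sketch in Python (file names are placeholders):
import csv

src = "2013_StackOverflowRecoded.csv"          # placeholder input file
dst = "2013_StackOverflowRecoded_clean.csv"    # placeholder output file

with open(src, newline="", encoding="utf-8") as fin, \
     open(dst, "w", newline="", encoding="utf-8") as fout:
    reader = csv.reader(fin)   # understands the RFC 4180 "" escaping
    writer = csv.writer(fout)
    for record in reader:
        # Replace embedded double quotes with single quotes so SPSS never
        # sees a quote character inside a quoted field.
        writer.writerow([field.replace('"', "'") for field in record])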
Trying an example in the current version of Statistics (22), doubled identifiers are handled correctly. However, if you generate the syntax with the Text Wizard, the field widths in the generated syntax are too short, so you would need to increase them.

Reading unescaped backslashes in JSON into R

I'm trying to read some data from the Facebook Graph API Explorer into R to do some text analysis. However, it looks like there are unescaped backslashes in the JSON feed, which is causing rjson to barf. The following is a minimal example of the kind of input that's causing problems.
library(rjson)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\video"}]}'
fromJSON(txt)
(Note that the double backslashes at \\" and \\video will convert to single backslashes after parsing, which is what's in my actual data.)
I also tried the RJSONIO package which also gave errors, and even crashed R at times.
Has anyone come across this problem before? Is there a way to fix this short of manually hunting down every error that crops up? There's potentially megabytes of JSON being parsed, and the error messages aren't very informative about where exactly the problematic input is.
Just replace backslashes that aren't escaping double quotes, tabs or newlines with double backslashes.
In the regular expression, '\\\\' is converted to one backslash (two levels of escaping are needed, one for R, one for the regular expression engine). We need the perl regex engine in order to use lookahead.
library(stringr)
txt2 <- str_replace_all(txt, perl('\\\\(?![tn"])'), '\\\\\\\\')
fromJSON(txt2)
The problem is that you are trying to parse invalid JSON:
library(jsonlite)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\video"}]}'
validate(txt)
The problem is the picture\\video part because \v is not a valid JSON escape sequence, even though it is a valid escape sequence in R and some other languages. Perhaps you mean:
library(jsonlite)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\/video"}]}'
validate(txt)
fromJSON(txt)
Either way, the problem is at the JSON data source that is generating invalid JSON. If this data really comes from Facebook, you found a bug in their API. But more likely you are not retrieving it correctly.

Choosing between tsv and csv

I have a program that outputs a table, and I was wondering if there are any advantages/disadvantages between the csv and tsv formats.
TSV is very efficient for JavaScript/Perl/Python to process without losing any typing information, and it is also easy for humans to read. The format has been supported in 4store since its public release, and it's reasonably widely used.
The way I look at it is: CSV is for loading into spreadsheets, TSV is for processing by bespoke software.
You can see the technical specification of each format here.
The choice depends on the application. In a nutshell, if your fields don't contain commas, use CSV; otherwise TSV is the way to go.
TL;DR
In both formats, the problem arises when the delimiter can appear within a field: you then need some way to indicate that the character is part of the value rather than a field separator, which can be somewhat painful.
For example, storing the names "Kalman, Rudolf", "von Neumann, John" and "Gabor, Dennis" naively in CSV gives: Kalman, Rudolf, von Neumann, John, Gabor, Dennis
Some basic approaches are:
Delete all the delimiters that appear within the field.
E.g. Kalman Rudolf, von Neumann John, Gabor Dennis
Escape the character (usually by prepending a backslash \).
E.g. Kalman\, Rudolf, von Neumann\, John, Gabor\, Dennis
Enclose each field in another character (usually double quotes "), as shown in the sketch after this list.
E.g. "Kalman, Rudolf", "von Neumann, John", "Gabor, Dennis"
CSV
The fields are separated by a comma ,.
For example:
Name,Score,Country
Peter,156,GB
Piero,89,IT
Pedro,31415,ES
Advantages:
It is more generic and useful when sharing with non-technical people, as most software packages can read it without playing with the settings.
Disadvantages:
Escaping the comma within the fields can be frustrating because not everybody follows the standards.
All the extra escaping characters and quotes add weight to the final file size.
TSV
The fields are separated by a tab character: <TAB> or \t.
For example:
Name<TAB>Score<TAB>Country
Peter<TAB>156<TAB>GB
Piero<TAB>89<TAB>IT
Pedro<TAB>31415<TAB>ES
Advantages:
It is not necessary to escape the delimiter, as it is unusual to have the tab character within a field; if one does appear, it has to be removed.
Disadvantages:
It is less widespread.
TSV-utils makes an interesting comparison, copied below. In a nutshell, use TSV.
Comparing TSV and CSV formats
The differences between TSV and CSV formats can be confusing. The obvious distinction is the default field delimiter: TSV uses TAB, CSV uses comma. Both use newline as the record delimiter.
By itself, using different field delimiters is not especially significant. Far more important is the approach to delimiters occurring in the data. CSV uses an escape syntax to represent comma and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data.
The escape syntax enables CSV to fully represent common written text. This is a good fit for human-edited documents, notably spreadsheets. This generality has a cost: reading it requires programs to parse the escape syntax. While not overly difficult, it is still easy to do incorrectly, especially when writing one-off programs. It is good practice to use a CSV parser when processing CSV files. Traditional Unix tools like cut, sort, awk, and diff do not process CSV escapes; alternate tools are needed.
By contrast, parsing TSV data is simple. Records can be read using the typical readline routines found in most programming languages. The fields in each record can be found using split routines. Unix utilities can be called by providing the correct field delimiter, e.g. awk -F "\t", sort -t $'\t'. No special parser is needed. This is much more reliable. It is also faster, no CPU time is used parsing the escape syntax.
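In code, "readline plus split" really is the whole TSV parser; a sketch in Python (the file name is a placeholder):
# Minimal TSV reader: one record per line, fields split on TAB.
# No quote or escape handling is needed, because TSV disallows
# TABs and newlines inside fields.
with open("data.tsv", encoding="utf-8") as f:
    header = f.readline().rstrip("\n").split("\t")
    for line in f:
        record = dict(zip(header, line.rstrip("\n").split("\t")))
        # ... process record ...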
The speed advantages are especially pronounced for record oriented operations. Record counts (wc -l), deduplication (uniq, tsv-uniq), file splitting (head, tail, split), shuffling (GNU shuf, tsv-sample), etc. TSV is faster because record boundaries can be found using highly optimized newline search routines (e.g. memchr). Identifying CSV record boundaries requires fully parsing each record.
These characteristics make the TSV format well suited for the large tabular data sets common in data mining and machine learning environments. These data sets rarely need TAB and newline characters in the fields.
The most common CSV escape format uses quotes to delimit fields containing delimiters. Quotes must also be escaped; this is done by using a pair of quotes to represent a single quote. Consider the data in this table:
Field-1 | Field-2             | Field-3
--------|---------------------|--------
abc     | hello, world!       | def
ghi     | Say "hello, world!" | jkl
In Field-2, the first value contains a comma, and the second value contains both quotes and a comma. Here is the CSV representation, using escapes to represent commas and quotes in the data:
Field-1,Field-2,Field-3
abc,"hello, world!",def
ghi,"Say ""hello, world!""",jkl
In the above example, only fields with delimiters are quoted. It is also common to quote all fields whether or not they contain delimiters. The following CSV file is equivalent:
"Field-1","Field-2","Field-3"
"abc","hello, world!","def"
"ghi","Say ""hello, world!""","jkl"
Here's the same data in TSV. It is much simpler as no escapes are involved:
Field-1<TAB>Field-2<TAB>Field-3
abc<TAB>hello, world!<TAB>def
ghi<TAB>Say "hello, world!"<TAB>jkl
The similarity between TSV and CSV can lead to confusion about which tools are appropriate. Furthering this confusion, it is somewhat common to have data files using comma as the field delimiter, but without comma, quote, or newlines in the data. No CSV escapes are needed in these files, with the implication that traditional Unix tools like cut and awk can be used to process these files. Such files are sometimes referred to as "simple CSV". They are equivalent to TSV files with comma as a field delimiter. Traditional Unix tools and tsv-utils tools can process these files correctly by specifying the field delimiter. However, "simple csv" is a very ad hoc and ill defined notion. A simple precaution when working with these files is to run a CSV-to-TSV converter like csv2tsv prior to other processing steps.
Note that many CSV-to-TSV conversion tools don't actually remove the CSV escapes. Instead, many tools replace comma with TAB as the field delimiter, but still use CSV escapes to represent TAB, newline, and quote characters in the data. Such data cannot be reliably processed by Unix tools like sort, awk, and cut. The csv2tsv tool in tsv-utils avoids escapes by replacing TAB and newline with a space (customizable). This works well in the vast majority of data mining scenarios.
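The escape-free conversion described above is easy to sketch by hand; the following is an illustration of the approach in Python, not the csv2tsv tool itself:
import csv
import sys

# Read CSV on stdin, write escape-free TSV on stdout: the csv module
# undoes the CSV escapes, then any TAB or newline inside a field is
# replaced with a space so the output needs no escaping at all.
for record in csv.reader(sys.stdin):
    cleaned = [f.replace("\t", " ").replace("\r", " ").replace("\n", " ")
               for f in record]
    print("\t".join(cleaned))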
To see what a specific CSV-to-TSV conversion tool does, convert CSV data containing quotes, commas, TABs, newlines, and double-quoted fields. For example:
$ echo $'Line,Field1,Field2\n1,"Comma: |,|","Quote: |""|"\n"2","TAB: |\t|","Newline: |\n|"' | <csv-to-tsv-converter>
Approaches that generate CSV escapes will enclose a number of the output fields in double quotes.
References:
Wikipedia: Tab-separated values - Useful description of TSV format.
IANA TSV specification - Formal definition of the tab-separated-values mime type.
Wikipedia: Comma-separated-values - Describes CSV and related formats.
RFC 4180 - IETF CSV format description, the closest thing to an actual standard for CSV.
brendano/tsvutils: The philosophy of tsvutils - Brendan O'Connor's discussion of the rationale for using TSV format in his open source toolkit.
So You Want To Write Your Own CSV code? - Thomas Burette's humorous, and accurate, blog post describing the troubles with ad-hoc CSV parsing. Of course, you could use TSV and avoid these problems!
You can use any delimiter you want, but tabs and commas are supported by many applications, including Excel, MySQL, PostgreSQL. Commas are common in text fields, so if you escape them, more of them need to be escaped. If you don't escape them and your fields might contain commas, then you can't confidently run "sort -k2,4" on your file. You might need to escape some characters in fields anyway (null bytes, newlines, etc.).
For these reasons and more, my preference is to use TSVs, and escape tabs, null bytes, and newlines within fields. Additionally, it is usually easier to work with TSVs: just split each line by the tab delimiter. With CSVs there are quoted fields, possibly fields with newlines, etc. I only use CSVs when I'm forced to.
I think that, generally, CSV is supported more often than the TSV format.

Get Flat File Schema as CSV output when input data has newlines

Consider my input data as below:
<xmlnode>line1
line2
line3
</xmlnode>
Right now, I have a map which maps input data to a flatfile schema. I am saving the flatfile as CSV.
The issue is: if the input data has newlines, the CSV format gets corrupted. The content of 'xmlnode' should go into a single CSV column.
Is there any setting in the flat file schema to handle this?
Create a functoid with code like the following:
return input.Replace("\r", "").Replace("\n", " ");
The idea is to replace any \r\n with a single space (and handle cases where there's a newline with no carriage return). Should fix your problem.
If this is a problem that will occur routinely on multiple/all nodes from your input, then you might consider running that as a regular expression on the entire message as a string after mapping (rather than having every node pass through your scripting functoid).
As Dan suggested in the comments, double quotes are also required to save data containing \n (newline) in a single CSV cell.
You need to set the "Wrap Character" and "Wrap Character Type" settings in your flat file schema for that field to quote (") and 'Character' respectively. I've used this for the same issue.
Note: There is a "Default Wrap Character" and "Default Wrap Character Type" in the schema settings but BizTalk cleverly defaults the type on fields to "None" rather than "Default" so you still have to go and change the fields even if you set the default.
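The wrap character does for the flat file what a generic CSV writer does automatically for embedded newlines; a small illustration in Python (outside BizTalk):
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["line1\nline2\nline3", "otherColumn"])
print(buf.getvalue())
# "line1
# line2
# line3",otherColumn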