Get Flat File Schema as CSV output when input data has newlines

Consider my input data as below:
<xmlnode>line1
line2
line3
</xmlnode>
Right now, I have a map which maps the input data to a flat file schema, and I am saving the flat file as CSV.
The issue is: if the input data contains newlines, the CSV output gets corrupted. The content of 'xmlnode' should go into one single CSV column.
Is there any setting on the flat file schema I need to use to handle this?

Create a functoid with code like the following:
// Strip carriage returns and turn line feeds into single spaces
return input.Replace("\r", "").Replace("\n", " ");
The idea is to replace any \r\n with a single space (and to handle the case where there is a newline with no carriage return). That should fix your problem.
If this is a problem that will occur routinely on multiple/all nodes from your input, then you might consider running that as a regular expression on the entire message as a string after mapping (rather than having every node pass through your scripting functoid).
As Dan suggested in the comments, double quotes are also required to keep data containing \n (newline) in a single cell of a CSV.
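As a quick illustration outside of BizTalk (a minimal Python sketch; the sample value and column name are made up for the example), flattening the newlines mirrors the functoid above, while quoting the field keeps the newlines inside one CSV cell:
import csv, io, re

value = "line1\nline2\nline3"

# Approach 1: strip the newlines before writing, mirroring the functoid above.
flattened = re.sub(r"\r?\n", " ", value)
print(flattened)            # line1 line2 line3

# Approach 2: keep the newlines and let csv.writer wrap the field in quotes,
# so the whole value stays in a single CSV cell.
buf = io.StringIO()
csv.writer(buf).writerow(["id1", value])
print(buf.getvalue())       # id1,"line1
                            # line2
                            # line3"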

You need to set the "Wrap Character" and "Wrap Character Type" settings in your flat file schema for that field to quote (") and 'Character' respectively. I've used this for the same issue.
Note: There is a "Default Wrap Character" and "Default Wrap Character Type" in the schema settings but BizTalk cleverly defaults the type on fields to "None" rather than "Default" so you still have to go and change the fields even if you set the default.

Related

Double quote handling when exporting JSON field with BigQuery

I am making use of the JSON datatype in BigQuery and I have a table that looks like this:
myStringField | myJSONField
----------------|----------------------------------
someStringValue | {"key1":"value1", "key1":"value2"}
In SQL, everything works fine. But, when it comes to exporting data, it gets messy. For instance, if I click the "Save results" button and if I choose the "CSV (local file)" option, I obtain the following content in my CSV:
myStringField,myJSONField
someStringValue,"{""key1"":""value1"", ""key1"":""value2""}"
As you can see, I get "double double quotes" inside my JSON and it makes things complicated to parse for the downstream system that receives the file.
I tried to fix it by using different combinations of JSON functions such as PARSE_JSON(), TO_JSON_STRING(), STRING() but nothing worked and, in some cases, it even made things worse ("triple double quotes").
Ideally, the expected output of my CSV should resemble this:
myStringField,myJSONField
someStringValue,{"key1":"value1", "key1":"value2"}
Any workaround?
According to the docs, exporting JSON fields to a CSV format has some limitations:
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
When you export a table in JSON format, the symbols <, >, and & are converted by using the unicode notation \uNNNN, where N is a hexadecimal digit. For example, profit&loss becomes profit\u0026loss. This unicode conversion is done to avoid security vulnerabilities.
Check out the export limitations here: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Regarding the export format you mentioned: that is the expected way to escape double quote characters in CSV, so this is the expected output.
The outer quotes are there because of the CSV encoding mechanism for strings, and every double quote inside that string is escaped with another double quote:
"{""key1"":""value1""}"
If you parse this CSV with any standard parser out there, this format should be supported with the right setup.
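For example (a hedged sketch, not BigQuery-specific), Python's standard csv module un-escapes the doubled quotes as soon as the exported row is read back:
import csv, io, json

raw = 'myStringField,myJSONField\r\nsomeStringValue,"{""key1"":""value1"", ""key1"":""value2""}"\r\n'
row = next(csv.DictReader(io.StringIO(raw)))
print(row["myJSONField"])              # {"key1":"value1", "key1":"value2"}
print(json.loads(row["myJSONField"]))  # {'key1': 'value2'} (duplicate keys: last one wins)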

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!

concern while importing/linking csv to access database

I have a CSV file with a comma as the delimiter, and a few of the data columns in the same file contain commas themselves.
Hence, while linking/importing the file, the data gets jumbled into the next column.
I have tried all possible means, like skipping columns, etc., but am not getting any fruitful results.
Please let me know if this can be handled through a VBA function in MS Access.
If the CSV file contains text fields that contain commas and are not surrounded by a text qualifier (usually ") then the file is malformed and cannot be parsed in a bulletproof way. That is,
1,Hello world!,1.414
2,"Goodbye, cruel world!",3.142
can be reliably parsed, but
1,Hello world!,1.414
2,Goodbye, cruel world!,3.142
cannot. However, if you have additional information about the file, e.g., that it should contain three columns
a Long Integer column,
a Short Text column, and
a Double column
then your VBA code could read the file line-by-line and split the string on commas into an array. The first array element would be the Long Integer, the last array element would be the Double value, and the remaining "columns" in between could be concatenated together to reconstruct the string.
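As a sketch of that reconstruction (shown here in Python rather than VBA, purely to illustrate the split-and-rejoin idea for the three-column example above):
# Assumes the row layout is exactly: Long Integer, Short Text, Double.
def parse_line(line):
    parts = line.rstrip("\n").split(",")
    first = int(parts[0])            # Long Integer column
    last = float(parts[-1])          # Double column
    middle = ",".join(parts[1:-1])   # everything in between is the text field
    return first, middle, last

print(parse_line("2,Goodbye, cruel world!,3.142"))
# (2, 'Goodbye, cruel world!', 3.142)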
As you can imagine, that approach could easily be confounded (e.g., if there was more than one text field that might contain commas). Therefore it is not particularly appealing.
(Also worth noting is that the CSV parser in Access has never been able to properly handle text fields that contain line breaks, but at least we can import those CSV files into Excel and then import into Access from the Excel file.)
TL;DR - If the CSV file contains unqualified text containing commas then the system that produced it is broken and should be fixed.

Reading csv without specifying enclosure characters in Weka

I have a dataset that I want to open in Weka, so I converted it to a CSV file. (The file contains some text including commas/apostrophes/quotation marks, while its separator is the pipe character.)
When I try to read this CSV file, in the options window I specify the pipe (|) as my fieldSeparator, leave enclosureCharacters empty, and don't touch the rest of the options.
Then I get this error:
File not recognised as an 'CSV data files' file. Reason: Enclosures
can only be single characters.
It seems like Weka's CSV loader does not accept leaving the enclosureCharacters field empty. What can I write into this field? I think my file does not use enclosures for its text data.

csv parsing, quotes as characters

I have a csv file that contains fields such as this:
""#33CCFF"
I would imagine that should be the text value:
"#33CCFF
But both excel and open office calc will display:
#33CCFF"
What rule of csv am I missing?
When Excel parses the value, it does not first remove the outer quotes and then read what's in between them. Even if it did, what would it do with the remaining " in front of #? It cannot show this as you expect ("#33CCFF), because for it to appear like that, the double quote would have to have been escaped by doubling it. (That might be the 'CSV' rule you are missing.)
Excel reads the value from left to right, interpreting "" as a single ". It then reads on, finds an unexpected double quote at the end, 'panics', and simply displays it.
The person/application creating the CSV file made the mistake of adding encapsulation without escaping the encapsulation character within the values, so your data is malformed. With encapsulation, the value should be """#33CCFF"; without encapsulation, it should be "#33CCFF.
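As a short illustration with Python's csv module (a sketch, assuming the producer intended the text "#33CCFF), the correctly encapsulated form doubles the inner quote, and reading it back recovers the intended value:
import csv, io

value = '"#33CCFF'   # the text the producer presumably intended

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([value])
print(buf.getvalue().strip())                         # """#33CCFF"
print(next(csv.reader(io.StringIO(buf.getvalue()))))  # ['"#33CCFF']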