536381,22411,JUMBO SHOPPER VINTAGE RED PAISLEY,10,12/1/2010 9:41,1.95,15311,United Kingdom
"536381,82567,""AIRLINE LOUNGE,METAL SIGN"",2,12/1/2010 9:41,2.1,15311,United Kingdom"
536381,21672,WHITE SPOT RED CERAMIC DRAWER KNOB,6,12/1/2010 9:41,1.25,15311,United Kingdom
These lines are examples of rows in a CSV file.
I'm trying to read it in Databricks, using:
df = spark.read.csv('file.csv', sep=',', inferSchema='true', quote='"')
but the middle line (and other similar lines) is not being split into the right columns because of the comma inside the quoted string. How can I work around this?
Set the quote to '""':
df = spark.read.csv('file.csv', sep=',', inferSchema='true', quote='""')
It looks like your data has doubled double quotes - so when it's being read, the parser sees them as marking the start and end of the string.
Edit: I'm also assuming the problem comes in with this part:
""AIRLINE LOUNGE,METAL SIGN""
This is not only related to Excel; I have the same issue when retrieving data from a source into Azure Synapse. A comma within one column causes the process to enclose the entire column value in double quotes, and any double quotes already in the data get doubled, as shown above in the second line (see Retrieve CSV format over https).
Related
Summary:
Original question from a year ago: How to escape double quotes within data when it is already enclosed by double quotes
I have the same need as the original poster: I have a CSV file that matches the RFC 4180 CSV spec (my data has double quotes that are properly qualified, my data has commas in it, and my data also has line feeds in it). Excel is able to read it just fine because the file matches the spec and Excel properly implements the spec.
Unfortunately, I can't figure out how to import files that match the CSV RFC 4180 spec into Snowflake. Any ideas?
Details:
We've been creating CSV files that match the RFC 4180 spec for years in order to maximize compatibility across applications and OSes.
Here is a sample of what my data looks like:
KEY,NAME,DESCRIPTION
1,AFRICA,This is a simple description
2,NORTH AMERICA,"This description has a comma, so I have to wrap the whole field in double quotes"
3,ASIA,"This description has ""double quotes"" in it, so I have to qualify the double quotes and wrap the field in double quotes"
4,EUROPE,"This field has a carriage
return so it is wrapped in double quotes"
5,MIDDLE EAST,Simple description with single ' quote
When opening this file in Excel, Excel properly reads the rows and columns (because Excel follows the RFC spec).
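As a quick sanity check outside Excel: Python's csv module follows the same RFC 4180 quoting rules by default (doubled quotes, embedded commas, embedded line breaks), so it should read this file correctly too. A minimal sketch, with 'sample.csv' as a hypothetical filename for the data above:

import csv

# newline='' lets the csv module handle line breaks inside quoted fields
with open('sample.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)
# The EUROPE row comes back as a single 3-field record even though it spans
# two physical lines, because the line break is inside a quoted field.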
In order to import this file into Snowflake, I first try to create a file format with the following settings:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: 1
Field optionally enclosed by: Double Quote
Escape Character: "
Escape Unenclosed Field: None
But when I go to save the file format, I get this error:
Unable to create file format "CSV_SPEC".
SQL compilation error: value ["] for parameter 'FIELD_OPTIONALLY_ENCLOSED_BY' conflict with parameter 'ESCAPE'
It would appear that I'm missing something? I would think that I must be getting the Snowflake configuration wrong.
While writing up this question and testing all the scenarios I could think of, I found a file format that seems to work:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: 1
Field optionally enclosed by: Double Quote
Escape Character: None
Escape Unenclosed Field: None
Same information, but in SQL form:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
I don't know why this works, but it does, so there you go. (Presumably, with FIELD_OPTIONALLY_ENCLOSED_BY set, Snowflake already interprets a doubled quote inside a quoted field as an escaped quote, so no separate ESCAPE character is needed - and setting one to '"' is what caused the conflict.)
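For reference, a hedged sketch of creating the same working format through the Snowflake Python connector - account, credentials, and object names are all placeholders, not values from the question:

import snowflake.connector

conn = snowflake.connector.connect(
    account='MY_ACCOUNT', user='MY_USER', password='MY_PASSWORD',  # placeholders
    database='DB_NAME', schema='SCHEMA_NAME',
)
# Same settings as the working format above: quote-enclosed fields,
# no separate escape character.
conn.cursor().execute(r"""
    CREATE OR REPLACE FILE FORMAT CSV_SPEC3
      TYPE = CSV
      FIELD_DELIMITER = ','
      RECORD_DELIMITER = '\n'
      SKIP_HEADER = 1
      FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
      ESCAPE = 'NONE'
      ESCAPE_UNENCLOSED_FIELD = 'NONE'
""")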
I have a report which is just a simple SELECT statement that generates a list of columns full of data. I want this data to be exported as a CSV file, with each datum enclosed in " quotation marks. I have created a table and used this as my expression:
=""""+Fields!Activity_Code.Value+""""
When I run the report inside ReportBuilder 3.0, I get exactly what I'm looking for:
No headers, and each datum has quotation marks. Perfect.
But when I hit export to CSV and then open the file in Notepad, I see this:
The headers are in there where they shouldn't be, and each datum has three quotation marks on each side. What am I doing wrong?
This is perfectly normal.
When csv fields contain a separator or double quotes, the fields are enclosed in double quotes and the quotes inside the fields are escaped with another quote.
Example - the fields:
123
"27" monitor"
456
become:
123,"""27"" monitor""",456
or:
"123","""27"" monitor""","456"
A csv reader/parser should handle this correctly when reading the data (or you could provide a parameter telling the parser that the fields are quoted).
On the other hand, if you just want your fields to be quoted inside the csv (with the quotes not visible after opening the file), you can tell the csv generator to quote the fields (or in this case do nothing, since the generator seems to be adding quotes already).
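To make the two behaviours concrete, here is a small Python sketch using the csv module as the generator - it follows the same quoting rules described above, though it is not ReportBuilder's actual exporter:

import csv
import sys

fields = ['123', '"27" monitor"', '456']

w = csv.writer(sys.stdout)  # default: quote only fields that need it
w.writerow(fields)          # prints: 123,"""27"" monitor""",456

w = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)  # quote every field
w.writerow(fields)          # prints: "123","""27"" monitor""","456"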
I am trying to convert a CSV to JSON in Power Automate. I am stuck because some of my CSV values contain commas within double quotes. I'm splitting each line with
split(outputs('CurrentTableRow'), ',')
which of course splits the value between the quotes.
I have not figured out a way of replacing the commas. I've been browsing forums for days...
I need to replace the commas within the double quoted values or prevent the split of values within the double quotes.
My solution to bypass this problem was the decodeBase64() function:
replace(body('Json_object')?['Description'],decodeBase64('LA=='), '*$*')
'LA==' is the Base64 encoding of the comma character (,).
(Screenshots showed the expression box before the run and the result after running the flow.)
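The same masking idea, sketched in Python for clarity - the '*$*' sentinel matches the one in the expression above, and this simplified version ignores doubled quotes inside fields:

def mask_quoted_commas(line, sentinel='*$*'):
    out, in_quotes = [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on each quote
            continue                   # drop the quote itself
        out.append(sentinel if (ch == ',' and in_quotes) else ch)
    return ''.join(out)

row = '1234,"ABC, text",5678'
print(mask_quoted_commas(row).split(','))
# ['1234', 'ABC*$* text', '5678']  - safe to split, then restore the commas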
CSV output is generated from a Java Map in DataWeave.
The output response adds "\" before every "," present within the values.
All the map values are added inside double quotes, e.g. map.put("key", "key-Value");
Response Received :
Header1, Header2
1234,ABC \,text
7890,XYZ \,text
Expected Response :
Header1, Header2
1234,ABC ,text
7890,XYZ ,text
Header2 should contain ABC,text as the value, without enclosing double quotes.
Tried using %output application/csv escape=" ", but this adds an extra space to each blank space in the values, i.e. if the value is "ABC XYZ" then the output is "ABC  XYZ" (2 spaces in between).
Any suggestion will be helpful...
Embedded commas in data in a comma separated value file must be escaped or there is no way to tell those values apart from field separators. If you want some way to have the commas in your CSV file without escaping them, then you need to use a separator other than a comma.
Your Expected Response as shown would not be valid: you have a two-field header, but data lines that would be interpreted as having three fields - not two fields of which one contains an embedded comma, which is what the data actually has and what the Response Received table shows.
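Here is a sketch of the two standard remedies using Python's csv module, just to illustrate the rule: quote fields that contain the separator, or use a separator not present in the data (the DataWeave fix in the next answer takes the quoting route):

import csv
import sys

rows = [['1234', 'ABC,text'], ['7890', 'XYZ,text']]

w = csv.writer(sys.stdout)                 # option 1: quote fields containing the separator
w.writerows(rows)                          # prints: 1234,"ABC,text" ...

w = csv.writer(sys.stdout, delimiter='|')  # option 2: use a different separator
w.writerows(rows)                          # prints: 1234|ABC,text ...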
I've got the same scenario, where we have a comma in the data itself, as in your Header2 case.
To solve the issue, I just added the line below:
%output application/csv quoteValues=true
The above solved my problem, and we got the expected output.
I have data which resembles the following:
"D.STEIN","DS","01","ALTRES","TTTTTTFFTT"
"D.STEIN","DS","01","APCASH","TTTTTTFFTT"
"D.STEIN","DS","01","APINH","TTTTTTFFTT"
"D.STEIN","DS","01","APINV","TTTTTTFFTT"
"D.STEIN","DS","01","APMISC","TTTTTTFFTT"
"D.STEIN","DS","01","APPCHK","TTTTTTFFTT"
"D.STEIN","DS","01","APWLNK","TTTTTTFFTT"
"D.STEIN","DS","01","ARCOM","TTTTTTFFTT"
"D.STEIN","DS","01","ARINV","TTTTTTFFTT"
I've used a Flat File Source Editor to load the data. What is the easiest way to remove all of the double quotes?
Further searching revealed that I should use the Text Qualifier on the General Tab of the Flat File Source.
Flat file content when viewed in a Notepad++. CRLF denotes that the lines end with Carriage Return and Line Feed.
On the flat file connection manager, enter the double quote character in the Text qualifier text box.
Once the text qualifier is set, the data is parsed correctly, with the enclosing quotes stripped from the values.
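Outside SSIS, the Text qualifier corresponds to the quotechar argument in Python's csv module; the qualifier is stripped from the parsed values in the same way:

import csv

line = '"D.STEIN","DS","01","ALTRES","TTTTTTFFTT"'
print(next(csv.reader([line], quotechar='"')))
# prints: ['D.STEIN', 'DS', '01', 'ALTRES', 'TTTTTTFFTT']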
While loading a CSV with double quotes and commas, there is one limitation: extra double quotes are added and the data also ends up enclosed in double quotes, which you can check in the preview of the source file.
So add a Derived Column task and use the expression below:
REPLACE(REPLACE(REPLACE(RIGHT(SUBSTRING(TRIM(COL2),1,LEN(COL2) - 1),LEN(COL2) - 2)," ","#"),"\"\"","\""),"#"," ")
The RIGHT(SUBSTRING(...)) part strips the double quotes enclosing the data; the nested REPLACE calls temporarily mask the spaces, collapse the doubled quotes inside the value back to single quotes, and then restore the spaces.
Try this and let me know if it helps:
SUBSTRING([column 5], 2, LEN([column 5]) - 2)
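For reference, the same strip-the-first-and-last-character idea as a Python one-liner (note it does not un-double any quotes inside the value):

value = '"APCASH"'
print(value[1:-1])  # prints: APCASH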
I would rather use the following statement:
REPLACE(REPLACE(REPLACE(ColumnName, '""', '[YourOwnuniqueString]'), '"', ''), '[YourOwnuniqueString]', '"')
Note: please make sure your YourOwnuniqueString is unique and not used anywhere in the column data, e.g. SQL#RT2#myCode (it is case sensitive).
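And the same placeholder trick sketched in Python - the sentinel is the example string from the note above, and must never occur in the real data:

SENTINEL = 'SQL#RT2#myCode'  # must be unique within the data

def unquote(value):
    return (value.replace('""', SENTINEL)  # protect the escaped quotes
                 .replace('"', '')         # strip the remaining (enclosing) quotes
                 .replace(SENTINEL, '"'))  # restore escaped quotes as literal quotes

print(unquote('"He said ""hi"" to me"'))   # prints: He said "hi" to me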