Spark File Format Escaping \n Loading CSV - csv

I'm reading a pipe-delimited, quote-qualified CSV data file using Spark. A block of text has a \n in it, and it's causing the read to corrupt. What I don't understand is that it's quote-qualified text, so surely the parser should just skip that!? The rows themselves are CR+LF delimited.
Anyhow, it doesn't. How do I get around this? I can cleanse them out on extract, but that doesn't seem that elegant to me.
This is what I'm using to load the data:
val sch = spark.table("db.mytable").schema
val df = spark.read
  .format("csv")
  .schema(sch)
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "PERMISSIVE")
  .option("quote", "\"")
  .load("/yadaydayda/mydata.txt")

Glad to know I'm not the only one who's dealt with this issue in Spark!
By default, Spark reads files line by line, so CSVs with newlines inside fields cause problems for the parser. Reading line by line makes it easier for Spark to handle large CSV files; scanning all of the content for quotes would significantly impair performance, and embedded newlines are unlikely to be an issue in most high-performance analytics workloads.
For cases where I knew newlines were a possibility, I've used a third-party CSV parsing library: run the CSV "lines" through it (which handles the newlines correctly), strip the newlines, write/cache the file somewhere, and read from that cached file. For a production use case, those files would be loaded into a database. For log files or anything else you don't want in a database, using Parquet like you suggested works pretty well, or really just enforcing the lack of newlines somewhere before the files get to Spark.
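To make the "parse properly, strip the newlines, write a cleaned copy" step concrete, here is a minimal sketch in plain Python using the standard csv module in place of a third-party parser. The function name strip_embedded_newlines and the sample data are made up for illustration, and replacing each embedded line break with a space is an assumption about what "strip" should mean for your data.

```python
import csv
import io


def strip_embedded_newlines(src, dst, delimiter="|", quotechar='"'):
    """Copy delimited rows from src to dst, replacing line breaks inside fields with spaces.

    Hypothetical helper: src and dst are text streams, e.g. files opened with newline="".
    """
    reader = csv.reader(src, delimiter=delimiter, quotechar=quotechar)
    writer = csv.writer(dst, delimiter=delimiter, quotechar=quotechar,
                        lineterminator="\r\n")
    rows = 0
    for row in reader:
        # The csv reader keeps a quoted field together even when it spans lines;
        # splitlines() + join then flattens any line breaks inside that field.
        cleaned = [" ".join(field.splitlines()) for field in row]
        writer.writerow(cleaned)
        rows += 1
    return rows


# In-memory demonstration with a quoted field that contains a \n.
raw = 'id|note\r\n1|"first line\nsecond line"\r\n2|plain\r\n'
out = io.StringIO()
row_count = strip_embedded_newlines(io.StringIO(raw, newline=""), out)
cleaned_text = out.getvalue()
```

With real files you would pass open(path, newline="") handles instead of the StringIO objects; the cleaned copy then has exactly one record per physical line, which is what the line-based reader expects.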

Got around this by initially stripping them on extract. The final solution I settled on, however, was to use Parquet format on extract; then all these problems just go away.

Related

Pyspark: how to read CSV file with additional lines

I have the following CSV file. It has some additional strings between the valid rows. Excel seems to do a good job when reading those (and just ignores the additional ones).
However, the story with spark is a bit different.
I have set it as spark.read.csv(path, header=True, multiLine=True, sep='|')
Is there some simple way to handle it?

Reading a .dat file in Julia, issues with variable delimiter spacing

I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column, containing all the data from each column concatenated into one string. I even tried to specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix
you can also pre-process the file turning multiple spaces into single spaces
That's probably the easiest way to do this and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
    open(csv_path, write=true) do io
        for line in eachline(dat_path)
            join(io, split(line), ',')
            println(io)
        end
    end
    return csv_path
end
function dat2csv(dat_path::AbstractString)
    base, ext = splitext(dat_path)
    ext == ".dat" ||
        throw(ArgumentError("file name doesn't end with `.dat`"))
    return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line from it at a time
eachline strips any trailing newline from each line, so each line will be a bunch of numbers separated by whitespace with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
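The other option mentioned above, rolling your own parser that splits each line and builds a matrix, can be sketched in the same spirit. This one is in Python rather than Julia, purely as an illustration of the idea; parse_whitespace_table is a made-up name, and it assumes every entry is a plain number as the question suggests.

```python
import io


def parse_whitespace_table(lines):
    """Turn lines of whitespace-separated numbers into a list of float rows."""
    rows = []
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:         # skip blank lines
            continue
        rows.append([float(x) for x in fields])
    return rows


# Made-up sample with irregular spacing, standing in for the .dat file.
sample = io.StringIO("  1.0   2.5e-3    -4\n 10.25  0.0      7\n\n")
matrix = parse_whitespace_table(sample)
```

For a real file you would pass an open file handle instead of the StringIO object. The sketch does not check that every row has the same number of columns, which you may want to add.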
Try using the readdlm function in DelimitedFiles library, and convert to DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

How to load CSV dataset with corrupted columns?

I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("table.csv")
After doing some validations, I found out that some ids were null because a column sometimes contains a carriage return. That shifted all the following columns, with a domino effect, corrupting all the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?
You seemed to have had a lot of luck with inferSchema that it worked fine (since it only reads few records to infer the schema) and so printSchema gives you a correct result.
Since the CSV export file is broken and assuming you want to process the file using Spark (given its size for example) read it using textFile and fix the ids. Save it as CSV format and load it back.
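As a rough idea of what "fixing" could look like, here is a single-machine sketch in plain Python (not Spark) that glues physical lines back together until a record has the expected number of fields. rejoin_broken_records and the sample rows are made up. It assumes fields are unquoted and never contain the delimiter, and it cannot detect a break inside the last column, so treat it as a starting point only.

```python
def rejoin_broken_records(lines, expected_fields, delimiter=","):
    """Merge physical lines until each record has the expected number of fields."""
    records = []
    pending = ""
    for line in lines:
        piece = line.rstrip("\r\n")
        # Continue the record in progress, or start a new one.
        pending = piece if not pending else pending + " " + piece
        if pending.count(delimiter) + 1 >= expected_fields:
            records.append(pending)
            pending = ""
    if pending:
        records.append(pending)  # keep an incomplete tail for inspection
    return records


# Made-up export where the third column of record 1 was split by a line break.
broken = [
    "id,name,comment,city\n",
    "1,Ann,likes\n",
    "tea,Oslo\n",
    "2,Bob,ok,Rome\n",
]
fixed = rejoin_broken_records(broken, expected_fields=4)
```

Doing the same thing on lines read with textFile is harder, because a broken record's pieces may land in different partitions, so for a file that fits on one machine it may be simpler to repair it before handing it to Spark.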
I'm not sure what version of spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep fields together that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")

How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that has commas as part of the data into MarkLogic using RecordLoader. The data loads, but MarkLogic treats commas that are part of the data as delimiters. I tried to escape the commas with backslashes, but that didn't work and the data remains dirty with the backslashes. I thought about replacing the data commas with other symbols so that I can change them back to commas after I load, but I don't know if there is a way to modify the data after I load, and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data and not as delimiters?
Thanks in advance.
RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.
Although you asked about RecordLoader, you could also use the MarkLogic Content Pump. See Creating Documents from Delimited Text Files.

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using Cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, and I need it to be an extended ASCII character. I tried various approaches, but they all fail.
The following works, but I need to use an extended ASCII character for my purpose:
copy (<columnnames>) from <filename> with delimiter='|' and quote = '"';
copy (<columnnames>) from <filename> with delimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
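For what it's worth, here is a small Python 3 sketch of what those dialect settings do to a value containing a double quote. It writes to an in-memory buffer instead of output.csv, and the sample rows are made up; the dialect dictionary is the one quoted from cqlsh above.

```python
import csv
import io

# Dialect settings quoted above from the cqlsh source.
csv_dialect_defaults = dict(delimiter=",", doublequote=False,
                            escapechar="\\", quotechar='"')

# Made-up rows: one value contains a double quote, one contains the delimiter.
rows = [["1", 'say "hi"'], ["2", "a,b"]]

buffer = io.StringIO()
csv.writer(buffer, **csv_dialect_defaults).writerows(rows)
written = buffer.getvalue()

# Reading back with the same dialect should recover the original values.
round_tripped = list(csv.reader(io.StringIO(written), **csv_dialect_defaults))
```

Because doublequote is False, an embedded " is written with a backslash in front of it rather than doubled, so a reader using the same settings can still tell where a field ends even though " is part of the data.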