PySpark: how to read a CSV file with additional lines

So I have the following CSV file. It has some additional strings between the valid rows. Excel seems to do a good job reading it (and just ignores the extra lines).
However, the story with Spark is a bit different.
I have set it up as spark.read.csv(path, header=True, multiLine=True, sep='|')
Is there some simple way to handle it?
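One approach is to pre-filter the stray lines before the CSV parser ever sees them. Here is a minimal sketch, assuming the valid rows can be recognized by their field count (the expected column count and the placeholder path are assumptions, not from the original file):

from pyspark.sql import SparkSession
from pyspark.sql.functions import size, split

spark = SparkSession.builder.getOrCreate()

# Hypothetical: valid rows are pipe-delimited with exactly 6 fields.
EXPECTED_COLS = 6
path = "/path/to/file.csv"  # placeholder

# Read the raw file as plain text so the extra strings don't confuse the parser.
raw = spark.read.text(path)

# Keep only lines that split into the expected number of fields,
# then hand the surviving lines to the CSV reader.
clean = raw.filter(size(split("value", r"\|")) == EXPECTED_COLS)
df = spark.read.csv(clean.rdd.map(lambda r: r.value), header=True, sep="|")

Alternatively, if you pass an explicit schema, mode='DROPMALFORMED' will silently drop rows that don't match the expected column count.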

Related

Spark File Format Escaping \n Loading CSV

I'm reading a pipe-delimited CSV data file using Spark. It's quote qualified. A block of text has a \n in it and it's causing the read to corrupt. What I don't understand is that it's quote-qualified text, so surely it should just skip that!? The rows themselves are CR+LF delimited.
Anyhow, it's not. How do I get around this? I can cleanse them out on extract, but that doesn't seem very elegant to me.
This is what I'm using to load the data:
val sch = spark.table("db.mytable").schema
val df = spark.read
.format("csv")
.schema(sch)
.option("header", "true")
.option("delimiter", "|")
.option("mode", "PERMISSIVE")
.option("quote", "\"")
.load("/yadaydayda/mydata.txt")
Glad to know I'm not the only one who's dealt with this issue in Spark!
Spark reads files line by line, so CSVs with embedded newlines cause problems for the parser. Reading line by line makes it easier for Spark to handle large CSV files than scanning all of the content for quotes, which would significantly hurt performance for something that usually isn't an issue in high-performing analytics workloads.
For cases where I knew newlines were a possibility, I've used a third-party CSV parsing library, run the CSV "lines" through that (which handles the newlines correctly), stripped the newlines, written/cached the file somewhere, and read from that cached file. For a production use case, those files would be loaded into a database; for log files or other data you don't want in a database, using Parquet like you suggested works pretty well, or really just enforcing the lack of newlines somewhere before the files get to Spark.
Got around this by initially stripping them on extract. The final solution I settled on, however, was to use a Parquet format on extract, and then all these problems just go away.
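For what it's worth, newer Spark releases (2.2+) can also parse quoted newlines directly via the multiLine option, which is what the question at the top of this page uses. A minimal PySpark sketch of that route, ending with a Parquet write so later reads avoid the problem entirely (the Parquet path is a placeholder):

# Sketch: read quote-qualified, pipe-delimited data where quoted fields may
# contain newlines, then persist as Parquet.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("delimiter", "|")
      .option("quote", '"')
      .option("multiLine", "true")  # requires Spark 2.2+
      .load("/yadaydayda/mydata.txt"))

df.write.mode("overwrite").parquet("/yadaydayda/mydata.parquet")

Note that multiLine files cannot be split across tasks, so converting once to Parquet and reading that afterwards tends to be the better long-term setup.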

split big json files into small pieces without breaking the format

I'm using spark.read() to read a big JSON file on Databricks, and it failed with "spark driver has stopped unexpectedly and is restarting" after running for a long time. I assumed it is because the file is too big, so I decided to split it. So I used the command:
split -b 100m -a 1 test.json
This actually split my file into small pieces and I can now read them on Databricks. But then I found that what I got is a set of null values. I think that is because I split the file only by size, so some of the pieces are no longer valid JSON. For example, I might get something like this at the end of a file:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the JSON file into small pieces without breaking the JSON format?
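If the file is line-delimited JSON (one object per line, which is what spark.read.json expects by default), then splitting on line boundaries instead of byte counts keeps every record intact; split -l does that directly. A rough Python sketch of the same idea, with the chunk size and file names as made-up placeholders:

# Sketch: split a JSON-lines file into chunks of whole lines so every
# piece remains valid on its own.
CHUNK_LINES = 1_000_000

part, lines = 0, []
with open("test.json") as src:
    for i, line in enumerate(src, 1):
        lines.append(line)
        if i % CHUNK_LINES == 0:
            with open(f"test_part{part}.json", "w") as out:
                out.writelines(lines)
            part, lines = part + 1, []
if lines:
    with open(f"test_part{part}.json", "w") as out:
        out.writelines(lines)

If instead the whole file is a single JSON array or object, no byte- or line-based split will produce valid pieces; you would need a streaming parser to re-emit complete records.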

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

I'm currently working on a data mining project in Weka, using a dataset I found on my own. The only issue is that converting my file from CSV format into ARFF format is causing problems.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters, and I believe there are none left. The link to my dataset is here: https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file, where the first is the attribute names (the file is comma-separated):
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good at reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" packages can handle this conversion).
There is also another possibility: when I was creating my CSV file, the header had a different number of elements compared to the rest of the data. So, check the header as well...!
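ARFF is also a simple enough text format that you can generate it yourself if R isn't handy. A rough Python sketch under the assumption that every column is numeric (file names are placeholders):

import csv

# Sketch: convert a comma-separated file with a header row into a minimal ARFF.
# Adjust the @ATTRIBUTE lines if some columns are not numeric.
with open("hpi.csv", newline="") as src, open("hpi.arff", "w") as out:
    rows = list(csv.reader(src))
    header, data = rows[0], rows[1:]

    out.write("@RELATION hpi\n\n")
    for name in header:
        # Attribute names containing spaces must be quoted in ARFF.
        out.write(f"@ATTRIBUTE '{name}' NUMERIC\n")
    out.write("\n@DATA\n")

    writer = csv.writer(out)
    writer.writerows(data)

Either way, the number of header fields has to match the number of fields in every data row, which is exactly what the "Read 2, expected 5" error is complaining about.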

How can hadoop mapreduce get data input from CSV file?

I want to implement a Hadoop MapReduce job, and I'm using a CSV file for its input. So I want to ask: is there any method that Hadoop provides to get the values out of a CSV file, or do we just do it with Java's String split function?
Thanks all.....
By default Hadoop uses a text input reader that feeds the mapper line by line from the input file. The key in the mapper is the byte offset of the line within the file. Be careful with CSV files though, as single columns/fields can contain a line break. You might want to look for a CSV input reader like this one:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
But you still have to split each line yourself in your mapper code.
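As a side note, when you do the splitting yourself, a real CSV parser copes with quoted commas better than a plain String split. This is a hypothetical Hadoop Streaming mapper written in Python rather than the Java API, just to illustrate the per-line parsing:

#!/usr/bin/env python3
# Hypothetical streaming mapper: parse each incoming line as CSV
# (handling quoted fields) and emit the first column as the key.
import csv
import sys

for fields in csv.reader(sys.stdin):
    if not fields:
        continue
    key, rest = fields[0], fields[1:]
    print(f"{key}\t{','.join(rest)}")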

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using Cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, and I would like to use an extended ASCII character. I tried various approaches but they fail.
The following works, but I need to use an extended ASCII character for my purpose:
copy (<columnnames>) from <filename> with delimiter='|' and quote = '"';
copy (<columnnames>) from <filename> with delimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance.
A note on the COPY documentation page suggests that for bulk loading (as in your case), the json2sstable utility should be used. You can then load the sstables into your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from the ASCII table.
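As a rough illustration of the CSV-to-JSON conversion step (file names are placeholders, and the exact JSON layout json2sstable expects depends on your column family definition, so treat this as the generic conversion only):

import csv
import json

# Sketch: turn a pipe-delimited CSV with a header row into a JSON array
# of objects; reshape as needed for json2sstable.
with open("data.csv", newline="") as src, open("data.json", "w") as out:
    reader = csv.DictReader(src, delimiter="|")
    json.dump(list(reader), out)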
I had a similar problem and inspected the source code of cqlsh (it's a Python script). In my case I was generating the CSV with Python, so it was a matter of finding the right Python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using Python.
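To make that concrete, here is a minimal sketch of writing a file with exactly that dialect, so embedded " characters get escaped with \ and the default quote setting keeps working (the file name and rows are made up):

import csv

# Sketch: write the CSV with the same dialect cqlsh uses by default, so
# embedded double quotes are escaped instead of needing a new quotechar.
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')

rows = [
    ("k1", 'value with "quotes" inside'),  # made-up sample data
    ("k2", "plain value"),
]

with open("output.csv", "w", newline="") as fh:
    writer = csv.writer(fh, **csv_dialect_defaults)
    writer.writerows(rows)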