How to load CSV dataset with corrupted columns?

I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("table.csv")
After some validation, I found that some ids were null because one column sometimes contains a carriage return. That shifted all the following columns in a domino effect, corrupting the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?

You seem to have been lucky with inferSchema: the column names come from the header line, and type inference only depends on the rows it examines, so printSchema can show a correct structure even though individual rows are broken.
Since the CSV export file is broken, and assuming you want to process it with Spark (given its size, for example), read it using textFile and fix the ids. Then save the result as CSV and load it back.
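The "read it as text and fix it" step can be sketched in plain Python. This is only an illustration of the repair logic, not Spark code: the function name merge_broken_records and the sample data are made up, and it assumes fields are unquoted and never contain the separator. In Spark the same idea would need care around partition boundaries.

```python
def merge_broken_records(lines, expected_fields, sep=","):
    """Join physical lines until each record has the expected number of fields.

    Assumption (not from the original post): fields are unquoted and the
    separator never appears inside a field.
    """
    records = []
    buffer = ""
    for line in lines:
        # A stray line break splits one record in two; glue the pieces with a space.
        buffer = line if not buffer else buffer + " " + line
        if buffer.count(sep) + 1 >= expected_fields:
            records.append(buffer)
            buffer = ""
    if buffer:
        records.append(buffer)  # keep a trailing incomplete record for inspection
    return records


# Hypothetical sample: the carriage return inside "Ann\rSmith" breaks record 1.
raw = "id,name,city\n1,Ann\rSmith,Paris\n2,Bob,Rome\n"
fixed = merge_broken_records(raw.splitlines(), expected_fields=3)
print(fixed)
```

The repaired lines can then be written back out as a CSV file and loaded normally.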

I'm not sure what version of spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep fields together that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.
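spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", "true") // keep quoted fields with line breaks together
.csv("table.csv") // options must be set before csv(), which returns the DataFrame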
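Why quoting matters for fields with line breaks can be seen with a small plain-Python sketch. It uses the standard csv module rather than Spark, and the sample data is invented: a quote-aware parser keeps a record together only because the field containing the line break is quoted.

```python
import csv
import io

# Hypothetical data: the "comment" field of record 1 contains a line break.
data = 'id,comment,city\n1,"first line\nsecond line",Paris\n2,plain,Rome\n'

# Naive line splitting sees four physical lines, so record 1 is cut in half.
physical_lines = data.splitlines()

# A quote-aware parser treats the quoted line break as part of the field.
# newline="" passes the embedded line break through to the csv module unchanged.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines), "physical lines,", len(records), "records")
```

If the export does not quote such fields, no parser option can tell a record boundary from a line break inside a value, and the file has to be repaired first.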

Related

Pyspark: how to read CSV file with additional lines

I have a CSV file with some additional strings between the valid rows. Excel seems to do a good job when reading it (it just ignores the additional ones).
However, the story with Spark is a bit different.
I read it with spark.read.csv(path, header=True, multiLine=True, sep='|')
Is there some simple way to handle it?

How to read a csv in pyspark using error_bad_line = False as we use in pandas

I am trying to read a CSV into PySpark, but it has a text column that causes some bad lines in the data.
This text column also contains newline characters, so the data in the following columns gets corrupted.
I have tried using pandas and use some extra parameters to load my csv
a = pd.read_csv("Mycsvname.csv",sep = '~',quoting=csv.QUOTE_NONE, dtype = str,error_bad_lines=False, quotechar='~', lineterminator='\n' )
It is working fine in pandas but I want to load the csv in pyspark
So, is there any similar way to load a csv in pyspark with all the above parameters?
In the current version of Spark (I think it has been there from Spark 2.2 onwards), you can also read multi-line records from CSV.
If the newline is your only problem with the text column you can use a read command like this:
spark.read.csv("YOUR_FILE_NAME", header="true", escape="\"", quote="\"", multiLine=True)
Note: in our case the escape and quotation characters were both ", so you might want to change those options to your ~ and include sep='~'.
You can also look at the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=csv#pyspark.sql.DataFrameReader.csv) for more details

Spark File Format Escaping \n Loading CSV

I'm reading a pipe-delimited CSV data file using Spark. It's quote qualified. A block of text has a \n in it and it's causing the read to corrupt. What I don't understand is that it's quote-qualified text, so surely it should just skip that!? The rows themselves are CR+LF delimited.
Anyhow, it doesn't. How do I get around this? I can cleanse them out on extract, but that doesn't seem very elegant to me.
This is what I'm using to load the data
val sch = spark.table("db.mytable").schema
val df = spark.read
.format("csv")
.schema(sch)
.option("header", "true")
.option("delimiter", "|")
.option("mode", "PERMISSIVE")
.option("quote", "\"")
.load("/yadaydayda/mydata.txt")
Glad to know I'm not the only one who's dealt with this issue in Spark!
Spark reads files line by line by default, so CSVs with newlines inside fields cause problems for the parser. Reading line by line makes it easier for Spark to split and handle large CSV files; scanning all of the content for quotes would significantly hurt performance for a case that is usually not an issue in high-performance analytics.
For cases where I knew newlines were a possibility, I've used a third-party CSV parsing library: run the CSV "lines" through it (it handles the newlines correctly), strip the newlines, write/cache the result somewhere, and read from that cached file. For a production use case, those files would be loaded into a database. For log files or other data you don't want in a database, using Parquet as you suggested works pretty well, as does simply enforcing the absence of newlines somewhere before the files get to Spark.
Got around this by initially stripping them on extract. The final solution I settled on, however, was to use Parquet format on extract; then all these problems just go away.
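A minimal sketch of that pre-cleaning step, using Python's standard csv module in place of a third-party parser. The function name flatten_embedded_newlines and the sample text are hypothetical, and the sketch assumes the whole file fits in memory as a string.

```python
import csv
import io


def flatten_embedded_newlines(text, sep="|", replacement=" "):
    """Rewrite quoted CSV text so every record sits on one physical line."""
    # newline="" lets the csv module see line breaks inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=sep)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=sep, lineterminator="\n")
    for row in reader:
        # Replace any line break inside a field with the replacement string.
        cleaned = [replacement.join(field.splitlines()) for field in row]
        writer.writerow(cleaned)
    return out.getvalue()


# Hypothetical pipe-delimited sample with CR+LF row endings and a quoted \n.
sample = 'id|note\r\n1|"line one\nline two"\r\n2|ok\r\n'
cleaned_text = flatten_embedded_newlines(sample)
print(cleaned_text)
```

The cleaned output can be cached to disk and read by Spark with the plain line-by-line reader.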

Dealing with commas within a field in a csv file using pyspark

I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
Use the spark-csv package from Databricks.
Delimiters inside quoted fields are ignored; the default quote character is (").
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote character is (') instead of ("), you can configure that with this class.
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.
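The quote handling described in this answer can be tried out in plain Python with the standard csv module. This is a sketch and not the spark-csv API, and it assumes the field with the embedded comma is actually quoted in the file, which the sample line in the question does not show.

```python
import csv

# Assumed inputs: the same row as in the question, with the middle field quoted.
line_double = 'AAA_A,"BBB,B",CCC_C'
line_single = "AAA_A,'BBB,B',CCC_C"

# Splitting on every comma ignores the quotes and yields four pieces.
naive = line_double.split(",")

# A quote-aware parser ignores delimiters inside quotes (default quote is ").
parsed_double = next(csv.reader([line_double]))

# The quote character can be changed, here to a single quote.
parsed_single = next(csv.reader([line_single], quotechar="'"))

print(naive)
print(parsed_double)
print(parsed_single)
```

If the field is not quoted at all, the extra comma is indistinguishable from a delimiter and the data has to be fixed at export time.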
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    # Turn each pandas chunk into an RDD of row lists
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        # Union with the RDD accumulated so far
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        # First chunk: nothing accumulated yet
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
I'm (really) new to Pyspark, but have been using Pandas for the past years. What I'm going to put here might not be ultimately the best solution, but it works for me so I think it's worth posting here.
I'm encountering the same issue loading in a CSV file with extra comma embedded in one special field, which triggered an error if using Pyspark, but had no problem if using Pandas. So I looked around for a solution to deal with this extra delimiter, and the following piece of code solved my issue:
df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')
I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if the "BBB,B" somehow got parsed into two strings, spark is going to give an error message and print the whole line for you. And the 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully that's something to run experiments with.

Spark SQL - loading csv/psv files with some malformed records

We are loading hierarchies of directories of files with Spark and converting them to Parquet. There are tens of gigabytes in hundreds of pipe-separated files. Some are pretty big themselves.
Every, say, 100th file has a row or two that has an extra delimiter that makes the whole process (or the file) abort.
We are loading using:
sqlContext.read
.format("com.databricks.spark.csv")
.option("header", format("header"))
.option("delimiter", format("delimeter"))
.option("quote", format("quote"))
.option("escape", format("escape"))
.option("charset", "UTF-8")
// Column types are unnecessary for our current use cases.
//.option("inferschema", "true")
.load(glob)
Is there any extension or event-handling mechanism in Spark that we could attach to the logic that reads rows, so that a malformed row is simply skipped instead of failing the whole process?
(We are planning to do more pre-processing, but this would be the most immediate and critical fix.)
In your case it may not be the Spark parsing itself that fails. The default mode is PERMISSIVE, so Spark parses each row on a best-effort basis into a malformed record, which then causes problems further downstream in your processing logic.
You should be able to simply add the option:
.option("mode", "DROPMALFORMED")
like this:
sqlContext.read
.format("com.databricks.spark.csv")
.option("header", format("header"))
.option("delimiter", format("delimeter"))
.option("quote", format("quote"))
.option("escape", format("escape"))
.option("charset", "UTF-8")
// Column types are unnecessary for our current use cases.
//.option("inferschema", "true")
.option("mode", "DROPMALFORMED")
.load(glob)
and it'll skip the lines with incorrect number of delimiters or which don't match the schema, rather than letting them cause errors later on in the code.
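The difference between the parse modes can be approximated in plain Python. This is a rough sketch of the documented behaviour, not Spark's implementation: parse_rows and the sample lines are made up, quoting is ignored, and a row counts as malformed only when its field count differs from the expected one.

```python
def parse_rows(lines, n_cols, sep="|", mode="PERMISSIVE"):
    """Approximate the CSV parse modes on already-split lines.

    Simplification: no quote handling; "malformed" means wrong field count.
    """
    rows = []
    for line in lines:
        fields = line.split(sep)
        if len(fields) == n_cols:
            rows.append(fields)
        elif mode == "DROPMALFORMED":
            continue  # silently skip the bad row
        elif mode == "FAILFAST":
            raise ValueError(f"malformed record: {line!r}")
        else:
            # PERMISSIVE (also used for any unknown mode in this sketch):
            # drop extra tokens, pad missing ones with None.
            padded = fields[:n_cols] + [None] * (n_cols - len(fields))
            rows.append(padded)
    return rows


# Hypothetical sample: row 2 has an extra delimiter, row 3 is one field short.
sample_lines = ["1|a|x", "2|b|y|EXTRA", "3|c"]
permissive = parse_rows(sample_lines, 3)
dropped = parse_rows(sample_lines, 3, mode="DROPMALFORMED")
print(permissive)
print(dropped)
```

In this sketch the permissive result keeps all three rows in altered form, while DROPMALFORMED keeps only the first one, which is the trade-off to weigh before dropping rows silently.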