Dealing with commas within a field in a csv file using pyspark - csv

I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?

Use spark-csv class from databriks.
Delimiters between quotes, by default ("), are ignored.
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote is (') instance of ("), you could configure with this class.
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.

If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])

I'm (really) new to Pyspark, but have been using Pandas for the past years. What I'm going to put here might not be ultimately the best solution, but it works for me so I think it's worth posting here.
I'm encountering the same issue loading in a CSV file with extra comma embedded in one special field, which triggered an error if using Pyspark, but had no problem if using Pandas. So I looked around for a solution to deal with this extra delimiter, and the following piece of code solved my issue:
df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')
I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if the "BBB,B" somehow got parsed into two strings, spark is going to give an error message and print the whole line for you. And the 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully that's something to run experiments with.

Related

PySpark: How to specify column with comma as decimal

I am working with PySpark and loading a csv file. I have a column with numbers in European format, which means that comma replaces the dot and vice versa.
For example: I have 2.416,67 instead of 2,416.67.
My data in .csv file looks like this -
ID; Revenue
21; 2.645,45
23; 31.147,05
.
.
55; 1.009,11
In pandas, such a file can easily be read by specifying decimal=',' and thousands='.' options inside pd.read_csv() to read European formats.
Pandas code:
import pandas as pd
df=pd.read_csv("filepath/revenues.csv",sep=';',decimal=',',thousands='.')
I don't know how can this be done in PySpark.
PySpark code:
from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
StructField("ID", StringType(), True),
StructField("Revenue", FloatType(), True)
])
df=spark.read.csv("filepath/revenues.csv",sep=';',encoding='UTF-8', schema=schema, header=True)
Can anyone suggest as to how we can load such a file in PySpark using the above mentioned .csv() function?
You won't be able to read it as a float because the format of the data. You need to read it as a string, clean it up and then cast to float:
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import FloatType
df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))
df = df.withColumn('revenue', df['revenue'].cast("float"))
You can probably just chain these all together too:
df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
df
.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
.withColumn('revenue', regexp_replace('revenue', ',', '.'))
.withColumn('revenue', df['revenue'].cast("float"))
)
Please note this I haven't tested this so there may be a typo or two in there.
If your dataset has lots of float columns, but the size of the dataset is still small enough to preprocess it first with pandas, I found it easier to just do the following.
import pandas as pd
df_pandas = pd.read_csv('yourfile.csv', sep=';', decimal=',')
df_pandas.to_csv('yourfile__dot_as_decimal_separator.csv', sep=';', decimal='.') # optionally also header=True of course.
df_spark = spark.csv.read('yourfile__dot_as_decimal_separator.csv', sep=';', inferSchema=True) # optionally also header=True of course.
I did find jhole89's answer very useful, but found it a pain to apply it on a dataset with a lot of columns (multiple hundreds).
I mean:
manually specifying float columns and converting them is a lot of effort,
trying to find them dynamically by checking which columns are string-typed and contain a comma, avoiding that datetime columns with millesecond separators aren't taken into account etc., casting to float that fails on certain columns because they are text containing comma's but aren't intended to be parsed as float numbers: this causes headaches.
Therefore, if there are multiple float columns and your dataset can be preprocessed with pandas, you can apply the above code.
Make sure your SQL table is pre-formatted to read NUMERIC instead of INTEGER. I had a big trouble trying to figure out all about encoding and the different formats of dots and commas and etc. and in the end the problem was much more primitive, it was pre-formatted to read only INTEGER numbers, and therefore no decimals would ever be accepted, no matter if with commas or dots. Then I just had to change my SQL table to accept real numbers (NUMERIC) instead and that was it.

How to read a csv in pyspark using error_bad_line = False as we use in pandas

I am trying to read a csv into pyspark but the problem is that it has a text column due to which there are some bad line in the data
This text column also contains the new line characters due to which the data in further columns is getting corrupted
I have tried using pandas and use some extra parameters to load my csv
a = pd.read_csv("Mycsvname.csv",sep = '~',quoting=csv.QUOTE_NONE, dtype = str,error_bad_lines=False, quotechar='~', lineterminator='\n' )
It is working fine in pandas but I want to load the csv in pyspark
So, is there any similar way to load a csv in pyspark with all the above parameters?
In the current version of spark (I think it is even there from spark 2.2 onwards), you can also read multi-line from csv.
If the newline is your only problem with the text column you can use a read command like this:
spark.read.csv("YOUR_FILE_NAME", header="true", escape="\"", quote="\"", multiLine=True)
Note: in our case the escape and quotation characters where both " so you might want to edit those options with your ~ and include sep = '~'.
You can also look at the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=csv#pyspark.sql.DataFrameReader.csv) for more details

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be convert the Spark dataframe to a pandas dataframe and save it as csv:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fits into driver. It works for real cases that you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv
def spark_to_csv(df, file_path):
""" Converts spark dataframe to CSV file """
with open(file_path, "w") as f:
writer = csv.DictWriter(f, fieldnames=df.columns)
writer.writerow(dict(zip(fieldnames, fieldnames)))
for row in df.toLocalIterator():
writer.writerow(row.asDict())
If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more databricks'y' solution is here:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = os.path.join(TEMPORARY_TARGET, dbutils.fs.ls(TEMPORARY_TARGET)[3][1])
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note if you are working from Koalas data frame you can replace spark_df with koalas_df.to_spark()
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no dataframe spark API which writes/creates a single file instead of directory as a result of write operation.
Below both options will create one single file inside directory along with standard files (_SUCCESS , _committed , _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and take advantage of sparks parallelism for writing files then it will create multiple data files inside directory.
You need to write function in driver which will combine all data file parts to single file(cat part-00000* singlefilename ) once write operation is done.
I had the same problem and used python's NamedTemporaryFile library to solve this.
from tempfile import NamedTemporaryFile
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create temp folder inside output folder. Copy file part-00000* with the file name to output folder. Delete the temp folder. Python code snippet to do the same in Databricks.
fpath=output+'/'+'temp'
def file_exists(path):
try:
dbutils.fs.ls(path)
return True
except Exception as e:
if 'java.io.FileNotFoundException' in str(e):
return False
else:
raise
if file_exists(fpath):
dbutils.fs.rm(fpath)
df.coalesce(1).write.option("header", "true").csv(fpath)
else:
df.coalesce(1).write.option("header", "true").csv(fpath)
fname=([x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')])
dbutils.fs.cp(fpath+"/"+fname[0], output+"/"+"name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides file pointer for hdfs file system. You can write your content to file pointer as a usual file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initiate pyarrow you should define environment CLASSPATH properly, set the output of hadoop classpath --glob to it
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
you can use this and if you don't want to give the name of CSV everytime you can write UDF or create an array of the CSV file name and give it to this it will work

How to load CSV dataset with corrupted columns?

I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("table.csv")
After doing some validations, I find out that some ids were null because a column sometimes has a carriage return. And that dislocated all next columns, with a domino effect, corrupting all the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?
You seemed to have had a lot of luck with inferSchema that it worked fine (since it only reads few records to infer the schema) and so printSchema gives you a correct result.
Since the CSV export file is broken and assuming you want to process the file using Spark (given its size for example) read it using textFile and fix the ids. Save it as CSV format and load it back.
I'm not sure what version of spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep fields together that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.
spark.read
.csv("table.csv")
.option("header", "true")
.option("inferSchema", "true")
**.option("multiLine", "true")**

Saving Pandas DataFrame and meta-data to JSON format

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)
Background
A) I can successfully read/write my rather large Pandas Dataframe from/to JSON using DataFrame.to_json() and DataFrame.from_json(). No problems.
B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()
My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)
Failure modes
C) I have found that even the simplified case of
df_dict = df.to_dict()
json.dump(df_dict, fp)
fails with:
TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
D) Investigating, I've found that the complement also fails.
df.to_json(fp)
json.load(fp)
fails with
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)
So it appears that Pandas JSON format and the Python's JSON library are not compatible.
My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.
Question
What is the cannonical/recommended method for adding metadata to a Pandas DataFrame and storing to a JSON-formatted file?
Python 2.7.10
Pandas 0.17
Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.
Lesson: Read the documentation kids, even when it's your own documentation.
Edit 2:
If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.
You should be able to just put the data on separate lines.
Writing:
f = open('test.json', 'w')
df.to_json(f)
print >> f
json.dump(metadata, f)
Reading:
f = open('test.json')
df = pd.read_json(next(f))
metdata = json.loads(next(f))
In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.
In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.
If you similarly need to store the data and metadata in a single JSON blob, the following code will work:
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
json.dump(top_level_dict, outfile)
Just make sure DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.
Reading the data back in:
with open(FILENAME, 'r') as infile:
top_level_dict = json.load(infile)
df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame().as_dict(df_as_dict)
meta = top_level_dict['metadata']
At this point, you'll need to re-create your Multi-Index (if applicable)