PySpark: How to specify column with comma as decimal - csv

I am working with PySpark and loading a csv file. I have a column with numbers in European format, which means that comma replaces the dot and vice versa.
For example: I have 2.416,67 instead of 2,416.67.
My data in .csv file looks like this -
ID; Revenue
21; 2.645,45
23; 31.147,05
.
.
55; 1.009,11
In pandas, such a file can easily be read by specifying decimal=',' and thousands='.' options inside pd.read_csv() to read European formats.
Pandas code:
import pandas as pd
df=pd.read_csv("filepath/revenues.csv",sep=';',decimal=',',thousands='.')
I don't know how can this be done in PySpark.
PySpark code:
from pyspark.sql.types import StructType, StructField, FloatType, StringType
schema = StructType([
StructField("ID", StringType(), True),
StructField("Revenue", FloatType(), True)
])
df=spark.read.csv("filepath/revenues.csv",sep=';',encoding='UTF-8', schema=schema, header=True)
Can anyone suggest as to how we can load such a file in PySpark using the above mentioned .csv() function?

You won't be able to read it as a float because the format of the data. You need to read it as a string, clean it up and then cast to float:
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import FloatType
df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))
df = df.withColumn('revenue', df['revenue'].cast("float"))
You can probably just chain these all together too:
df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
df
.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
.withColumn('revenue', regexp_replace('revenue', ',', '.'))
.withColumn('revenue', df['revenue'].cast("float"))
)
Please note this I haven't tested this so there may be a typo or two in there.

If your dataset has lots of float columns, but the size of the dataset is still small enough to preprocess it first with pandas, I found it easier to just do the following.
import pandas as pd
df_pandas = pd.read_csv('yourfile.csv', sep=';', decimal=',')
df_pandas.to_csv('yourfile__dot_as_decimal_separator.csv', sep=';', decimal='.') # optionally also header=True of course.
df_spark = spark.csv.read('yourfile__dot_as_decimal_separator.csv', sep=';', inferSchema=True) # optionally also header=True of course.
I did find jhole89's answer very useful, but found it a pain to apply it on a dataset with a lot of columns (multiple hundreds).
I mean:
manually specifying float columns and converting them is a lot of effort,
trying to find them dynamically by checking which columns are string-typed and contain a comma, avoiding that datetime columns with millesecond separators aren't taken into account etc., casting to float that fails on certain columns because they are text containing comma's but aren't intended to be parsed as float numbers: this causes headaches.
Therefore, if there are multiple float columns and your dataset can be preprocessed with pandas, you can apply the above code.

Make sure your SQL table is pre-formatted to read NUMERIC instead of INTEGER. I had a big trouble trying to figure out all about encoding and the different formats of dots and commas and etc. and in the end the problem was much more primitive, it was pre-formatted to read only INTEGER numbers, and therefore no decimals would ever be accepted, no matter if with commas or dots. Then I just had to change my SQL table to accept real numbers (NUMERIC) instead and that was it.

Related

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I dont see a parameter to select the encoding of the file but it still works without issue(which is great! but i dont understand why / if it only auto handles certain encodings). The only parameters im using are setting the delimiter="|" (with ParseOptions) and auto_dict_encode=True with (ConvertOptions).
How is pyarrow handling different encoding types?
pyarrow currently has no functionality to deal with different encodings, and assumes UTF8 for string/text data.
But the reason it doesn't raise an error is that pyarrow will read any non-UTF8 strings as a "binary" type column, instead of "string" type.
A small example:
# writing a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
f.writelines(["col1,col2\n", "u,ù"])
Reading with pyarrow gives string for the first column (which only contains ASCII characters, thus also valid UTF8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
With pandas you indeed get an error by default (because pandas has no binary data type, and will try to read all text columns as python strings, thus UTF8):
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
col1 col2
0 u ù
It's now possible to specify encodings with pyarrow.read_csv.
According to the pyarrow docs for read_csv:
The encoding can be changed using the ReadOptions class.
A minimal example follows:
from pyarrow import csv
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', options)
From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.

How to read a csv in pyspark using error_bad_line = False as we use in pandas

I am trying to read a csv into pyspark but the problem is that it has a text column due to which there are some bad line in the data
This text column also contains the new line characters due to which the data in further columns is getting corrupted
I have tried using pandas and use some extra parameters to load my csv
a = pd.read_csv("Mycsvname.csv",sep = '~',quoting=csv.QUOTE_NONE, dtype = str,error_bad_lines=False, quotechar='~', lineterminator='\n' )
It is working fine in pandas but I want to load the csv in pyspark
So, is there any similar way to load a csv in pyspark with all the above parameters?
In the current version of spark (I think it is even there from spark 2.2 onwards), you can also read multi-line from csv.
If the newline is your only problem with the text column you can use a read command like this:
spark.read.csv("YOUR_FILE_NAME", header="true", escape="\"", quote="\"", multiLine=True)
Note: in our case the escape and quotation characters where both " so you might want to edit those options with your ~ and include sep = '~'.
You can also look at the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=csv#pyspark.sql.DataFrameReader.csv) for more details

Dealing with commas within a field in a csv file using pyspark

I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
Use spark-csv class from databriks.
Delimiters between quotes, by default ("), are ignored.
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote is (') instance of ("), you could configure with this class.
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
I'm (really) new to Pyspark, but have been using Pandas for the past years. What I'm going to put here might not be ultimately the best solution, but it works for me so I think it's worth posting here.
I'm encountering the same issue loading in a CSV file with extra comma embedded in one special field, which triggered an error if using Pyspark, but had no problem if using Pandas. So I looked around for a solution to deal with this extra delimiter, and the following piece of code solved my issue:
df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')
I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if the "BBB,B" somehow got parsed into two strings, spark is going to give an error message and print the whole line for you. And the 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully that's something to run experiments with.

Saving Pandas DataFrame and meta-data to JSON format

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)
Background
A) I can successfully read/write my rather large Pandas Dataframe from/to JSON using DataFrame.to_json() and DataFrame.from_json(). No problems.
B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()
My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)
Failure modes
C) I have found that even the simplified case of
df_dict = df.to_dict()
json.dump(df_dict, fp)
fails with:
TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
D) Investigating, I've found that the complement also fails.
df.to_json(fp)
json.load(fp)
fails with
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)
So it appears that Pandas JSON format and the Python's JSON library are not compatible.
My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.
Question
What is the cannonical/recommended method for adding metadata to a Pandas DataFrame and storing to a JSON-formatted file?
Python 2.7.10
Pandas 0.17
Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.
Lesson: Read the documentation kids, even when it's your own documentation.
Edit 2:
If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.
You should be able to just put the data on separate lines.
Writing:
f = open('test.json', 'w')
df.to_json(f)
print >> f
json.dump(metadata, f)
Reading:
f = open('test.json')
df = pd.read_json(next(f))
metdata = json.loads(next(f))
In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.
In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.
If you similarly need to store the data and metadata in a single JSON blob, the following code will work:
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
json.dump(top_level_dict, outfile)
Just make sure DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.
Reading the data back in:
with open(FILENAME, 'r') as infile:
top_level_dict = json.load(infile)
df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame().as_dict(df_as_dict)
meta = top_level_dict['metadata']
At this point, you'll need to re-create your Multi-Index (if applicable)

Python 3 code to read CSV file, manipulate then create new file....works, but looking for improvements

This is my first ever post here. I am trying to learn a bit of Python. Using Python 3 and numpy.
Did a few tutorials then decided to dive in and try a little project I might find useful at work as thats a good way to learn for me.
I have written a program that reads in data from a CSV file which has a few rows of headers, I then want to extract certain columns from that file based on the header names, then output that back to a new csv file in a particular format.
The program I have works fine and does what I want, but as I'm a newbie I would like some tips as to how I can improve my code.
My main data file (csv) is about 57 columns long and about 36 rows deep so not big.
It works fine, but looking for advice & improvements.
import csv
import numpy as np
#make some arrays..at least I think thats what this does
A=[]
B=[]
keep_headers=[]
#open the main data csv file 'map.csv'...need to check what 'r' means
input_file = open('map.csv','r')
#read the contents of the file into 'data'
data=csv.reader(input_file, delimiter=',')
#skip the first 2 header rows as they are junk
next(data)
next(data)
#read in the next line as the 'header'
headers = next(data)
#Now read in the numeric data (float) from the main csv file 'map.csv'
A=np.genfromtxt('map.csv',delimiter=',',dtype='float',skiprows=5)
#Get the length of a column in A
Alen=len(A[:,0])
#now read the column header values I want to keep from 'keepheader.csv'
keep_headers=np.genfromtxt('keepheader.csv',delimiter=',',dtype='unicode_')
#Get the length of keep headers....i.e. how many headers I'm keeping.
head_len=len(keep_headers)
#Now loop round extracting all the columns with the keep header titles and
#append them to array B
i=0
while i < head_len:
#use index to find the apprpriate column number.
item_num=headers.index(keep_headers[i])
i=i+1
#append the selected column to array B
B=np.append(B,A[:,item_num])
#now reshape the B array
B=np.reshape(B,(head_len,36))
#now transpose it as thats the format I want.
B=np.transpose(B)
#save the array B back to a new csv file called 'cmap.csv'
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")
Thanks.
You can greatly simplify your code using more of numpy capabilities.
A = np.loadtxt('stack.txt',skiprows=2,delimiter=',',dtype=str)
keep_headers=np.loadtxt('keepheader.csv',delimiter=',',dtype=str)
headers = A[0,:]
cols_to_keep = np.in1d( headers, keep_headers )
B = np.float_(A[1:,cols_to_keep])
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")