how to load large csv with many fields to Spark - csv

Happy New Year!!!
I know this type of similar question has been asked/answered before, however, mine is different:
I have large size csv with 100+ fields and 100MB+, I want to load it to Spark (1.6) for analysis, the csv's header looks like the attached sample (only one line of the data)
Thank you very much.
UPDATE 1(2016.12.31.1:26pm EST):
I use the following approach and was able to load data (sample data with limited columns), however, I need to auto assign the header (from the csv) as the field's name in the DataFrame, BUT, the DataFrame looks like:
Can anyone tell me how to do it? Note, any manual manner is what I want to avoid.
>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5)
>>> df = rdd.toDF()
>>> df.show(5)

As noted in the comments you can use spark.read.csv for spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html)
df = spark.read.csv('your_file.csv', header=True, inferSchema=True)
Setting header to True will parse the header to column names of the dataframe. Setting inferSchema to True will get the table schema (but will slow down reading).
See also here:
Load CSV file with Spark

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and using a token within the JSON return pass that pack to a function top call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However the problem I encounter is roughly on file 66 the system (Python/Databricks/Spark) decides to change the file type of a field. It is always a string and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that the unionbyName fails because of different datatypes.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema so you can to a unionByName with the allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
StructField('file_name',StringType(),True),
StructField('id',LongType(),True),
StructField('dataset_name',StringType(),True),
StructField('snapshotdate',TimestampType(),True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps

How do I specify a dtype for all columns when reading a CSV file with pyarrow?

I wanna read a big CSV file with pyarrow. All my columns are float64's. But pyarrow seems to be inferring int64.
How do I specify a dtype for all columns?
import gcsfs
import pyarrow.dataset as ds
fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')
my_dataset = ds.dataset("bucket/foo/bar.csv", format="csv", filesystem=fs)
my_dataset.to_table()
which produces:
ArrowInvalid Traceback (most recent call last)
........py in <module>
----> 65 my_dataset.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:491, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:3235, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:99, in pyarrow.lib.check_status()
ArrowInvalid: In CSV column #172: Row #28: CSV conversion error to int64: invalid value '6.58841482364418'
Pyarrow's dataset module reads CSV files in chunks (the default is 1MB I think) and it processes those chunks in parallel. This makes column inference a bit tricky and it handles this by using the first chunk to infer data types. So the error you are getting is very common when the first chunk of the file has a column that looks integral but in future chunks the column has decimal values.
If you know the column names in advance then you can specify the data types of the columns:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
dataset = ds.dataset('/tmp/foo.csv', format=custom_csv_format)
If you don't know the column names then things are a bit trickier. However, it sounds like ALL columns are float64. In that case, since you only have one file, you can probably do something like this as a workaround:
dataset = ds.dataset('/tmp/foo.csv', format='csv')
column_types = {}
for field in dataset.schema:
column_types[field.name] = pa.float64()
# Now use column_types as above
This works because we call pa.dataset(...) twice and it will have a small bit of overhead. This is because each time we call pa.dataset(...) pyarrow will open the first chunk of the first file in the dataset to determine the schema (this is why we can use dataset.schema)
If you have multiple files with different columns then this approach won't work. In that case I'd recommend mailing the Arrow user# mailing list and we can have a more general discussion about different ways to solve the problem.

Spark Option: inferSchema vs header = true

Reference to pyspark: Difference performance for spark.read.format("csv") vs spark.read.csv
I thought I needed .options("inferSchema" , "true") and .option("header", "true") to print my headers but apparently I could still print my csv with headers.
What is the difference between header and schema? I don't really understand the meaning of "inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default".
The header and schema are separate things.
Header:
If the csv file have a header (column names in the first row) then set header=true. This will use the first row in the csv file as the dataframe's column names. Setting header=false (default option) will result in a dataframe with default column names: _c0, _c1, _c2, etc.
Setting this to true or false should be based on your input file.
Schema:
The schema refered to here are the column types. A column can be of type String, Double, Long, etc. Using inferSchema=false (default option) will give a dataframe where all columns are strings (StringType). Depending on what you want to do, strings may not work. For example, if you want to add numbers from different columns, then those columns should be of some numeric type (strings won't work).
By setting inferSchema=true, Spark will automatically go through the csv file and infer the schema of each column. This requires an extra pass over the file which will result in reading a file with inferSchema set to true being slower. But in return the dataframe will most likely have a correct schema given its input.
As an alternative to reading a csv with inferSchema you can provide the schema while reading. This have the advantage of being faster than inferring the schema while giving a dataframe with the correct column types. In addition, for csv files without a header row, column names can be given automatically. To provde schema see e.g.: Provide schema while reading csv file as a dataframe
There are two ways we can specify schema while reading the csv file.
Way1: Specify the inferSchema=true and header=true.
val myDataFrame = spark.read.options(Map("inferSchema"->"true", "header"->"true")).csv("/path/csv_filename.csv")
Note: Using this approach while reading data, it will create one more additional stage.
Way2: Specify the schema explicitly.
val schema = new StructType()
.add("Id",IntegerType,true)
.add("Name",StringType,true)
.add("Age",IntegerType,true)
val myDataFrame = spark.read.option("header", "true")
.schema(schema)
.csv("/path/csv_filename.csv")

use dask to store larger then memory csv file(s) to hdf5 file

Task: read larger than memory csv files, convert to arrays and store in hdf5.
One simple way is to use pandas to read the files in chunks
but I wanted to use dask, so far without success:
Latest attempt:
fname='test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
dset.to_records().to_hdf5('/tmp/test.h5', '/x')
How could I do this?
Actually, I have a set of csv files representing 2D slices of a 3D array
that I would like to assemble and store. A suggestion on how to do the latter
would be welcome as well.
Given the comments below, here is one of many variations I tried:
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr = da.Array( dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)
which results in the error:
KeyError: ('arr12345', 77, 0)
You will probably want to do something like the following. The real crux of the problem is, that in the read-csv case, dask doesn't know the number of rows of the data before a full load, and therefore the resultant data-frame has an unknown length (as is the usual case for data-frames). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.
Here is an example.
Data
0,1,2
2,3,4
Code
dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
arr = dset.astype('float').to_dask_array(True)
arr.to_hdf5('/test.h5', '/x')
Where "True" means "find the lengths", or you can supply your own set of values.
You should use the to_hdf method on dask dataframes instead of on dask arrays
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')
Alternatively, you might consider using parquet. This will be faster and is simpler in many ways
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')
For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html
For arrays
If for some reason you really want to convert to a dask array first then you'll need to figure out how many rows each chunk of your data has and assign that to chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach though, it's needlessly complex.

PySpark: How to Read Many JSON Files, Multiple Records Per File

I have a large dataset stored in a S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't on the highest level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records > one file (in which case I'd use sc.textFile, though added complication here of JSON) nor each record > one file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
for dicts in content['interactions']
yield Row(**dicts)
pkeys = sc.parallelize(keyList) #keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference
What about using DataFrames?
does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so you could use boto to list each object you want to add, read them in and union them with each other.
from pyspark.sql import SQLContext
import boto3
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'
#Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)
object_list = [k for k in bucket.objects.all() ]
key_list = [k.key for k in bucket.objects.all()]
paths = ['s3n://'+o.bucket_name+'/'+ o.key for o in object_list ]
dataframes = [sqlContext.read.json(path) for path in paths]
df = dataframes[0]
for idx, frame in enumerate(dataframes):
df = df.unionAll(frame)
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you just be able to say textFile("/my/dir/*.json").