pyspark: reduce size of JSON variable

I'm trying to analyse a JSON file that contains data from the Twitter API. The file is 2GB so it takes ages to load or attempt to run any analysis on.
So in pyspark I load it up:
df = sqlContext.read.json('/data/statuses.log.2014-12-30.gz')
This takes about 20 minutes, as does any further analysis, so I want to look at just a small section of the dataset so I can test my scripts quickly and easily. I tried
df = df.head(1000)
but this seems to alter the dataset somehow, so when I try
print(df.groupby('lang').count().sort(desc('count')).show())
I get the error
AttributeError: 'list' object has no attribute 'groupby'
Is there any way I can reduce the size of the data so I can play around with it without having to wait ages each time?

Solved it: head(1000) returns a plain Python list of Row objects, while limit(1000) returns a DataFrame:
df = df.limit(1000)
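As a rough sketch of the quick-test loop (the cache() call is my own suggestion, not part of the original answer), you can also cache the reduced DataFrame so repeated queries don't re-scan the 2GB file:
from pyspark.sql.functions import desc

# Load once, keep only 1000 rows, and hold that small sample in memory
sample = sqlContext.read.json('/data/statuses.log.2014-12-30.gz').limit(1000).cache()
sample.groupby('lang').count().sort(desc('count')).show()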

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get the next "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file into a data frame after it has been downloaded, and then unioning that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read in.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it is inferred as a boolean. The problem is then that the unionByName fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

# Define the schema up front so Spark never has to infer the types
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# Empty DataFrame with the fixed schema to union everything into
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output = output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps
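The other option raised in the question, explicitly setting the schema the JSON is read into, also works and avoids inference entirely. A minimal sketch reusing my_schema from above (the wildcard path is a placeholder, not from the original post):
all_files = '/mnt/raw/api_downloads/*.json'  # placeholder location of the 88 downloaded files
df_all = spark.read.schema(my_schema).json(all_files)  # every file is parsed with the same fixed types
With the schema supplied up front, the string/boolean flip on file 66 cannot happen.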

Neo4J Cypher - Is it quicker to load from 100k Json Files or 1 file with 100k entries?

I am performing a daily load of 100k+ JSON files into a Neo4j database, which is taking approximately 2 to 3 hours each day.
I would like to know whether Neo4j would run quicker if the files were all rolled into one large file and then iterated through by the database.
If so, I will need to learn how to do this in Python, but I would just like to know before embarking on the work.
This is the current code snippet I use to load the files; the range can change each day based on generated filenames, which are based on IDs in the JSON records.
UNWIND range(215300000,215457000) as id
WITH DISTINCT id+"_20220103.json" as file
CALL apoc.load.json("file:///output/"+file,null, {failOnError:false})
YIELD value
Thank you!
The JSON construction in Python was updated to include all 150k+ JSON objects in one file, and the Cypher was updated to iterate over that file and run the code against each JSON object. I initially tried a batch size of 1000 and then 100, but they resulted in many lock exceptions, where the code must have been attempting to update the same nodes at the same time, so I reduced the batch size down to 1. It now loads about 99% of the JSON objects on a first pass in 7 minutes... much better than the initial 2 to 3 hours :-)
Code I am now using:
CALL apoc.periodic.iterate(
  'CALL apoc.load.json("file:///20220107.json") YIELD value',
  'UNWIND value as item.... perform other actions...
  ', { batchSize:1, parallel:true })
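For reference, a minimal sketch of the Python merge step described above, assuming the individual records are stored one JSON object per file in an output directory (the paths and file names here are placeholders, not from the original code):
import glob
import json

merged = []
for path in glob.glob('output/*_20220107.json'):  # hypothetical per-record files for the day
    with open(path) as f:
        merged.append(json.load(f))  # each file holds a single JSON object

with open('20220107.json', 'w') as f:
    json.dump(merged, f)  # one file containing a JSON array of all the objects
apoc.load.json then yields that array as value, which the UNWIND above iterates over.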

Why does Spark read data even when no action is called?

I am confused about Spark's lazy loading when using spark.read.json.
I have the following code:
df_location_user_profile = [
    f"hdfs://hdfs_cluster:8020/data/*/*"
]
df_json = spark.read.json(df_location_user_profile)
The JSON data on HDFS is partitioned by year and month (year=yyyy, month=mm), and I want to retrieve all data of that dataset.
For this code block I only define the read from that location, and no action is executed. But on the Spark UI I found a stage with a giant amount of input data.
As I understand it, Spark's lazy evaluation means it will not read data until an action is called, so this confuses me.
After that, I call the count() action; a new stage is created and Spark reads the data again.
My question is: why does Spark read data when no action is called (in that first job/stage)? How can I optimize this?
It is doing a pass over the data to evaluate the schema, as it was not supplied, i.e. schema inference.
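A sketch of how you might skip that inference pass by supplying the schema yourself (the field names below are assumptions, since the post does not show the real ones):
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical fields; replace with the actual columns of the dataset
user_profile_schema = StructType([
    StructField('user_id', LongType(), True),
    StructField('location', StringType(), True),
])

# No inference job is triggered; data is only read when an action such as count() runs
df_json = spark.read.schema(user_profile_schema).json(df_location_user_profile)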

use dask to store larger than memory csv file(s) to hdf5 file

Task: read larger than memory csv files, convert to arrays and store in hdf5.
One simple way is to use pandas to read the files in chunks, but I wanted to use dask, so far without success:
Latest attempt:
fname='test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
dset.to_records().to_hdf5('/tmp/test.h5', '/x')
How could I do this?
Actually, I have a set of csv files representing 2D slices of a 3D array that I would like to assemble and store. A suggestion on how to do the latter would be welcome as well.
Given the comments below, here is one of many variations I tried:
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr = da.Array( dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)
which results in the error:
KeyError: ('arr12345', 77, 0)
You will probably want to do something like the following. The real crux of the problem is that, in the read-csv case, dask doesn't know the number of rows in the data before a full load, and therefore the resultant data-frame has an unknown length (as is the usual case for data-frames). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.
Here is an example.
Data
0,1,2
2,3,4
Code
import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
arr = dset.astype('float').to_dask_array(True)  # True = compute the chunk lengths first
arr.to_hdf5('/test.h5', '/x')
Where "True" means "find the lengths", or you can supply your own set of values.
You should use the to_hdf method on dask dataframes instead of on dask arrays
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')
Alternatively, you might consider using parquet. This will be faster and is simpler in many ways
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')
For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html
For arrays
If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach though; it's needlessly complex.

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); however, there are around 5k files which I have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DF out of the transformed JSON, and dump it as parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # Download the object for this key and turn its JSON body into a Row
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
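From there, a sketch of getting from the RDD of Rows to a DataFrame and a parquet output (the output path is a placeholder):
df = sqlContext.createDataFrame(dataRdd)  # Rows -> DataFrame
df.write.parquet('s3a://bucketName/output_parquet/')  # placeholder output location on s3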
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'       # e.g. 'path/file.json' -> 'path/file.parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of json to parquet files. Spark's read/write functionality is designed to be called by the driver and needs access to sc and sqlContext. This is another reason why having one parquet directory is likely the way to go.
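If you do go with a single merged parquet directory, a sketch of that route (assuming keyList holds full s3a:// paths; the output path is a placeholder):
df_all = sqlContext.read.json(keyList)  # read all the gzipped JSON files in one distributed job
df_all.write.parquet('s3a://bucketName/merged_parquet/')  # one partitioned parquet dataset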