How to decompress LZ4 JSON in Spark 2 - json

I have downloaded an xxxx.json.lz4 file from https://censys.io/, but when I try to read the file using the following line I get no data back (a count of 0):
metadata_lz4 = spark.read.json("s3n://file.json.lz4")
It returns no results, although decompressing the file manually works fine and the result can be imported into Spark.
I have also tried
val metadata_lz4_2 = spark.sparkContext.newAPIHadoopFile("s3n://file.json.lz4", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
which also returns no results.
I have multiple of these files, each around 100 GB, so I am really keen on not having to decompress each one manually.
Any ideas?

According to this open issue, the Spark LZ4 decompressor uses a different specification than the standard LZ4 decompressor.
Hence, until this issue is resolved, you won't be able to use Spark's LZ4 codec to decompress standard LZ4-compressed files.
I don't think our Lz4Codec implementation actually uses the FRAME specification (http://cyan4973.github.io/lz4/lz4_Frame_format.html) when creating text based files. It seems it was added in as a codec for use inside block compression formats such as SequenceFiles/HFiles/etc., but wasn't oriented towards Text files from the looks of it, or was introduced at a time when there was no FRAME specification of LZ4.
Therefore, fundamentally, we are not interoperable with the lz4 utility. The difference is very similar to the GPLExtras' LzoCodec vs. LzopCodec, the former is just the data compressing algorithm, but the latter is an actual framed format interoperable with the lzop CLI utility.
To make ourselves interoperable, we'll need to introduce a new frame wrapping codec such as LZ4FrameCodec, and users could use that when they want to decompress or compress text data produced/readable by lz4/lz4cat CLI utilities.

I managed to parse LZ4-compressed files in PySpark this way:
import lz4.frame
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
list_paths = ['/my/file.json.lz4', '/my/beautiful/file.json.lz4']
# binaryFiles accepts a comma-separated list of paths and yields (path, raw bytes) pairs
rdd = sc.binaryFiles(','.join(list_paths))
# decompress each file with the standard LZ4 frame format; str() turns the bytes into a "b'...'"
# representation, which the regexp cleaning further below strips off again
df = rdd.map(lambda x: lz4.frame.decompress(x[1])).map(lambda x: str(x)).map(lambda x: (x, )).toDF()
This is usually enough for non-complex objects. But if the compressed JSON you are parsing has nested structures, then it is necessary to do some extra cleaning on the parsed text before calling F.from_json():
schema = spark.read.json("/my/uncompressed_file.json").schema
df = df.select(F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace(F.regexp_replace("_1", "None", "null"), "False", "false"), "True", "true"), "b'", ""), "'", "").alias("json_notation"))
result_df = df.select(F.from_json("json_notation", schema).alias("data"))
where /my/uncompressed_file.json is /my/file.json.lz4 after you have decompressed it once, used here only to infer the schema (alternatively, if the schema is not too complex, you can define it manually; it will work either way).
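Once the JSON string has been parsed into a struct column, you can flatten it back into ordinary columns. A minimal sketch, assuming the parsed struct was aliased as "data" as in the snippet above:
# expand the parsed struct into top-level columns
final_df = result_df.select("data.*")
final_df.show(truncate=False)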

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files totalling 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file into a dataframe after it has been downloaded, and then attempted to union that dataframe to a master dataframe, so that essentially I end up with one big dataframe containing all 88 JSON files.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the inferred data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is that unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, but a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output = output.unionByName(df_json, allowMissingColumns=True)
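To answer the question about explicitly setting the schema: yes, you can also pass the schema straight to the reader, so the field types never change between files. A minimal sketch, assuming a hypothetical glob path covering all 88 files and the my_schema defined above:
# read every file with a fixed schema, so type inference can never flip a field to boolean
df_all = spark.read.schema(my_schema).json("/path/to/json/files/*.json")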
I'm not sure this is exactly what you are looking for, but I hope it helps.

use dask to store larger-than-memory csv file(s) to hdf5 file

Task: read larger-than-memory CSV files, convert them to arrays, and store them in HDF5.
One simple way is to use pandas to read the files in chunks, but I wanted to use dask, so far without success:
Latest attempt:
import dask.dataframe as dd
fname = 'test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
dset.to_records().to_hdf5('/tmp/test.h5', '/x')
How could I do this?
Actually, I have a set of CSV files representing 2D slices of a 3D array that I would like to assemble and store. A suggestion on how to do the latter would be welcome as well.
Given the comments below, here is one of many variations I tried:
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr = da.Array( dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)
which results in the error:
KeyError: ('arr12345', 77, 0)
You will probably want to do something like the following. The real crux of the problem is that, in the read-csv case, dask doesn't know the number of rows of the data before a full load, and therefore the resulting dataframe has an unknown length (as is the usual case for dataframes). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.
Here is an example.
Data
0,1,2
2,3,4
Code
import dask.dataframe as dd
dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
arr = dset.astype('float').to_dask_array(True)
arr.to_hdf5('/test.h5', '/x')
Where "True" means "find the lengths", or you can supply your own set of values.
You should use the to_hdf method on dask dataframes instead of on dask arrays
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')
Alternatively, you might consider using parquet. This will be faster and is simpler in many ways
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')
For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html
For arrays
If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks . I don't recommend this approach, though; it's needlessly complex.
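For completeness, a minimal sketch of that array route, assuming a dask version where to_dask_array(lengths=True) is available to resolve the unknown chunk sizes:
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
arr = df.to_dask_array(lengths=True)  # computes the per-partition row counts
arr.to_hdf5('myfile.hdf5', '/data')   # chunk sizes are now known, so writing to HDF5 works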

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support the Array-of-JSON file format for input, so I'm stuck. Also, I have "many" such files, each with a different schema containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the code snippet below:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading it in as a text file, stripping the array characters, splitting on the JSON object boundaries, and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str(ings).
I found a way through the above and converted it to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the Unicode byte-order marker and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and get a distributed dataset for more efficient processing. The output dataframe has columns named after the JSON elements, infers the schema, and dynamically adapts to other file formats.
Here is the code - hope it helps!:
# ...Spark considers arrays of JSON objects to be an invalid format,
# and Unicode files are prefixed with a byte-order marker
#
# 'partitions' is the desired minimum number of partitions for the text file
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
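If you are on Spark 2.2 or later, the built-in JSON reader can also parse a whole file as a single JSON document (including a top-level array) via the multiLine option; note that each file is then read in one piece, which may not suit a ~110 GB input. A minimal sketch:
# Spark 2.2+: parse each input file as one JSON document, including top-level arrays
df = spark.read.option("multiLine", True).json('/a/valid/file/path')
df.printSchema()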

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in Parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it); however, there are around 5k files, which I do have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DF out of the transformed JSON, and dump it as Parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
    # each executor fetches and parses its own object from S3
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)
pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
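From there, to finish the transform-and-export step the question asks about, you can turn the RDD of Rows into a DataFrame and write it out as Parquet. A minimal sketch, assuming a Spark 2 SparkSession named spark and a hypothetical output path:
# build a DataFrame from the Row RDD, apply any transformations, then export as Parquet
df = spark.createDataFrame(dataRdd)
df.write.mode("overwrite").parquet("s3a://bucketName/output/transformed.parquet")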
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = spark.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations themselves if you want a 1:1 mapping of JSON to Parquet files. Spark's read/write functionality is designed to be called from the driver, and needs access to sc and sqlContext. This is another reason why having one Parquet directory is likely the way to go.
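If you can live without the 1:1 mapping, the merged approach is straightforward: let Spark read all of the keys in one distributed job and write a single Parquet dataset. A sketch, with hypothetical bucket and output paths:
# read all ~5k gzipped JSON files at once and write one merged Parquet dataset
paths = ['s3a://bucketName/' + k for k in keyList]
merged = spark.read.json(paths)
merged.write.mode("overwrite").parquet("s3a://bucketName/output/merged.parquet")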

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's Data Challenge dataset; however, I cannot open it since it is in JSON format and almost 2 GB. On its website it is said that the dataset can be opened in Python using mrjob, but I am also not very good with programming. I searched online and looked at some of the code Yelp provides on GitHub, but I couldn't find an article or anything that clearly explains how to open the dataset.
Can you please tell me step by step how to open this file and maybe how to convert it to CSV?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format; when you extract it, you get another file. Rename that to .tar and extract it again, and you will get all the JSON files.
Yes, you can use pandas. Take a look:
import pandas as pd
# read the entire file into a python list of lines (text mode, so each line is a str)
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = [x.rstrip() for x in data]
# the file has one JSON object per line, so wrap the lines in a JSON array
data_json_str = "[" + ','.join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
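A shorter route, assuming a reasonably recent pandas version, is to let pandas parse the line-delimited JSON directly, optionally in chunks to limit memory use, and append each chunk to a CSV (the output file name here is just an example):
# lines=True parses one JSON object per line; chunksize keeps memory usage bounded
reader = pd.read_json('yelp_academic_dataset_review.json', lines=True, chunksize=100000)
for i, chunk in enumerate(reader):
    chunk.to_csv('yelp_reviews.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)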
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it can help you
To process huge JSON files, use a streaming parser.
Many of these files aren't a single JSON document, but a stream of JSON objects, one per line (the JSON Lines format). A regular JSON parser will then consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and write them to the desired output, then continue reading.
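A minimal sketch of that streaming approach in Python, assuming the Yelp review file and a few hypothetical fields of interest:
import csv
import json
# stream the file line by line: parse each JSON object, keep only the fields we care about
with open('yelp_academic_dataset_review.json', 'r') as src, open('reviews.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['business_id', 'stars', 'text'])  # hypothetical columns of interest
    for line in src:
        record = json.loads(line)
        writer.writerow([record.get('business_id'), record.get('stars'), record.get('text')])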
There is no single json-to-csv conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that a JSON document is a tree, but a CSV is not. There is no universal and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding to become a better programmer. To succeed with such amounts of data, you need to become a better programmer.