How to read 100GB of nested JSON in PySpark on Databricks

There is a nested JSON file with a very deep structure. The file is a json.gz of size 3.5GB; once uncompressed it is around 100GB.
The JSON is multiline (only when the file is read via spark.read.json with multiLine=True do we get to see the proper schema).
Also, the file contains a single record with two columns of array-of-struct type, with multilevel nesting.
How should I read this file and extract the information? What kind of cluster / technique should be used to extract the relevant data from this file?
Structure of the JSON (multiline)
This is a single record, and the entire data is present in 2 columns - in_netxxxx and provider_xxxxx

I was able to achieve this in a slightly different way.
Used the Big Text File Splitter utility from Withdata Software (https://www.withdata.com/big-text-file-splitter), since the file was huge and nested over multiple levels. I kept the split record size at 500, which generated around 24 split files of around 3GB each. The entire process took 30-40 minutes.
Processed the _corrupt_record separately and populated the required information.
Read each split file using the option below; it drops the _corrupt_record rows and also the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched from each file, we can merge all the outputs into a single file, as per the standard process.
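A minimal sketch of the read-and-merge step, assuming the split files live under a hypothetical /mnt/splits directory and Spark 3.1+ (for allowMissingColumns):
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations of the ~24 split files produced by the splitter.
split_paths = [f"/mnt/splits/part_{i}.json" for i in range(1, 25)]

# Read each split, dropping malformed and null rows, then union the results.
dfs = [spark.read.option("mode", "DROPMALFORMED").json(p) for p in split_paths]
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

# Persist the extracted data in a splittable columnar format for later queries.
merged.write.mode("overwrite").parquet("/mnt/output/extracted")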

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files totalling 758MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file into a dataframe after it has been downloaded, and then attempted to union that dataframe to a master dataframe, so that essentially I have one big dataframe with all 88 JSON files read in.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the inferred type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True.
Something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is exactly what you are looking for; I hope it helps.
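Alternatively, addressing the explicit-schema idea from the question: a minimal sketch that applies the same schema on every read, so the inferred type can never drift partway through the 88 files (the paths are illustrative):
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.getOrCreate()

# The same schema is applied to every file, so a mostly-null string column
# cannot be re-inferred as boolean once real values start appearing.
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

paths = [f"/mnt/api_dumps/page_{i}.json" for i in range(1, 89)]  # illustrative paths
dfs = [spark.read.schema(my_schema).json(p) for p in paths]
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)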

JSON variable indent for different entries

Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data, mostly definitions, controlling parameters, and things (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, a large chunk, which should be more machine-readable (no need for a human to gaze over it when opening the file).
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data will also be indented, adding a lot of whitespace and making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects.
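A minimal sketch of that approach, assuming small_description and big_chunk are ordinary JSON-serialisable Python objects:
import json

small_description = {"version": 1, "units": "ms", "notes": "human-readable metadata"}
big_chunk = list(range(10000))  # stand-in for the large machine-readable payload

with open("out.json", "w") as f:
    f.write('{"meta_data":\n')
    json.dump(small_description, f, indent=4)   # pretty-printed metadata
    f.write(',\n"actual_data":\n')
    json.dump(big_chunk, f)                     # compact, no indentation
    f.write('\n}')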
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See "How do I use the 'json' module to read in one JSON object at a time?" for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
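For illustration, a rough sketch of reading such a concatenated-documents file with ijson (assuming the file shown above is saved as data.json):
import ijson

with open("data.json", "rb") as f:
    # multiple_values=True lets ijson yield each top-level document in turn:
    # the {"next_item": ...} marker, then the payload that follows it, and so on.
    for doc in ijson.items(f, "", multiple_values=True):
        print(doc)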

NiFi - ConvertCSVtoAVRO - how to capture the failed records?

When converting CSV to AVRO I would like to output all the rejections to a file (let's say error.csv).
A rejection is usually caused by a wrong data type - e.g. when a "string" value appears in a "long" field.
I am trying to do it using the incompatible output, however instead of saving only the rows that failed to convert (2 in the example below), it saves the whole CSV file. Is it possible to somehow filter out only those records that failed to convert? (Does NiFi add some markers to these records, etc.?)
Both processors, RouteOnAttribute and RouteOnContent, route whole files. Does the "incompatible" leg of the flow somehow mark single records with something like an "error" attribute that is available after splitting the file into rows? I cannot find this in any doc.
I recommend using a SplitText processor upstream of ConvertCSVToAvro, if you can, so you are only converting one record at a time. You will also have a clear context for what the errors attribute refers to on any flowfiles sent to the incompatible output.
Sending the entire failed file to the incompatible relationship appears to be a purposeful choice. I assume it may be necessary if the CSV file is not well formed, especially with respect to records being neatly contained on one line (or properly escaped). If your data violates this assumption, SplitText might make things worse by creating a fragmented set of failed lines.

What is the efficient way of reading and processing very large CSV file in scala (> 1GB)?

In Scala, how do you efficiently (memory consumption + performance) read a very large CSV file? Is it fast enough to just stream it line by line and process each line at each iteration?
What I need to do with the CSV data:
In my application, a single line in the CSV file is treated as one record, and all the records of the CSV file are to be converted into XML elements and JSON format and saved into other files in XML and JSON formats.
So the question here is: while reading the CSV file, is it a good idea to read the file in chunks and hand each chunk to another thread which will convert those CSV records into XML/JSON and write them to a file? If yes, how?
The CSV data can be anything; there is no restriction on the type of the data: it can be numeric, big decimal, string or date. Is there any easy way to handle these different data types before saving to XML, or do we not need to take care of types?
Many Thanks
If this is not a one-time task, create a program that will break this 1GB file into smaller files. Then provide those new files as input to separate futures.
Each future will read one file and resolve in the order of the file content. File4 resolves after File3, which resolves after File2, which resolves after File1.
As the file has no key-value pairs or hierarchical data structure, I would suggest just reading the values as strings.
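Purely to illustrate the chunk-and-convert idea (sketched in Python rather than Scala; file names and record layout are hypothetical):
import csv
import json
from concurrent.futures import ProcessPoolExecutor

def convert_chunk(path: str) -> str:
    """Read one split CSV file and write its rows out as JSON lines."""
    out_path = path + ".json"
    with open(path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src):
            # Treat every field as a string; no per-type handling is done here.
            dst.write(json.dumps(row) + "\n")
    return out_path

if __name__ == "__main__":
    split_files = ["part1.csv", "part2.csv", "part3.csv", "part4.csv"]
    with ProcessPoolExecutor() as pool:
        # map() returns results in input order, matching the
        # "File4 resolves after File3..." ordering described above.
        for produced in pool.map(convert_chunk, split_files):
            print("wrote", produced)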
Hope this helps.

Storing GBs of JSON-like data in MongoDB

I'm using MongoDB because Meteor doesn't officially support anything else.
The main goal is to upload CSV files, parse them in Meteor and import the data to database.
The inserted data size can be 50-60GB or maybe more per file, but I can't even insert anything bigger than 16MB due to the document size limit. Also, even inserting 1/10 of the data takes a lot of time.
I'm using CollectionFS for CSV file upload on the client. Therefore, I tried using CollectionFS for the data itself as well but it gives me an "unsupported data" error.
What can I do about this?
Edit: Since my question creates some confusion about data-storage techniques, I want to make something clear: I'm not interested in uploading the CSV file itself; I'm interested in storing the data inside it. I want to collect all users' data in one place and I want to fetch the data with the lowest resource usage.
You could insert the CSV file as a collection (the filename can become the collection name), with each row of the CSV as a document. This will get around the 16MB per-document size limit. You may end up with a lot of collections, but that is okay. Another collection could keep track of the filename-to-collection-name mapping.
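A rough sketch of that idea using Python and pymongo (collection and field names are made up; a Meteor/JS implementation would follow the same pattern):
import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["meteor"]

def import_csv(path: str, collection_name: str) -> None:
    """Insert each CSV row as its own document, so no single document nears 16MB."""
    coll = db[collection_name]
    with open(path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) == 10_000:          # bulk insert to keep round-trips low
                coll.insert_many(batch)
                batch = []
        if batch:
            coll.insert_many(batch)
    # Record which collection holds which file.
    db["csv_files"].insert_one({"file": path, "collection": collection_name})

import_csv("users_2016.csv", "users_2016")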
In CollectionFS you can save files directly to the filesystem; just add the proper package and create your collection like this:
Csv = new FS.Collection("csv", {
  stores: [
    new FS.Store.FileSystem("csv", "/home/username/csv")
  ],
  filter: {
    allow: {
      extensions: ['csv']
    }
  }
});