Storing GBs of JSON-like data in MongoDB

I'm using MongoDB because Meteor doesn't officially support anything else.
The main goal is to upload CSV files, parse them in Meteor and import the data to database.
The inserted data size can be 50-60 GB or maybe more per file, but I can't even insert anything bigger than 16 MB due to the document size limit. Also, even 1/10 of the insertion takes a lot of time.
I'm using CollectionFS for CSV file upload on the client. Therefore, I tried using CollectionFS for the data itself as well, but it gives me an "unsupported data" error.
What can I do about this?
Edit: Since my question has created confusion about data storage techniques, I want to make something clear: I'm not interested in uploading the CSV file; I'm interested in storing the data in the file. I want to collect all users' data in one place, and I want to fetch that data using as few resources as possible.

You could insert the csv file as a collection (the filename can become the collection name), with each row of the csv as a document. This will get around the 16 MB per document size limit. You may end up with a lot of collections, but that is okay. Another collection could keep track of the filename to collection name mapping.
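The row-per-document import described above can be sketched like this (shown in Python for brevity; in Meteor you would do the equivalent in JavaScript, and the collection name, file name, and batch size here are illustrative). The actual MongoDB insert is left as a commented pymongo call, since it needs a running server:

```python
import csv

def rows_in_batches(csv_path, batch_size=1000):
    """Yield CSV rows as dicts in batches, so each row becomes one
    small document instead of one giant document over the 16 MB limit."""
    batch = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch

# with pymongo (assumed driver), each uploaded file maps to its own collection:
# for batch in rows_in_batches("users.csv"):
#     db["users_csv"].insert_many(batch)
```

Batched inserts also address the speed complaint: one `insert_many` per thousand rows is far cheaper than one round trip per row.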

In CollectionFS you can save files directly to the filesystem; just add the proper package and create your collection like this:
Csv = new FS.Collection("csv", {
  stores: [
    new FS.Store.FileSystem("csv", "/home/username/csv")
  ],
  filter: {
    allow: {
      extensions: ['csv']
    }
  }
});

Related

How to read 100GB of Nested json in pyspark on Databricks

There is a nested JSON with a very deep structure. The file is in json.gz format and is 3.5 GB; once uncompressed it is 100 GB.
The JSON file is multiline (only when it is read with multiline=True via spark.read.json do we get to see the proper JSON schema).
Also, this file has a single record, in which there are two columns of struct-type arrays with multilevel nesting.
How should I read this file and extract the information? What kind of cluster / technique should I use to extract the relevant data from it?
Structure of the JSON (multiline)
This is a single record, and the entire data is present in 2 columns - in_netxxxx and provider_xxxxx.
I was able to achieve this in a slightly different way.
I used the utility Big Text File Splitter from Withdata Software (withdata.com), since the file was huge and nested on multiple levels. I kept the split record size at 500, which generated around 24 split files of about 3 GB each. The entire process took 30-40 minutes.
I read each split file using the option below, which drops the _corrupt_record rows and also removes the null rows:
spark.read.option("mode", "DROPMALFORMED").json(file_path)
I processed the _corrupt_record rows separately and populated the required information.
Once the information is fetched from each file, all the files can be merged into a single file, as per the standard process.
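The splitting step can also be done with a short script instead of the GUI tool. A minimal sketch that streams a large .gz file and writes fixed-size chunks without ever holding the whole file in memory (file names and chunk size are illustrative):

```python
import gzip
import os

def split_by_lines(src_gz, out_dir, lines_per_chunk=500):
    """Stream a large .gz text file and write chunks of N lines to
    part_000.json, part_001.json, ... without loading it all at once."""
    os.makedirs(out_dir, exist_ok=True)
    parts, buf, idx = [], [], 0

    def flush():
        nonlocal buf, idx
        path = os.path.join(out_dir, f"part_{idx:03d}.json")
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(buf)
        parts.append(path)
        buf, idx = [], idx + 1

    with gzip.open(src_gz, "rt", encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= lines_per_chunk:
                flush()
    if buf:  # flush the remainder
        flush()
    return parts
```

Each split file can then be handed to spark.read.json individually, as in the answer above.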

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get the next "paginated" file.
In total I have to call and download 88 JSON files totalling 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a dataframe, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big dataframe with all 88 JSON files read in.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is that unionByName then fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, but a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True.
Something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for; I hope it helps.
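The string-to-boolean flip can also be neutralised before the data ever reaches the union, by coercing the field to one type in every file. A stdlib-only sketch (the "active" field name is hypothetical):

```python
import json

def coerce_field_to_str(records, field):
    """Force a field to str in every record so later union/merge steps
    see one consistent type; missing or None values are left untouched."""
    for rec in records:
        if field in rec and rec[field] is not None:
            if isinstance(rec[field], bool):
                rec[field] = str(rec[field]).lower()  # True -> "true"
            else:
                rec[field] = str(rec[field])
    return records

# file 1 stores "active" as a string, file 66 as a boolean
file_a = json.loads('[{"id": 1, "active": "true"}]')
file_b = json.loads('[{"id": 2, "active": true}]')
merged = coerce_field_to_str(file_a, "active") + coerce_field_to_str(file_b, "active")
```

Inside Spark itself, passing the predefined schema at read time with spark.read.schema(my_schema).json(...) achieves the same consistency, which answers the asker's last question directly.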

Handling big JSONs in Azure Data Factory

I'm trying to use ADF for the following scenario:
a JSON is uploaded to a Azure Storage Blob, containing an array of similar objects
this JSON is read by ADF with a Lookup Activity and uploaded via a Web Activity to an external sink
I cannot use the Copy Activity, because I need to create a JSON payload for the Web Activity, so I have to lookup the array and paste it like this (payload of the Web Activity):
{
    "some field": "value",
    "some more fields": "value",
    ...
    "items": #{activity('GetJsonLookupActivity').output.value}
}
The Lookup activity has a known limitation of an upper limit of 5000 rows at a time. If the JSON is larger, only 5000 top rows will be read and all else will be ignored.
I know this, so I have a system that chops payloads into chunks of 5000 rows before uploading to storage. But I'm not the only user, so there's a valid concern that someone else will try uploading bigger files and the pipeline will silently pass with a partial upload, while the user would obviously expect all rows to be uploaded.
I've come up with two concepts for a workaround, but I don't see how to implement either:
Is there any way for me to check if the JSON file is too large and fail the pipeline if so? The Lookup Activity doesn't seem to allow row counting, and the Get Metadata Activity only returns the size in bytes.
Alternatively, the MSDN docs propose a workaround of copying data in a foreach loop. But I cannot figure out how I'd use Lookup to first get rows 1-5000 and then 5001-10000 etc. from a JSON. It's easy enough with SQL using OFFSET N FETCH NEXT 5000 ROWS ONLY, but how to do it with a JSON?
You can't set an index range (1-5,000, 5,001-10,000) when you use the Lookup Activity. In my opinion, the workaround mentioned in the doc doesn't mean you can use the Lookup Activity with pagination.
My workaround is to write an Azure Function that gets the total length of the JSON array before the data transfer. Inside the Azure Function, divide the data into temporary sub-files with pagination, like sub1.json, sub2.json, ..., then output an array containing the file names.
Grab that array with a ForEach Activity and execute the Lookup Activity inside the loop, with the file path set as a dynamic value. Then run the next Web Activity.
Surely my idea could be improved. For example, if you get the total length of the JSON array and it is under the 5000-row limitation, you could just return {"NeedIterate": false}. Evaluate that response with an If Condition Activity to decide which branch to take next: if the value is false, execute the Lookup Activity directly. Everything can be divided between the two branches.
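The splitting step inside the Azure Function could look like this (a sketch only; the sub1.json naming follows the answer, while paths and chunk size are illustrative, and the blob-storage I/O is replaced by local files):

```python
import json
import os

def split_json_array(src_path, out_dir, chunk_size=5000):
    """Split a JSON file containing one big array into sub1.json,
    sub2.json, ... of at most chunk_size items each, and return the
    generated file names for the ForEach Activity to iterate over."""
    with open(src_path, encoding="utf-8") as f:
        items = json.load(f)
    os.makedirs(out_dir, exist_ok=True)
    names = []
    for i in range(0, len(items), chunk_size):
        name = f"sub{i // chunk_size + 1}.json"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as out:
            json.dump(items[i:i + chunk_size], out)
        names.append(name)
    return names
```

Returning the name list (rather than the data) keeps the function's response small, and each sub-file stays under the Lookup Activity's 5000-row limit by construction.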

What does it mean when a KinematicBody2D is stored in a JSON file - Godot?

After writing a few values for saving into my JSON file in Godot, I saved the information in a variable called LData and it is working. LData looks like this:
{
    "ingredients": [
        "[KinematicBody2D:1370]"
    ],
    "collected": [
        {
            "iname": "Pineapple",
            "collected": true
        },
        {
            "iname": "Banana",
            "collected": false
        }
    ]
}
What does it mean when the file says KinematicBody2D:1370? I understand that it is saving the node in the file - or is it just saving a string? Is it saving the node's properties as well?
Then I tried retrieving a variable that is assigned to the saved KinematicBody2D.
Code:
for ingredient in LData.ingredients:
    print(ingredient.iname)
Error:
Invalid get index name 'iname' (on base: 'String')
I am assuming that the data is stored as a String and I need to put some code to get the exact node it saved. Using get_node is also throwing an error.
Code:
for ingredient in LData.ingredients:
    print(get_node(ingredient).iname)
Error:
Invalid get index 'iname' (on base: 'null instance')
What information is it exactly storing when it says [KinematicBody2D:1370]? How do I access the variable iname and any other variables that are assigned to the node when the game is loaded and are not changed through the entire game?
[KinematicBody2D:1370] is just the string representation of a Node, which comes from Object.to_string:
Returns a String representing the object. If not overridden, defaults to "[ClassName:RID]".
If you truly want to serialize an entire Object, you could use Marshalls.variant_to_base64 and put that string in your json file. However, this will likely bloat your json file and contain much more information than you actually need to save a game. Do you really need to save an entire KinematicBody, or can you figure out the few properties that need to be saved (position, type of object, etc.) and reconstruct the rest at runtime?
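The "save only the properties you need" idea can be sketched as follows (shown in Python for brevity; in Godot you would write the same thing in GDScript, and the property names here are illustrative):

```python
import json

def serialize_ingredient(iname, position):
    """Store only the properties needed to rebuild the node at load
    time, instead of the node object itself (which would otherwise be
    flattened into a useless "[KinematicBody2D:1370]" string)."""
    return {"iname": iname, "x": position[0], "y": position[1]}

saved = json.dumps([serialize_ingredient("Pineapple", (120, 48))])
loaded = json.loads(saved)
# at load time, instantiate a fresh KinematicBody2D and apply these values
```

The load path then walks this list, instantiates the right scene for each entry, and sets its position, which also makes the saved iname directly accessible again.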
You can also save objects as Resources, which is more powerful and flexible than a JSON file, but tends to be better suited to game assets than save games. However, you could read the Resource docs and see if saving Resources seems like a more appropriate solution to you.

What is the efficient way of reading and processing very large CSV file in scala (> 1GB)?

In Scala, how do you efficiently (memory consumption + performance) read a very large CSV file? Is it fast enough to just stream it line by line and process each line at each iteration?
What I need to do with the CSV data:
In my application a single line in the CSV file is treated as one record, and all the records of the CSV file are to be converted into XML elements and JSON format and saved into another file in XML and JSON formats.
So the question is: while reading the file from CSV, is it a good idea to read the file in chunks and provide each chunk to another thread which will convert the CSV records into XML/JSON and write them to a file? If yes, how?
The data in the CSV can be anything; there is no restriction on the type - it can be numeric, big decimal, string or date. Is there any easy way to handle these different data types before saving to XML, or do we not need to take care of types?
Many thanks.
If this is not a one-time task, create a program that will break this 1 GB file into smaller files. Then provide those new files as input to separate futures.
Each future will read one file, and they resolve in the order of the file content: File4 resolves after File3, which resolves after File2, which resolves after File1.
As the file has no key-value pairs or hierarchical data structure, I suggest just reading each line as a string.
Hope this helps.
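The line-by-line streaming approach from the question can be sketched as follows (shown in Python for brevity; in Scala, scala.io.Source.getLines streams the same way). Because one record is written out as soon as it is read, memory use stays constant regardless of file size, and no type handling is needed since every field is kept as a string:

```python
import csv
import json

def csv_to_json_lines(src_path, dest_path):
    """Stream a CSV row by row and write one JSON object per line,
    so memory use stays constant regardless of file size."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for row in csv.DictReader(src):
            dest.write(json.dumps(row) + "\n")
```

The XML output would follow the same pattern with a second writer; handing whole chunks to worker threads only pays off if the per-record conversion is expensive, since the read itself is I/O-bound.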