I have a list containing millions of small records as dicts. Instead of serialising the entire thing to a single file as JSON, I would like to write each record to a separate file. Later I need to reconstitute the list from JSON deserialised from the files.
My goal isn't really minimising I/O so much as finding a general strategy for serialising individual collection elements to separate files concurrently or asynchronously. What's the most efficient way to accomplish this in Python 3.x or a similar high-level language?
For those looking for a modern Python-based solution supporting async/await, I found this neat package which does exactly what I'm looking for: https://pypi.org/project/aiofiles/. Specifically, I can do
import aiofiles, json
from typing import Iterable

async def json_reader(files: Iterable[str]):
    """An async generator that reads and parses JSON from a list of files."""
    for file in files:
        async with aiofiles.open(file) as f:
            contents = await f.read()
            yield json.loads(contents)
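For the writing side of the original question, the same package can fan records out to separate files concurrently. Here is a minimal sketch under the assumption that each record gets its own file named by index (record_0.json, record_1.json, ...):

import asyncio
import json

import aiofiles

async def write_record(index: int, record: dict) -> None:
    # One file per record; the index-based naming scheme is just an example.
    async with aiofiles.open(f"record_{index}.json", "w") as f:
        await f.write(json.dumps(record))

async def write_all(records: list) -> None:
    # Schedule every write concurrently and wait for all of them to finish.
    await asyncio.gather(*(write_record(i, r) for i, r in enumerate(records)))

# asyncio.run(write_all(my_records))

With millions of records you would probably want to bound the concurrency (for example with an asyncio.Semaphore) so the process doesn't exhaust its file descriptors.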
I have a large JSON file (an array with lots of objects) that I want to import into Firestore, with each object becoming a document. I am looking for the most efficient way to do it; any advice?
I have tried parsing and looping through some of the objects in the file, and for each object running let res = db.collection('mycoll').add(obj)
This works, is there a smarter way to do it?
I want to import these into firestore, and I want each object to become a document
According to the official documentation, you can use the managed export and import service, but only for data that is already in a Cloud Firestore database:
You can use the Cloud Firestore managed export and import service to recover from accidental deletion of data and to export data for offline processing.
If you only have a JSON file, then you need to write some code for that.
I have tried parsing and looping through some of the objects in the file and for each object run let res = db.collection('mycoll').add(obj)
That's the easiest way to do it. You can also add the writes to a batch so that they are committed atomically.
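For illustration, here is roughly what batched writes look like with the Python google-cloud-firestore client (the question uses the Node SDK, but the idea carries over); a single batch accepts at most 500 operations, so a large array has to be committed in chunks:

from google.cloud import firestore

db = firestore.Client()

def import_objects(objects, chunk_size=500):
    # Firestore batches are limited to 500 writes, so commit in chunks.
    for start in range(0, len(objects), chunk_size):
        batch = db.batch()
        for obj in objects[start:start + chunk_size]:
            ref = db.collection("mycoll").document()  # auto-generated document ID
            batch.set(ref, obj)
        batch.commit()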
This works, is there a smarter way to do it?
Once you have a database, use the import/export feature.
I've used JSON a number of times within AJAX requests to perform asynchronous writes and reads of a database. I've been trying to better understand JSON and its uses within different programming environments, and one of the questions I've been curious about is: what are the common use cases for JSON as an external file (rather than just as an object that is passed within AJAX requests)?
More specifically, what are some use cases in which a .json file would be better suited than simply using temporary JSON objects to pass between AJAX requests? Any insight on this would be much appreciated.
I am not that familiar with AJAX etc., but JSON is so popular that many programming languages support it - not just Java and related languages.
In itself JSON simply holds information - it's merely a format for storing data.
It can often be used to transfer data between languages. Personally, I also use JSON to store my objects in persistent storage and later rebuild the objects from their class definitions. For example, Google created GSON to easily turn objects into JSON and back. Very handy!
You should also think about: How do you transfer an object from one machine to another?
To sum it up: It's simple, it doesn't create massive overhead, it's even easy to read. And most important of all: So many tools offer JSON support.
Edit:
To show the simplicity of re-building from JSON, here's an example from my game:
public static Player fromJson(String json) {
    if (json != null && !json.isEmpty()) {
        return gson.fromJson(json, Player.class);
    }
    return new Player(); // no save game present, use the default constructor
}
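For comparison, the same save/rebuild round trip in Python needs nothing beyond the standard json module; a minimal sketch (the file name and default values are invented for illustration):

import json

SAVE_FILE = "player_save.json"  # hypothetical save location

def save_player(player: dict) -> None:
    # Serialise the object to a JSON file on disk.
    with open(SAVE_FILE, "w") as f:
        json.dump(player, f)

def load_player() -> dict:
    # Rebuild the object from JSON, or fall back to defaults.
    try:
        with open(SAVE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"name": "new player", "level": 1}  # no save game present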
I'm new to App Engine and am writing a REST API. I'm wondering if anyone has been in this dilemma before.
The data that I have is not a lot (3 to 4 pages), but it changes annually.
Option 1: Write the data as JSON and parse the JSON file every time a request comes in.
Option 2: Model the data as objects, put them in the datastore, and retrieve them whenever a request comes in.
Does anyone know the pros and cons of each of these methods, or any better solutions?
Of course the answer is it depends.
Here are some of the questions I'd ask myself to make a decision -
do you want to make the change to the data dependent on a code push?
is there sensitive information in the data that should not be checked in to a VCS?
what other parts of your system are dependent on this data?
how likely are your assumptions about the data to change, in terms of update frequency and size?
Assuming the data is small (<1MB) and there's no sensitive information in it, I'd start out loading the JSON file as it's the simplest solution.
You don't have to parse the data on each request, but you can parse it at the top level once and effectively treat it as a constant.
Something along these lines -
import os
import json
DATA_FILE = os.path.join(os.path.dirname(__file__), 'YOUR_DATA_FILE.json')
with open(DATA_FILE, 'r') as dataFile:
JSON_DATA = json.loads(dataFile.read())
You can then use JSON_DATA like a dictionary in your code.
awesome_data = JSON_DATA['data']['awesome']
In case you need to access the data in multiple places, you can move this into its own module (ex. config.py) and import JSON_DATA wherever you need it.
Ex. in main.py
from config import JSON_DATA
# do something w/ JSON_DATA
I am trying to parse JSON files and insert them into a SQL database. My parser worked perfectly fine as long as the files were small (less than 5 MB).
I am getting an "Out of memory exception" when trying to read the larger (> 5 MB) files.
if (System.IO.Directory.Exists(jsonFilePath))
{
    string[] files = System.IO.Directory.GetFiles(jsonFilePath);
    foreach (string s in files)
    {
        var jsonString = File.ReadAllText(s);
        fileName = System.IO.Path.GetFileName(s);
        ParseJSON(jsonString, fileName);
    }
}
I tried the JSONReader approach, but had no luck getting the entire JSON into a string or variable. Please advise.
Use 64-bit, and check RredCat's answer on a similar question:
Newtonsoft.Json - Out of memory exception while deserializing big object
Newtonsoft Json Performance Tips
Read the article by David Cox about tokenizing:
"The basic approach is to use a JsonTextReader object, which is part of the Json.NET library. A JsonTextReader reads a JSON file one token at a time. It, therefore, avoids the overhead of reading the entire file into a string. As tokens are read from the file, objects are created and pushed onto and off of a stack. When the end of the file is reached, the top of the stack contains one object — the top of a very big tree of objects corresponding to the objects in the original JSON file"
Parsing Big Records with Json.NET
The json file is too large to fit in memory, in any form.
You must use a JSON reader that accepts a filename or stream as input. It's not clear from your question which JSON Reader you are using. From which library?
If your JSON reader builds the whole JSON tree, you will still run out of memory. As you read the JSON file, either cherry pick the data you are looking for, or write data structures to another on-disk format that can be easily queried, for example, an sqlite database.
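Since most of the examples in this thread are Python, here is what the same streaming, cherry-picking approach can look like there with the ijson package (assuming the file is one top-level JSON array; big.json and insert_into_db are placeholders):

import ijson

with open("big.json", "rb") as f:
    # ijson yields one array element at a time instead of loading the whole file.
    for record in ijson.items(f, "item"):
        insert_into_db(record)  # hypothetical per-record handler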
I have an HDFS directory full of the following JSON file format:
https://www.hl7.org/fhir/bundle-transaction.json.html
What I am hoping to do is find an approach to flatten each individual file into one DataFrame record or RDD tuple. I have tried everything I could think of using read.json(), wholeTextFiles(), etc.
If anyone has any best practices advice or pointers, it would be sincerely appreciated.
Load via wholeTextFiles something like this:
sc.wholeTextFiles(...) //RDD[(FileName, JSON)]
.map(...processJSON...) //RDD[JsonObject]
Then, you can simply call the .toDF method so that it will infer the schema from your JsonObject.
As for the processJSON method, you could just use something like the Play JSON parser.
mapPartitions is used when having to deal with data that is structured in a way that different elements can be on different lines. I've worked with both JSON and XML using mapPartitions.
mapPartitions works on an entire block of data at a time, as opposed to a single element. While you should be able to use the DataFrameReader API with JSON, mapPartitions can definitely do what you'd like. I don't have the exact code to flatten a JSON file, but I'm sure you can figure it out. Just remember the output must be an iterable type.
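Since the thread's other runnable examples are Python, here is a minimal PySpark sketch of that mapPartitions idea (the HDFS path is a placeholder, and each file is assumed to hold one JSON bundle):

import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def parse_partition(files):
    # files is an iterator of (path, contents) pairs for this partition;
    # the output must itself be iterable, hence the generator.
    for _, contents in files:
        yield json.loads(contents)  # one flattened record per file

records = sc.wholeTextFiles("hdfs:///path/to/bundles").mapPartitions(parse_partition)
df = spark.read.json(records.map(json.dumps))  # infer a schema across the records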