Streaming very large JSON directly from a POST request in Django

I'm looking for efficient ways to handle very large JSON files (possibly several GB in size, amounting to a few million JSON objects) in requests to a Django 2.0 server (using Django REST Framework). Each row needs to go through some processing and then get saved to a DB.
The biggest pain point so far is the sheer memory consumption of the file itself, plus the fact that memory usage keeps climbing while the data is being processed in Django, with no way to release it manually.
Is there a recommended way of processing very large JSON files in requests to a Django app without blowing up memory consumption? Can this be combined with compression (gzip)? I'm thinking of uploading the JSON to the API as a regular file upload, streaming it to disk, and then streaming from the file on disk using ijson or similar. Is there a more straightforward way?
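
If you go the file-upload route described above, something like the following minimal sketch could work. It is only a sketch: the view, the MyRecord model, the "data" upload field and the batch size are hypothetical, and ijson is a third-party package that has to be installed separately.

```python
# Minimal sketch: accept the payload as a file upload (so Django spools large
# uploads to a temporary file on disk instead of keeping them in memory) and
# parse it incrementally with ijson. MyRecord and the field names are hypothetical.
import gzip
import ijson

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from myapp.models import MyRecord  # hypothetical model


@csrf_exempt
def bulk_upload(request):
    uploaded = request.FILES["data"]  # UploadedFile; large uploads live on disk

    # Optionally accept gzip-compressed uploads to cut transfer size.
    stream = (
        gzip.GzipFile(fileobj=uploaded)
        if uploaded.name.endswith(".gz")
        else uploaded
    )

    batch, saved = [], 0
    # Assumes the payload is a top-level JSON array of objects; ijson yields
    # one element at a time without materialising the whole document.
    for obj in ijson.items(stream, "item"):
        batch.append(MyRecord(name=obj["name"], value=obj["value"]))
        if len(batch) >= 1000:
            MyRecord.objects.bulk_create(batch)
            saved += len(batch)
            batch.clear()
    if batch:
        MyRecord.objects.bulk_create(batch)
        saved += len(batch)

    return JsonResponse({"saved": saved})
```

Because rows are created in fixed-size batches with bulk_create and the parser never holds more than one object at a time, peak memory should stay roughly constant regardless of the upload size.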

Related

Converting JSON .gz files into Delta Tables

I have Datadog log archives streaming to an Azure Blob store; each archive is a single 150MB JSON file compressed into a 15MB .gz file, and a new one is generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective solution to get this data into Delta Lake?
From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Has anyone done this successfully without breaking the bank?
Yes, that's the big downside of the gzip format: it is not splittable and therefore the work cannot be distributed across all your workers and cores; the driver has to load a file in its entirety and decompress it in a single batch.
The only sensible workaround I've used myself is to give the driver only a few cores, but as powerful as possible. Since you are using Azure Blob, I assume you are running Databricks on Azure as well; look through the available Azure VM types and pick the one with the fastest cores.
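
For reference, a minimal PySpark sketch of that route on Databricks (the storage path and target table name are hypothetical; it assumes each .gz archive decompresses to a single JSON document, as described in the question):

```python
# Minimal sketch: read the gzipped Datadog archives and append them to a Delta
# table. Spark decompresses .gz JSON transparently, but each archive still has
# to be decompressed in one piece because gzip is not splittable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.read
    .option("multiLine", "true")  # each archive is one JSON document; drop this
                                  # if the archives are newline-delimited JSON
    .json("wasbs://logs@myaccount.blob.core.windows.net/datadog/*.json.gz")
)

(
    raw.write
    .format("delta")
    .mode("append")
    .saveAsTable("datadog_logs")  # hypothetical target table
)
```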

How do I measure JSON property reading performance?

I have a static dataset inside a react-native app. There are 3 JSON files: 500 KB, 2 MB and 10 MB.
Each of them contains a JSON object with ~8000 keys.
I'm concerned this will cause performance issues. At the moment I'm using Redux Toolkit state mainly for IDs and reading data through selectors inside the Redux connect method. In some cases I have to access those JSON files sequentially ~100 times in a row, pulling certain properties by key (no array find or anything like that).
To be honest, I still don't know whether 10 MB is a significant amount at all.
How do I measure performance, and where can I read more about using large JSON files in react-native apps?

Angular 4 app - Client Caching for large data

Question: What is the best approach for implementing client-side caching of very large data? I am using Angular 4 with ASP.NET Web API 2.
Problem: I am developing a web analytics tool (supporting mobile browsers as well) which generates ECharts metrics based on JSON data returned from the ASP.NET Web API 2 backend. The page has filters and chart events which recalculate chart measures on the client side from the same JSON data. To optimize speed, I store the JSON data (minified) in the browser's localStorage, which avoids the frequent API calls otherwise made on every filter change and chart event. The JSON data is refreshed from the server every 20 minutes, as I have set an expiry on each JSON entry saved in localStorage.
The problem is that localStorage has a size limit of 10 MB, and the above solution does not work when the JSON data (spread across multiple localStorage keys) exceeds 10 MB.
Since my data size can vary and can exceed 10 MB, what is the best approach to cache such data, given that the same data is reused to recalculate measures for the metrics without making Web API calls?
I have thought about the options below, but not implemented them yet:
a) Client-side in-memory caching (may cause performance issues for users with less memory).
b) Storing the JSON data in a JavaScript variable and using it directly.
Please let me know a better solution for a large client-side cache.

Process and query a large number of large files in JSON Lines format

Which technology would be best for importing a large number of large JSON Lines files (approx. 2 GB per file)?
I am thinking about Solr.
Once the data is imported, it will have to be queryable.
Which technology would you suggest for importing and then querying JSON Lines data in a timely manner?
You can start prototyping with whatever scripting language you prefer: read the lines, massage the format as needed to get valid Solr JSON, and send it to Solr via HTTP. That would be the fastest way to get going.
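
As an illustration, here is a minimal sketch of that scripting prototype in Python (the Solr URL, collection name, batch size and field mapping are all hypothetical; it uses the requests package):

```python
# Minimal sketch: stream a JSON Lines file, reshape each line into a Solr
# document, and POST batches to Solr's JSON update endpoint.
import json

import requests

# commitWithin asks Solr to commit within 10s rather than committing per batch.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commitWithin=10000"
BATCH_SIZE = 1000


def to_solr_doc(record: dict) -> dict:
    # Massage the raw record into valid Solr JSON as needed
    # (rename fields, flatten nested structures, ...). Pass-through here.
    return record


def send_batch(docs: list) -> None:
    resp = requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()


def ingest(path: str) -> None:
    batch = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # one JSON object per line (JSON Lines)
            if not line.strip():
                continue
            batch.append(to_solr_doc(json.loads(line)))
            if len(batch) >= BATCH_SIZE:
                send_batch(batch)
                batch = []
    if batch:
        send_batch(batch)


if __name__ == "__main__":
    ingest("data.jsonl")  # hypothetical input file
```

Once that works, the same loop parallelises naturally across files or line ranges with multiple processes, which is where the SolrJ points below come in.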
Longer term, SolrJ will allow you to get maximum performance (if you need it), as you can:
hit the leader replica in a SolrCloud environment directly
use multiple threads to ingest and send docs (you can also use multiple processes). Not that this is hard or impossible with other technologies, but with some of them it is.
you have the full flexibility of the entire SolrJ API

BLOBs, Streams, Byte Arrays and WCF

I'm working on an image processing service that has two layers. The top layer is a REST-based WCF service that takes the image upload, processes it, and then saves it to the file system. Since my top layer doesn't have any direct database access (by design), I need to pass the image to my application layer (WSHttpBinding WCF), which does have database access. As it stands right now, the images can be up to 2MB in size, and I'm trying to figure out the best way to transport the data across the wire.
I'm currently sending the image data as a byte array, and the object will have to be held in memory at least temporarily in order to be written out to the database (in this case, a MySQL server), so I don't know whether using a Stream would help eliminate the potential memory issues, or whether I'm going to have to deal with filling up memory no matter what I do. Or am I just overthinking this?
Check out the Streaming Data section of this MSDN article: Large Data and Streaming
I've used the exact method described to successfully upload large documents and even stream video content from a WCF service. The keys are passing a Stream object in the message contract and setting the transferMode to Streaming in both the client and service configuration.
I saw this post regarding efficiently pushing that stream into MySQL; hopefully it gets you pointed in the right direction.