How to push JSON objects separately from a JSON array to RabbitMQ?

I have an Apache NiFi process:

1. Select data from the database and store it in a flow file in JSON format.
2. Split the JSON array stored in the flow file into multiple flow files; for example, if the array contains 80,000 JSON objects (80,000 records in the database), 80,000 flow files are produced.
3. Manipulate each object using a Jolt transform.
4. Push these messages to RabbitMQ.

My problem is that processing this many flow files takes too much time, about 30 seconds excluding the SELECT execution time.
I want to make this process faster, and I think I can achieve that if I avoid splitting the JSON array into separate objects (multiple flow files), but I don't know how to do it, because I still want 80,000 separate messages published to RabbitMQ, not one message that is a JSON array containing 80,000 objects.
Is there any way to do this, or another solution that makes the process faster?
At this point please ignore hardware and focus only on the process logic, to optimize it and make it faster.
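For reference, the end state being asked for, one message per array element rather than one message for the whole array, looks roughly like the sketch below when done outside NiFi with Node's amqplib. The connection URL, queue name, and the records.json input are placeholders, not part of the original flow.

```typescript
import * as amqp from 'amqplib';
import { promises as fs } from 'fs';

// Publish each element of one JSON array as its own RabbitMQ message.
// 'records.json', the queue name, and the broker URL are placeholders.
async function publishEachObject(): Promise<void> {
  const records: unknown[] = JSON.parse(await fs.readFile('records.json', 'utf8'));

  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createConfirmChannel();
  await channel.assertQueue('records', { durable: true });

  for (const record of records) {
    // One message per array element, not one message for the whole array.
    channel.sendToQueue('records', Buffer.from(JSON.stringify(record)), {
      contentType: 'application/json',
    });
  }

  await channel.waitForConfirms(); // wait until the broker acknowledges everything
  await connection.close();
}

publishEachObject().catch(console.error);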

Related

Right tech choice to work with JSON data?

We are a data team working with our data producers to store and process infrastructure log data. Workers running on our clients' systems generate log data that is primarily in JSON format.
There is no defined structure to the JSON data, as it depends on multiple factors like the number of clusters run by the client, tokens generated, etc. There is some definite structure to the top-level JSON elements, which contain the metadata about where the logs are generated. The actual data can go into multiple levels of nesting with varying key-value pairs.
I want to build a system to ingest these logs, parse them, and present them in a way where engineers and PMs (product managers) can read the data for analytics use cases.
My initial plan is to set up a compute layer like Kinesis at the source and write parsing logic to store the outcome in S3. However, this would need prior knowledge of the JSON files themselves.
I would define a parser module to process the data based on the log type. For every incoming log, my compute layer (Kinesis?) directs processing to the corresponding parser module and emits the data into S3 (see the sketch below).
However, I am starting to explore whether a different storage engine (Elasticsearch etc.) would fit my use case better. I am wondering if anyone has run into such use cases and what you found helpful in solving the problem.
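A rough sketch of what that dispatch step could look like. The log type field, parser names, and output columns are all assumptions, since the question leaves the schema open:

```typescript
// Hypothetical dispatch step: the stable top-level metadata picks a parser,
// and unknown log types fall back to storing the raw payload as a string.
type RawLog = { logType: string; metadata: Record<string, unknown>; payload: unknown };
type ParsedRow = Record<string, unknown>;
type Parser = (log: RawLog) => ParsedRow[];

// Parser registry keyed by log type; names and fields are invented examples.
const parsers: Record<string, Parser> = {
  cluster_event: (log) => [{ ...log.metadata, event: JSON.stringify(log.payload) }],
  token_audit: (log) => [{ ...log.metadata, token_info: JSON.stringify(log.payload) }],
};

// Fallback keeps the nested, varying part as raw JSON so nothing is dropped.
const fallback: Parser = (log) => [{ ...log.metadata, raw: JSON.stringify(log.payload) }];

export function parseLog(log: RawLog): ParsedRow[] {
  return (parsers[log.logType] ?? fallback)(log);
}
```

The fallback path matters for this use case: since the nested structure varies per client, rows that no parser claims can still land in S3 as raw JSON and be reprocessed later once a parser exists.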

How do I measure JSON property reading performance?

I have a static dataset inside a react-native app. There are 3 JSON files: 500 KB, 2 MB, and 10 MB.
Each of them contains a JSON object with ~8,000 keys.
I'm concerned whether this will cause any performance issues. At the moment I'm using Redux Toolkit state, mainly for IDs, and reading data through selectors inside the Redux connect method. In some cases I have to access those JSON files sequentially ~100 times in a row, pulling certain properties by key (no array find or anything like that).
To be honest, I still don't know whether 10 MB is a significant amount at all.
How do I measure performance, and where do I read more about using large JSON files in react-native apps?
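One low-tech way to answer the "is this significant" question is to time it directly on a device. A minimal sketch, assuming the dataset is bundled and loaded with a static require; the file name and key selection are placeholders:

```typescript
// Minimal timing sketch; 'big-dataset.json' is a placeholder for the 10 MB file.
const bigData: Record<string, unknown> = require('./big-dataset.json');

// Time N lookups by key; plain property access on an already-parsed object
// is cheap, so most of the cost is the one-time require/JSON.parse above.
function timeLookups(keys: string[]): number {
  const start = Date.now(); // or performance.now() where available
  let found = 0;
  for (const key of keys) {
    if (bigData[key] !== undefined) found += 1;
  }
  const elapsed = Date.now() - start;
  console.log(`${keys.length} lookups, ${found} hits, ${elapsed} ms`);
  return elapsed;
}

timeLookups(Object.keys(bigData).slice(0, 100));
```

In practice the number worth watching on a low-end device tends to be the initial parse of the 10 MB file, not the ~100 key reads, so it is worth timing the require/import step separately as well.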

Should I use JSONField or FileField to store JSON data?

I am wondering how I should store my JSON data to get the best performance and scalability.
I have two options:
The first one would be to use a JSONField, which probably gives me an advantage in simplicity when it comes to performance and handling the data, since I don't have to get it out of a file each time.
My second option would be to store my JSON data in FileFields as JSON files. This seems like the best option, since the huge quantity of JSON wouldn't be stored in the database (only the location of the file). In my opinion it's the best option for scalability, but maybe not for user performance, since the file has to be read each time before the data is displayed in the template.
I would like to know if I am thinking about this reasonably: which of the two ways of storing JSON data lets me reuse it as fast as possible, without complicating the database or hurting scalability?
JSONField will obviously have good performance because of its indexing. A very good feature of it is native data access, which means that you don't have to load and parse the JSON before querying; you can query the model field directly. Now, since you have a huge amount of JSON data, it may seem that a file is the better option, but the file's only advantage is storage.
Quoting from some random article found via a Google search:
A Postgres JSON field takes almost 11% extra space compared to the JSON file on your file system; a test of a 268 MB file in a JSON field came to 233 MB (formatted JSON file).
Storing in a file has some cons, which include reading the file, parsing the JSON, and then querying, all of which is time-consuming since these are disk-based operations. Scalability will not be an issue with JSONField, although your DB size will be high, so moving the data might become tough for you.
So unless you have a shortage of database space, you should choose JSONField.
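To make the "query the field directly" point concrete, here is a minimal sketch that queries inside a Postgres jsonb column with raw SQL via node-postgres rather than the Django ORM; the table, column, and key names are invented for illustration, and in Django the same lookup would be expressed against the JSONField on the model.

```typescript
import { Client } from 'pg';

// Query inside a jsonb column directly, instead of loading a file and parsing it.
// 'documents', 'data', and 'status' are made-up names; the Django equivalent
// would be something like Document.objects.filter(data__status='active').
async function activeDocuments(): Promise<void> {
  const client = new Client({ connectionString: 'postgres://localhost/app' });
  await client.connect();

  const { rows } = await client.query(
    `SELECT id, data->>'title' AS title
       FROM documents
      WHERE data @> '{"status": "active"}'` // containment test evaluated in the database
  );
  console.log(rows);

  await client.end();
}

activeDocuments().catch(console.error);
```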

Accessing objects from JSON files on disk

I have ~500 JSON files on my disk that represent hotels all over the world, each around 30 MB; all objects have the same structure.
At certain points in my Spring server I need to get the information for a single hotel, say via its code (which is inside the JSON object).
The data is read-only, but I might get updates from the hotel providers at certain times, like extra JSON files or delta changes.
Now, I don't want to migrate my JSON files to a relational database, that's for sure, so I've been investigating the best solution to achieve what I want.
I tried Apache Drill, because querying straight from JSON files seemed like fewer headaches when dealing with the data. I did a directory query using Drill, something like:
SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474';
but this obviously does not seem to be the most efficient way, as it takes around 10 seconds to fetch a single hotel.
At the moment I'm trying out CouchDB, but I'm still learning it. Should I migrate all the hotels into a single document (makes a bit of sense to me)? Or should I consider each hotel a document?
I'm just looking for pointers on what a good solution would be, so I'm here to get your opinions.
The main issue here is that JSON files do not have indexes associated with them, and Drill does not create indexes for them. So whenever you run a query like SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474';, Drill has no choice but to read each JSON file and parse and process all the data in each file. The more files and data you have, the longer this query will take. If you need to do lookups like this often, I would suggest not using Drill for this use case. Some alternatives are:
A relational database where you have an index built for the code column.
A key value store where code is the key.
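If you do stay with plain files for now, one way to approximate the key-value idea in-process is to scan the files once at startup, keep an index from code to file, and parse only the matching file on lookup. This is just a sketch: it assumes each file holds an array of hotel objects with a code property, and the paths and names are illustrative.

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Assumed shape: each file contains an array of hotel objects with a 'code' field.
type Hotel = { code: string; [key: string]: unknown };

// One-time scan at startup: map each hotel code to the file that contains it.
async function buildCodeIndex(dir: string): Promise<Map<string, string>> {
  const index = new Map<string, string>();
  for (const file of await fs.readdir(dir)) {
    if (!file.endsWith('.json')) continue;
    const fullPath = path.join(dir, file);
    const hotels: Hotel[] = JSON.parse(await fs.readFile(fullPath, 'utf8'));
    for (const hotel of hotels) index.set(hotel.code, fullPath);
  }
  return index;
}

// Lookup parses only the one matching ~30 MB file instead of scanning all 500.
async function findHotel(index: Map<string, string>, code: string): Promise<Hotel | undefined> {
  const file = index.get(code);
  if (!file) return undefined;
  const hotels: Hotel[] = JSON.parse(await fs.readFile(file, 'utf8'));
  return hotels.find((hotel) => hotel.code === code);
}

// const index = await buildCodeIndex('C:/hotels');
// const hotel = await findHotel(index, '1b3474');
```

A proper key-value or document store (code as the key, one hotel per document in CouchDB terms) does the same thing with the index maintained for you and handles the delta updates more gracefully.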

Using streams to manipulate and then write data to CSV

I'm writing a script which runs a MySQL query that returns ~5 million results. I need to manipulate this data:
Formatting
Encoding
Convert some sections to JSON
etc
Then write that to a CSV file in the shortest amount of time possible.
Since the node-mysql module is able to handle streams, I figured that's my best route. However, I'm not familiar enough with writing my own streams to do this.
What would be the best way to stream the data from MySQL, through a formatting function, and into a CSV file?
We're in a position where we don't mind what order the data goes in, so long as it's written to a CSV file quickly.
https://github.com/wdavidw/node-csv seems to do what you need. It can read from streams, write to streams, and has built-in transformers.
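A rough sketch of how those pieces could fit together, using the mysql driver's query stream plus the stream-transform and csv-stringify parts of that project. The connection details, query, and column names are placeholders:

```typescript
import * as fs from 'fs';
import * as mysql from 'mysql';
import { transform } from 'stream-transform';
import { stringify } from 'csv-stringify';

// Stream rows out of MySQL, reshape each row, and pipe the result straight
// into a CSV file without buffering ~5 million rows in memory.
const connection = mysql.createConnection({ host: 'localhost', user: 'app', database: 'app' });

// query(...).stream() yields one row object at a time.
const rows = connection.query('SELECT id, payload, created_at FROM events').stream();

// Per-row formatting/encoding step; JSON-encode the nested section here.
const reshape = transform((row: any) => ({
  id: row.id,
  payload: JSON.stringify(row.payload),
  created_at: row.created_at,
}));

rows
  .pipe(reshape)
  .pipe(stringify({ header: true })) // objects -> CSV lines, with a header row
  .pipe(fs.createWriteStream('export.csv'))
  .on('finish', () => connection.end());
```

Because everything is piped, backpressure is handled for you: MySQL rows are only pulled as fast as the CSV file can be written, which keeps memory flat regardless of the result size.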