Performance issue with JSON Input

I am loading a MySQL table from a MongoDB source through Kettle.
The MongoDB collection has more than 4 million records, and when I run the Kettle job the first full load takes 17 hours to finish.
Even an incremental load takes more than an hour. I have tried increasing the commit size and giving more memory to the job, but performance is still not improving. I suspect the JSON Input step takes a very long time to parse the data and is the reason it is so slow.
I have these steps in my transformation:
MongoDB Input
JSON Input
Strings cut
If field value is null
Concat fields
Select values
Table output
The same 4 million records, when extracted from Postgres, loaded much faster than from MongoDB.
Is there a way I can improve the performance?
Please help me.
Thanks,
Deepthi

Run multiple copies of the step. It sounds like you have a MongoDB Input step followed by a JSON Input step that parses the JSON results, right? So use 4 or 8 copies of the JSON Input step (or more, depending on your CPUs) and it will speed up.
Alternatively, do you really need to parse the full JSON? Maybe you can extract the data you need with a regex or something.
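To make the regex suggestion concrete, here is a minimal Python sketch (outside Kettle, purely illustrative) comparing a full JSON parse with a regex extraction; the document shape and field names are invented, and the regex route only holds up if the fields you need are simple scalars in a stable format.

```python
import json
import re

# Hypothetical document: we only need two scalar fields out of a larger blob.
doc = '{"customer_id": "C123", "amount": 42.5, "payload": {"big": "nested blob"}}'

# Full parse: robust, but deserialises the whole document.
parsed = json.loads(doc)
full = (parsed["customer_id"], parsed["amount"])

# Regex: skips the parse entirely, at the cost of being brittle.
pattern = re.compile(r'"customer_id":\s*"([^"]+)".*?"amount":\s*([0-9.]+)')
m = pattern.search(doc)
quick = (m.group(1), float(m.group(2)))

assert full == quick
```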

Related

How to index a 1 billion row CSV file with Elasticsearch?

Imagine you had a large CSV file - let's say 1 billion rows.
You want each row in the file to become a document in elastic search.
You can't load the file into memory - it's too large, so it has to be streamed or chunked.
The time taken is not a problem. The priority is making sure ALL data gets indexed, with no missing data.
What do you think of this approach:
Part 1: Prepare the data
Loop over the CSV file in batches of 1k rows
For each batch, transform the rows into JSON and save them into a smaller file
You now have 1m files, each with 1000 lines of nice JSON
The filenames should be incrementing IDs. For example, running from 1.json to 1000000.json
Part 2: Upload the data
Start looping over each JSON file and reading it into memory
Use the bulk API to upload 1k documents at a time
Record the success/failure of the upload in a result array
Loop over the result array and if any upload failed, retry
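For what it's worth, here is a minimal sketch of this pipeline using the official Python client (assumptions: a local cluster, a made-up index name, and streaming straight from the CSV rather than writing the intermediate per-batch files). It keeps the 1k batch size and the per-document success/failure tracking described above.

```python
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # assumed local cluster
INDEX = "big_csv"                             # hypothetical index name

def actions(path):
    """Stream the CSV row by row so the file never has to fit in memory."""
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # A deterministic _id (the row number) means a re-run of the same
            # file overwrites documents instead of duplicating them.
            yield {"_index": INDEX, "_id": i, "_source": row}

failures = []
for ok, item in helpers.streaming_bulk(
    es,
    actions("huge.csv"),            # placeholder path
    chunk_size=1000,                # the 1k batch size from Part 2
    raise_on_error=False,           # collect failures instead of aborting
):
    if not ok:
        failures.append(item)       # retry these afterwards, as in the last step

print(f"{len(failures)} documents need a retry")
```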
The steps you've mentioned above look good. A couple of other things that will help make sure ES does not get overloaded:
From what I've experienced, you can also increase the bulk request size, say to somewhere in the range of 4k-7k documents (start with 7k and, if it causes pain, experiment with smaller batches, though going lower than 4k probably won't be needed).
Ensure refresh_interval is set to a very high value. This makes sure the documents are not refreshed (made searchable) too frequently during the load; the default value may also do. See the Elasticsearch index settings docs for details.
As the above comment suggests, it'd be better if you start with a smaller batch of data. Of course, if you use constants instead of hardcoding the values, your task just got easier.
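To make the refresh_interval point concrete, here is a hedged sketch against the plain REST API (cluster URL and index name are assumptions); disabling refresh for the duration of the load is one common way of setting it "very high", with the default restored afterwards.

```python
import requests

ES = "http://localhost:9200"   # assumed local cluster
INDEX = "big_csv"              # hypothetical index name

# Disable refresh while the bulk load runs...
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"refresh_interval": "-1"}})

# ... run the bulk upload here, with a chunk size in the 4k-7k range ...

# ...then restore the default interval and force a single refresh at the end.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"refresh_interval": "1s"}})
requests.post(f"{ES}/{INDEX}/_refresh")
```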

Storing large JSON data in Postgres is infeasible, so what are the alternatives?

I have large JSON data, greater than 2 kB, in each record of my table, and these are currently stored in a JSONB field.
My tech stack is Django and Postgres.
I don't perform any updates/modifications on this JSON data, but I do need to read it, frequently and fast. However, because the JSON values are larger than 2 kB, Postgres splits them into chunks and moves them into the TOAST table, and the read process has become very slow.
So what are the alternatives? Should I use another database, like MongoDB, to store these large JSON fields?
Note: I don't want to pull the keys out of this JSON and turn them into columns. This data comes from an API.
It is hard to answer specifically without knowing the details of your situation, but here are some things you may try:
1. Use Postgres 12 (stored) generated columns to maintain the fields or smaller JSON blobs that are commonly needed. This adds storage overhead, but frees you from having to maintain this duplication yourself (see the sketch after this answer).
2. Create indexes for any JSON fields you are querying (PostgreSQL allows you to create indexes on JSON expressions).
3. Use a composite index, where the first field in the index is the field you are querying on and the second field (or JSON expression) is the value you wish to retrieve. In this case PostgreSQL can return the value straight from the index (an index-only scan).
4. Similar to 1, create a materialised view which extracts the fields you need and allows you to query them quickly. You can add indexes to the materialised view too. This may be a good fit because materialised views can be slow to update, but in your case the data doesn't update anyway.
5. Investigate why the TOAST tables are slow. I'm not sure what performance you are seeing, but if you really do need to pull back a lot of data then you are going to need fast data access whatever database you choose.
Your mileage may vary with all of the above suggestions, especially as each will depend on your particular use case. (see the questions in my comment)
However, the overall idea is to use the tools that Postgresql provides to make your data quickly accessible. Yes this may involve pulling the data out of its original JSON blob, but this doesn't need to be done manually. Postgresql provides some great tools for this.
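As a rough illustration of suggestions 1 and 2, here is a sketch using psycopg2 and raw SQL (the events table, payload column and customer_id field are all invented):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")   # assumed connection string
with conn, conn.cursor() as cur:
    # Suggestion 1: a stored generated column (Postgres 12+) that keeps a
    # frequently-read field outside the big JSONB blob.
    cur.execute("""
        ALTER TABLE events
        ADD COLUMN IF NOT EXISTS customer_id text
        GENERATED ALWAYS AS (payload ->> 'customer_id') STORED
    """)
    # Suggestion 2: an expression index directly on the JSONB field,
    # useful when it appears in WHERE clauses.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS events_payload_customer_idx
        ON events ((payload ->> 'customer_id'))
    """)
```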
If you just need to store and read this JSON object in full, without using the JSON structure in your WHERE clauses, what about simply storing the data as binary in a bytea column? https://www.postgresql.org/docs/current/datatype-binary.html
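A minimal sketch of that idea (table and column names are invented; the zlib step is optional but cheap since the data is write-once, read-many):

```python
import json
import zlib

import psycopg2

conn = psycopg2.connect("dbname=app")                 # assumed connection string
record = {"api_response": {"lots": "of data"}}        # stand-in for the real payload

with conn, conn.cursor() as cur:
    # Write: serialise, optionally compress, store as bytea.
    blob = zlib.compress(json.dumps(record).encode())
    cur.execute("INSERT INTO api_blobs (id, blob) VALUES (%s, %s)",
                (1, psycopg2.Binary(blob)))
    # Read: fetch the bytes back and reverse the steps.
    cur.execute("SELECT blob FROM api_blobs WHERE id = %s", (1,))
    restored = json.loads(zlib.decompress(bytes(cur.fetchone()[0])))

assert restored == record
```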

Efficient way to load a large amount of data to a Vanilla JS frontend with a Flask-MySQL-SQLAlchemy Backend?

I have a Flask-based web server that takes in a lot of unique data points. While insertion into the DB is done asynchronously, getting data out of it is the harder part. For each request I am looking at an average of 250,000 rows of raw data that need to be displayed in graph form using plotly.js. Executing the raw query on the MySQL command line takes about 10 seconds to return the data.
However, since I am using SQLAlchemy as my ORM, there seems to be significant overhead on top of that. The extracted data then needs to be dumped to JSON to be sent to the frontend for display.
I understand that this situation has a lot of variables that can be changed but I am asking this question after about a week of trying to find solutions to this problem. Is the solution to throw hardware at it?
TL;DR: large amount of data in the backend (Flask, SQLAlchemy, MySQL); need to display it on the frontend after querying 250,000 records and converting them to JSON; any good solutions?
EDIT: I forgot to mention the size of the data. The JSON object that is sent is about 22.6 MB for 250,000 rows of SQL. The table this problem deals with has about 15 columns of floats, timestamps, varchars and integers. I'm happy to give any more information that might help.
It turns out that launching an async process that does this work in the background (and hoping that it works) is a perfectly good solution.
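A minimal sketch of what that can look like, combining a background thread with a Core-level query to avoid building 250,000 ORM objects; the connection string, table, columns and endpoints are all hypothetical, and in production the in-memory results dict would be something like Redis or a Celery result backend.

```python
import json
import threading
import uuid

from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("mysql+pymysql://user:pass@localhost/metrics")  # assumed DSN
results = {}   # job_id -> JSON string; use Redis or a task queue in production

def build_payload(job_id):
    # SQLAlchemy Core returns plain row tuples, skipping the per-object
    # overhead of mapping 250k rows to ORM instances.
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT ts, value FROM readings WHERE sensor_id = :sid"),
            {"sid": 42},
        )
        results[job_id] = json.dumps([[str(ts), value] for ts, value in rows])

@app.route("/graph-data")
def start_graph_data():
    job_id = str(uuid.uuid4())
    threading.Thread(target=build_payload, args=(job_id,), daemon=True).start()
    return jsonify({"job_id": job_id}), 202   # the frontend polls the URL below

@app.route("/graph-data/<job_id>")
def get_graph_data(job_id):
    if job_id not in results:
        return jsonify({"status": "pending"}), 202
    return app.response_class(results.pop(job_id), mimetype="application/json")
```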

Storing data in JSON vs MySQL rows

I am coding a small app and have a question: I am deciding whether storing data as JSON or as MySQL rows is best for my scenario.
The app may get lots of page hits, and because of that I am considering storing a JSON-encoded array in a single column vs. separate MySQL rows. One query will be faster to execute vs. two.
The problem I am trying to figure out is that I need to delete part of the JSON-encoded array, which means that on a delete request I have to fetch the entire value, JSON-decode it, unset the element and update the row again, vs. a simple MySQL row delete.
Is there a way, which maybe I don't know of, that would make this much easier to handle?
It probably depends on details you're not providing.
Do you have to use MySQL? If you're fetching a JSON object, and then modifying it and storing it back again, MongoDB seems faster for that use case.
One query will be faster to execute VS 2.
You don't need more than one query to return several rows; the query might return more rows, but looping over results and serialising/deserialising JSON are both negligible costs compared to everything else your site will have to do. Don't read too much into this.
As a rule of thumb with a relational database, normalise the data until you see performance issues. If you're set on using MySQL, you probably want many rows. As your dataset grows, the most straightforward way to improve query performance will be to add indexes, and you won't be able to do that inside a JSON blob.
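To illustrate the difference this answer is pointing at, here is a small sketch of both delete paths (table and column names are invented; any MySQL driver works, pymysql is just an example):

```python
import json
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="app")

def delete_item_normalised(cur, user_id, item_id):
    # One row per item: removing part of the data is a single indexed DELETE.
    cur.execute("DELETE FROM items WHERE user_id = %s AND id = %s",
                (user_id, item_id))

def delete_item_from_blob(cur, user_id, item_id):
    # One JSON blob per user: the whole array is read, decoded, filtered,
    # re-encoded and written back (the pattern the question describes).
    cur.execute("SELECT data FROM user_items WHERE user_id = %s", (user_id,))
    items = json.loads(cur.fetchone()[0])
    items = [it for it in items if it["id"] != item_id]
    cur.execute("UPDATE user_items SET data = %s WHERE user_id = %s",
                (json.dumps(items), user_id))
```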

Storing a large JSON string in a DB table as a means to cache an API request. Good idea?

Is there anything wrong with this idea?
I have 3 JSON API endpoints whose data rarely changes. Yet every time someone opens my app, they are queried. There are about 20,000 rows among the 3 endpoints.
What I want to do, rather than hit my PostgreSQL DB for the data every time, which is very costly, is take the rendered JSON and put it into a new table called "cached_data", for example. I will take the JSON string, which might have 10,000 entries, and store it in a text column in the "cached_data" table for, say, 60 minutes.
Then when the endpoint is accessed, it will look at cached_data first. If it has been less than 60 minutes, the JSON string in cached_data is returned. If it has been more than 60 minutes, then we go to the database for the data.
Is this an acceptable idea/optimization? Can anyone see any issues with storing this large JSON string in a DB column?
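The scheme described above is essentially a small read-through cache keyed by endpoint. A minimal sketch of the lookup, assuming PostgreSQL, a cached_data table with a unique endpoint column and a timestamptz refreshed_at column (all names invented), and some existing build_fresh_json() function that renders the endpoint from the live tables:

```python
from datetime import datetime, timedelta, timezone

import psycopg2

CACHE_TTL = timedelta(minutes=60)

def get_endpoint_payload(conn, endpoint_name, build_fresh_json):
    """Return the cached JSON text if it is under 60 minutes old, else rebuild it."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT payload, refreshed_at FROM cached_data WHERE endpoint = %s",
            (endpoint_name,),
        )
        row = cur.fetchone()
        now = datetime.now(timezone.utc)
        if row and now - row[1] < CACHE_TTL:
            return row[0]                      # still fresh: serve the stored string

        payload = build_fresh_json()           # stale or missing: hit the real tables
        cur.execute(
            """INSERT INTO cached_data (endpoint, payload, refreshed_at)
               VALUES (%s, %s, %s)
               ON CONFLICT (endpoint) DO UPDATE
                   SET payload = EXCLUDED.payload,
                       refreshed_at = EXCLUDED.refreshed_at""",
            (endpoint_name, payload, now),
        )
        conn.commit()
        return payload
```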