Node.js: Fetching a million rows from MySQL and processing the stream

I have a MySQL table with over 100 million rows. This table is a production table from which a lot of read requests are served. I need to fetch, let's say, a million rows from this table, process the rows in a Node.js script, and then store the data in Elasticsearch.
I have done this a lot with MongoDB without facing any issues. I start a read stream, pause it after every 1000 rows, and once the downstream processing is complete for that set of 1000 rows, I resume the stream and keep going until all the rows are processed. I have not faced any performance challenges with MongoDB, since the find query returns a cursor which fetches results in batches. So, irrespective of how big my input is, it does not cause any issues.
Now, I don't know how streaming queries in MySQL work under the hood. Will my approach work for MySQL, or will I have to execute the query again and again, i.e. the first SELECT fetches the rows where id < 1000, the next fetches the rows where id is between 1000 and 2000, and so on? Has anybody worked on a similar problem before? I found a similar question on Stack Overflow, but there was no answer.
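A minimal sketch of how the pause/resume pattern described above would look with the node mysql driver's streaming interface, in TypeScript, assuming the mysql package with @types/mysql; the connection settings, table and column names, and the indexBatch stand-in for the Elasticsearch step are all hypothetical:

import * as mysql from "mysql"; // assumes the mysql driver and @types/mysql

const connection = mysql.createConnection({
  host: "localhost",          // hypothetical connection settings
  user: "reader",
  password: "secret",
  database: "production_db",
});

const BATCH_SIZE = 1000;
const batch: any[] = [];

connection
  .query("SELECT id, payload FROM big_table") // hypothetical table and columns
  .on("error", (err) => {
    console.error(err);
    connection.destroy();
  })
  .on("result", (row) => {
    batch.push(row);
    if (batch.length >= BATCH_SIZE) {
      // Stop reading more rows off the socket while this batch is handled
      // downstream, then resume once it has been indexed.
      connection.pause();
      indexBatch(batch.splice(0, batch.length)).then(() => connection.resume());
    }
  })
  .on("end", () => {
    // Flush the final, partial batch and close the connection.
    indexBatch(batch).then(() => connection.end());
  });

// Stand-in for the Elasticsearch bulk-indexing step.
async function indexBatch(rows: any[]): Promise<void> { /* ... */ }

The driver emits rows as it parses them off the socket, so pausing the connection applies backpressure rather than buffering the whole result set in the Node process.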

You can use LIMIT (starting point), (number of records) to fetch a chunk at a time, and after every iteration advance the starting point by a multiple of the number of records, like this:
LIMIT 0,1000 starts fetching from the zeroth row and returns 1000 records; next time use LIMIT 1000,1000, which fetches the next 1000 records starting from the 1000th row.
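A sketch of that LIMIT-based approach as a loop, in TypeScript, assuming the mysql2 driver and the same hypothetical table; the ORDER BY is there so successive pages do not overlap or skip rows:

import * as mysql from "mysql2/promise"; // assumes the mysql2 driver

const BATCH_SIZE = 1000;

async function exportInChunks(): Promise<void> {
  const conn = await mysql.createConnection({
    host: "localhost",          // hypothetical connection settings
    user: "reader",
    password: "secret",
    database: "production_db",
  });

  for (let offset = 0; ; offset += BATCH_SIZE) {
    // LIMIT <starting point>, <number of records>: one page per query.
    const [rows] = await conn.query(
      "SELECT id, payload FROM big_table ORDER BY id LIMIT ?, ?",
      [offset, BATCH_SIZE]
    );
    const page = rows as any[];
    if (page.length === 0) break; // past the last row
    await processPage(page);      // e.g. bulk-index into Elasticsearch
  }

  await conn.end();
}

// Stand-in for the downstream processing step.
async function processPage(rows: any[]): Promise<void> { /* ... */ }

One caveat: as the offset grows, MySQL still scans and discards all the skipped rows before returning each page, so later pages get progressively slower; if the ids are roughly sequential, seeking instead (WHERE id > last_seen_id ORDER BY id LIMIT 1000) avoids that.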

Related

What is the best way to execute a bulk update of all records in a large table

What is the best way to execute a bulk update of all records in a large table in MySQL?
During the sanitization process, we are updating all rows in the users table, which has 28M rows, to mask a few columns. This is currently taking right around 2 hours to complete in a rake task, and the AWS session expiration is also 2 hours. If the rake task takes longer than the session expiration, the build will fail.
Due to the large number of records, we are updating 25K rows at a time using find_in_batches and then update_all on the results. We throttle between batches by sleeping for 0.1s to avoid high CPU.
So the question is, is there any way we can optimize the bulk update further, or shall we increase the AWS session expiration to 3 hours?
One option could be to batch by id ranges, rather than by exact batch sizes. So update between id 1-100000, followed by 100001-200000, and so on. This avoids the large sets of ids being passed around. As there will be gaps in the ids, each batch would be a different size, but this might not be an issue.
Thanks for your input.
For such large updates the overhead of fetching records and instantiating AR objects is very significant (and there will also be slowdown from GC). The fastest way to execute is to write a raw SQL query that does the update (or use update_all to construct it, which is very similar, but allows you to use scopes/joins on relations).
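The core of both suggestions is really just SQL. Here is a rough sketch of the id-range batching with one raw UPDATE per range, written as a small Node/TypeScript driver with the mysql2 package purely for illustration (the same loop translates directly to a rake task); the table, masking expression, range size, and connection settings are all assumptions:

import * as mysql from "mysql2/promise"; // assumes the mysql2 driver

const RANGE_SIZE = 100000; // id range per batch (an assumption)

async function maskUsers(): Promise<void> {
  const conn = await mysql.createConnection({
    host: "localhost",   // hypothetical connection settings
    user: "admin",
    password: "secret",
    database: "app_db",
  });

  // Find the highest id so we know when to stop.
  const [rows] = await conn.query("SELECT MAX(id) AS maxId FROM users");
  const maxId = (rows as any[])[0].maxId ?? 0;

  for (let start = 1; start <= maxId; start += RANGE_SIZE) {
    // One raw UPDATE per id range: no rows are fetched and no ORM objects
    // are instantiated; gaps in the ids just make some batches smaller.
    await conn.query(
      "UPDATE users SET email = CONCAT('user', id, '@example.invalid') " + // hypothetical masking
      "WHERE id BETWEEN ? AND ?",
      [start, start + RANGE_SIZE - 1]
    );
    await new Promise((r) => setTimeout(r, 100)); // keep the 0.1s throttle from the original task
  }

  await conn.end();
}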

How to load a large data set from a procedure call (Table Input) into a destination table efficiently through Pentaho Kettle?

I am planning to split the rows returned by the procedure and then load them into a table in a loop, so that a certain number of rows is loaded into the table on each iteration.
I am unsure of the way to do it.
My Process: Table Input (calling a procedure - returns 900 million records) -> Data conversion -> Insert / update step (incremental loading to a target table).
Now I have to retrieve a few records (say 1 million at a time) from the procedure, based on some field in the procedure, and then load them into the table. This has to be iterated until the end of the rows from the procedure.
Kindly help me on this.
I don't really see a problem with this, other than the time it takes to process that many rows. PDI (Spoon/Kettle) works with streams, not "data sets" like in SQL, and rows are processed as soon as they are received. Because of this, PDI will likely never have to deal with all 900 million rows at once, and you will not have to wait for all of them to be returned from SQL before it starts processing.
The Table output step has a Commit size value to control how many records are sent to your target table in one transaction. The trick is to balance the amount of time it takes to start new connections with the time it takes to process a large number of rows in one transaction. I run values from 200 to 5000, depending on my needs and the system's ability, but you may be able to go higher than that.
It sounds like your bigger problem will be returning that many rows from a Stored Procedure. Using an SP instead of a SELECT or VIEW means you will have to find ways to keep memory pressure low.
I have a few multi-million row tables, and I create TEMP tables (non-table variables) to store data while processing, using a single SELECT * FROM temp..table at the end of the SP. This streams the data from the server as expected and uses the minimum amount of memory.

Fetching large number of records from MySQL through Java

There is a MySQL table, Users, on a server. It has 28 columns and 1 million records (which may increase as well). I want to fetch all rows from this table, do some manipulation on them, and then add them to MongoDB. I know that it will take a lot of time to retrieve these records with a simple 'SELECT * FROM Users' query. I have been doing this in Java with JDBC.
So, the options I got from my research are:
Option 1. Do batch processing: My plan was to get the total number of rows from the table, i.e. select count(*) from users. Then, set a fetch size of, say, 1000 (setFetchSize(1000)). After that I was stuck. I did not know whether I could write something like this:
Connection conn = DriverManager.getConnection(connectionUrl, userName, passWord);
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
String query = "select * from users";
ResultSet resultSet = stmt.executeQuery(query);
My doubt was whether resultSet would contain only 1000 entries once I executed the query, and whether I should repeat the operation until all records were retrieved.
I dropped the plan because I understand that for MySQL the ResultSet is fully populated at once, so batching might not work. This Stack Overflow discussion and the MySQL documentation helped out.
Option 2. Do pagination: My idea is to use a LIMIT that specifies the starting offset and the number of rows to fetch, say 1000 rows at a time, and iterate over the offset.
I read a suggested article (link), but did not find any loopholes in approaching this problem using LIMIT.
Anybody who is kind enough and patient enough to read this long post, could you please share your valuable opinions on my thought process and correct me if there is something wrong or missing?
Answering my own question based on the research I did:
Batching is not really effective for select queries, especially if you want to use the result set of each query operation.
Pagination - good if you want to improve memory efficiency, not for improving speed of execution. Speed drops as you fire multiple queries with LIMIT, since each query is a separate round trip and MySQL has to scan past all the skipped offset rows before returning each page.

How can I parallelize Writes to the same row in MySQL?

I'm currently building a system that does running computations and, every 5 seconds, inserts or updates information based on those computations into a few rows in MySQL. I'm now working on running this system on a few different servers at once, with a few agents that each do similar processing and then write to the same set of rows. I already randomize the order in which each agent writes its set of rows, but there are still a lot of deadlocks happening. What's the best/fastest way to get through those deadlocks? Should I just rerun the query each time one happens, use row locks, or something else entirely?
I suggest you try something that won't require more than one client to update your 'few rows.'
For example, you could have each agent that produces results do an INSERT into a staging table that uses the MEMORY storage engine.
Then, every five seconds you can run a MySQL event (a scheduled task inside the server) that loops through all the rows in that table, posting their results to your 'few rows' and then deleting them. If it's important for the rows in your staging table to be processed in order, you can use an AUTO_INCREMENT id field. But it might not be important for them to be in order.
If you want to get fancier and more scalable than that, you'll need a queue management system like Apache ActiveMQ.
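A rough sketch of the staging-table idea, in TypeScript with the mysql2 driver; the schema, names, and the shape of the 'few rows' are all hypothetical. Each agent only ever INSERTs into the MEMORY table, so the agents never contend for the same rows; a single consumer (a scheduled MySQL event or one worker process) folds the staged results into the live rows every five seconds:

import * as mysql from "mysql2/promise"; // assumes the mysql2 driver

// Run once at deployment time, e.g. await conn.query(CREATE_STAGING):
// a MEMORY staging table that agents append to.
const CREATE_STAGING = `
  CREATE TABLE IF NOT EXISTS result_staging (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,
    row_key  INT NOT NULL,      -- which of the 'few rows' this result belongs to
    value    DOUBLE NOT NULL,   -- hypothetical computed value
    agent_id VARCHAR(64) NOT NULL
  ) ENGINE = MEMORY`;

// Called by every agent: append-only, so concurrent agents never deadlock here.
async function publishResult(conn: mysql.Connection, rowKey: number, value: number, agentId: string): Promise<void> {
  await conn.query(
    "INSERT INTO result_staging (row_key, value, agent_id) VALUES (?, ?, ?)",
    [rowKey, value, agentId]
  );
}

// Run by ONE consumer every 5 seconds: apply the newest staged value per live
// row, then clear only what was applied.
async function sweepStaging(conn: mysql.Connection): Promise<void> {
  const [rows] = await conn.query("SELECT COALESCE(MAX(id), 0) AS maxId FROM result_staging");
  const maxId = (rows as any[])[0].maxId;
  if (maxId === 0) return;

  await conn.query(
    `UPDATE live_results lr
       JOIN (SELECT row_key, MAX(id) AS id FROM result_staging
              WHERE id <= ? GROUP BY row_key) latest ON latest.row_key = lr.row_key
       JOIN result_staging sr ON sr.id = latest.id
        SET lr.value = sr.value`,
    [maxId]
  );
  // Anything inserted mid-sweep (id > maxId) survives for the next pass.
  await conn.query("DELETE FROM result_staging WHERE id <= ?", [maxId]);
}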

How do I memory-efficiently process all rows of a MySQL table?

I have a MySQL table with 237 million rows. I want to process all of these rows and update them with new values.
I do have sequential IDs, so I could just use a lot of select statements:
where id = '1'
where id = '2'
This is the method mentioned in Sequentially run through a MYSQL table with 1,000,000 records?.
But I'd like to know if there is a faster way, using something like a cursor, similar to how you would sequentially read through a big file without needing to load the whole thing into memory. The way I see it, a cursor would be much faster than running millions of select statements to get the data back in manageable chunks.
Ideally, you get the DBMS to do the work for you: you write the SQL statement so it runs entirely in the database, without returning data to the application. Apart from anything else, this saves the overhead of 237 million messages to the client and 237 million messages back to the server.
Whether this is feasible depends on the nature of the update:
Can the DBMS determine what the new values should be?
Can you get the necessary data into the database so that the DBMS can determine what the new values should be?
Will every single one of the 237 million rows be changed, or only a subset?
Can the DBMS be used to determine the subset?
Will any of the id values be changed at all?
If the id values will never be changed, then you can arrange to partition the data into manageable subsets, for any flexible definition of 'manageable'.
You may need to consider transaction boundaries; can it all be done in a single transaction without blowing out the logs? If you do operations in subsets rather than as a single atomic transaction, what will you do if your driving process crashes at 197 million rows processed? Or the DBMS crashes at that point? How will you know where to resume operations to complete the processing?
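To make the 'manageable subsets' and 'where to resume' points concrete, here is a rough sketch of an id-range driver in TypeScript, assuming the mysql2 package; the table, update expression, chunk size, and progress table are all hypothetical. Each chunk is committed together with its high-water mark, so a crashed run can pick up where it left off:

import * as mysql from "mysql2/promise"; // assumes the mysql2 driver

const CHUNK = 100000; // id range per chunk (an assumption)

async function updateInChunks(): Promise<void> {
  const conn = await mysql.createConnection({
    host: "localhost",   // hypothetical connection settings
    user: "admin",
    password: "secret",
    database: "warehouse",
  });

  // The progress table records the high-water mark for resumability.
  await conn.query(
    "CREATE TABLE IF NOT EXISTS update_progress (job VARCHAR(64) PRIMARY KEY, last_id BIGINT NOT NULL)"
  );
  const [p] = await conn.query("SELECT last_id FROM update_progress WHERE job = 'bulk_update'");
  let lastId: number = (p as any[])[0]?.last_id ?? 0;

  const [m] = await conn.query("SELECT COALESCE(MAX(id), 0) AS maxId FROM items"); // hypothetical table
  const maxId: number = (m as any[])[0].maxId;

  while (lastId < maxId) {
    const upper = lastId + CHUNK;
    await conn.beginTransaction();
    try {
      // The new value is computed entirely inside the database:
      // no rows travel to the client and back.
      await conn.query(
        "UPDATE items SET value = value * 1.1 WHERE id > ? AND id <= ?", // hypothetical expression
        [lastId, upper]
      );
      await conn.query(
        "INSERT INTO update_progress (job, last_id) VALUES ('bulk_update', ?) " +
        "ON DUPLICATE KEY UPDATE last_id = VALUES(last_id)",
        [upper]
      );
      await conn.commit();
      lastId = upper;
    } catch (err) {
      await conn.rollback();
      throw err;
    }
  }

  await conn.end();
}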