Fetching a large number of records from MySQL through Java - mysql

There is a MySQL table, Users, on a server. It has 28 columns and about 1 million records (and it may grow). I want to fetch all rows from this table, do some manipulation on them, and then add them to MongoDB. I know it will take a lot of time to retrieve these records with a simple 'Select * from Users'. I have been doing this in Java with JDBC.
So, the options I found from my research are:
Option 1. Do batch processing: My plan was to get the total number of rows from the table, i.e. select count(*) from users, then set a fetch size of, say, 1000 (setFetchSize(1000)). After that I was stuck. I did not know whether I could write something like this:
Connection conn = DriverManager.getConnection(connectionUrl, userName, passWord);
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                                      java.sql.ResultSet.CONCUR_READ_ONLY);
String query = "select * from users";
ResultSet resultSet = stmt.executeQuery(query);
My doubt was whether resultSet would contain only 1000 entries once I executed the query, and whether I would have to repeat the operation until all records were retrieved.
I dropped the plan because, as I understand it, for MySQL the ResultSet is fully populated at once by default, so batching might not work. This Stack Overflow discussion and the MySQL documentation helped me out.
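That said, Connector/J does have a streaming mode that avoids loading everything into memory. A minimal sketch of what I mean, assuming the MySQL Connector/J driver (the connection URL, credentials and column name are just placeholders): creating a forward-only, read-only statement and setting the fetch size to Integer.MIN_VALUE makes the driver stream rows one at a time instead of buffering the whole result set.
import java.sql.*;
public class StreamUsers {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/mydb"; // placeholder connection URL
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            // Integer.MIN_VALUE tells Connector/J to stream rows instead of
            // materializing the whole result set in memory
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("select * from users")) {
                while (rs.next()) {
                    // process one row at a time, e.g. transform it and queue it for MongoDB
                    String name = rs.getString("name"); // 'name' is a placeholder column
                }
            }
        }
    }
}
Alternatively, adding useCursorFetch=true to the connection URL and setting a positive fetch size makes the driver fetch in server-side cursor chunks; either way, the row-by-row loop stays the same.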
Option 2. Do pagination: My idea is to use a LIMIT clause that specifies the starting offset and the number of rows to fetch. Maybe set the page size to 1000 and iterate over the offset.
I read a suggested article (link), but did not find any loopholes in approaching this problem using LIMIT.
To anybody kind enough and patient enough to read this long post: could you please share your opinions on my thought process and correct me if something is wrong or missing?

Answering my own question based on the research I did:
Batching is not really effective for select queries, especially if you want to use the result set of each query operation.
Pagination - good if you want to improve memory efficiency, not the speed of execution. Speed goes down when you fire multiple queries with LIMIT, because each query is an extra round trip to MySQL. (A rough sketch of what a paginated fetch might look like is below.)
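As a sketch of a paginated fetch that sidesteps both costs (reconnecting per page and increasingly expensive offsets), keyset pagination filters on the last seen primary key and reuses one connection and one prepared statement. This assumes an auto-increment primary key column named id; the URL, credentials and table name are placeholders.
import java.sql.*;
public class PageUsers {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/mydb"; // placeholder connection URL
        final int pageSize = 1000;
        long lastId = 0; // assumes an auto-increment primary key column 'id'
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "select * from users where id > ? order by id limit ?")) {
            while (true) {
                ps.setLong(1, lastId);
                ps.setInt(2, pageSize);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        lastId = rs.getLong("id");
                        // process the row here (e.g. transform and insert into MongoDB)
                    }
                }
                if (rows < pageSize) break; // last page reached
            }
        }
    }
}
Because the WHERE clause seeks directly to the last seen id, each page costs roughly the same, whereas a growing OFFSET forces MySQL to skip over all previously read rows.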

Related

Qt SQL `nextResult` function for MySQL Server 8.0: delayed execution per result set?

We are currently doing a lot of small queries. We execute a query, read the results, and then execute the next one. Since network requests cost a lot of time, this ping-ponging gets slow very fast.
This is why we want to do multiple queries at once, sending all data that the SQL server must know to it, and only retrieving one result (consisting of multiple result sets).
We found that Qt 5.14.1's QSqlQuery has the nextResult() function, but in the documentation (link) it says:
Some databases may execute all statements at once while others may delay the execution until the result set is actually accessed, [...].
MY QUESTION:
So, does MySQL Server 8.0 delay the execution until the result set is actually accessed? If so, then we still have a ping-pong for every query, right? That would still be very slow.
P.S. Our current solution to have just one ping-pong is to UNION the different result sets (resulting in a kind of block-diagonal matrix with lots and lots of NULL values), and this question is meant to find a better way to do this.
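I cannot speak for Qt, but for comparison, the analogous pattern in plain JDBC with MySQL Connector/J (which requires allowMultiQueries=true in the connection URL) sends several statements in one round trip and then walks the result sets with getMoreResults(), much like nextResult(); the table names below are only placeholders.
import java.sql.*;
public class MultiResultSets {
    public static void main(String[] args) throws SQLException {
        // allowMultiQueries=true lets Connector/J send several statements in one call
        String url = "jdbc:mysql://localhost:3306/mydb?allowMultiQueries=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // placeholder statements; all are sent to the server in a single round trip
            boolean isResultSet = stmt.execute(
                    "select * from customers; select * from orders; select * from invoices");
            while (true) {
                if (isResultSet) {
                    try (ResultSet rs = stmt.getResultSet()) {
                        while (rs.next()) {
                            // process rows of the current result set
                        }
                    }
                } else if (stmt.getUpdateCount() == -1) {
                    break; // no more results
                }
                isResultSet = stmt.getMoreResults(); // advance to the next result set
            }
        }
    }
}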

EclipseLink: ConnectionPools and native queries

We are using Spring (EclipseLink) on MariaDB. Our SQL via the ORM results in a long-running DB query, so I need to rewrite it as a native query - which is no big deal by itself. However, the result set is restricted by a LIMIT, and I also need the total count of all found records. For querying the total count I found the following solution for MariaDB.
My question:
Is it safe to run the two SQL commands separately, or should I send them combined with a UNION?
The question arises because, between my query and the SELECT FOUND_ROWS(), another query might interfere (from another request handled by the same microservice) and skew the result.
If both queries are done in the same transaction, InnoDB's MVCC should guarantee that the results will not be influenced by other transactions.
see: https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html
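For illustration, a minimal JDBC-level sketch of that approach, assuming the MariaDB (or MySQL) JDBC driver; the URL, credentials, table name and page size are placeholders. The important detail is that both statements run on the same connection, since FOUND_ROWS() is scoped to the session (and note that SQL_CALC_FOUND_ROWS is deprecated in MySQL as of 8.0.17, though MariaDB still supports it).
import java.sql.*;
public class PagedCountExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mariadb://localhost:3306/mydb"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(false); // keep both statements in one transaction
            // SQL_CALC_FOUND_ROWS asks the server to remember how many rows
            // matched before the LIMIT was applied (placeholder query)
            try (ResultSet rs = stmt.executeQuery(
                    "select SQL_CALC_FOUND_ROWS * from users order by id limit 50")) {
                while (rs.next()) {
                    // process the limited page of rows
                }
            }
            long total = 0;
            // FOUND_ROWS() is per-session, so it must run on the same connection
            try (ResultSet rs = stmt.executeQuery("select FOUND_ROWS()")) {
                if (rs.next()) {
                    total = rs.getLong(1);
                }
            }
            conn.commit();
            System.out.println("total matching rows: " + total);
        }
    }
}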

Will a MySQL SELECT statement interrupt INSERT statement?

I have a MySQL table that keeps gaining new records every 5 seconds.
The questions are:
Can I run a query on this set of data that may take more than 5 seconds?
If a SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statements?
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
Can I run a query on this set of data that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
That being said, slow queries can often benefit from optimizations, such as not scanning the entire table (i.e. searching using primary keys) and using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using primary keys, foreign keys or indexes, or is it scanning all table rows?).
If a SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statements?
"Affect" in what way? If you mean "prevent insert from actually inserting until the select has completed", that depends on the storage engine. For example, MyISAM and InnoDB are different, and that includes locking policies. For example, MyISAM tends to lock entire tables while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs on this for more details.
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock an entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results prior to the insert's update.
I understand that locking the database can prevent messing up the SELECT statement.
It can also introduce a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds, and depending on how frequently you run your queries and how efficiently they have been built.
What is the good practice when I need the data for calculations while that data will be updated within a short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested and to let the database do its job of ensuring the consistency and integrity of said data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date piece of consistent information (i.e. not providing a row where some columns have been updated, but others yet haven't).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking, and the answers depend on the way the database is configured.
Read : http://www.mysqltutorial.org/mysql-table-locking/
Performing a SELECT statement while an INSERT statement is running
If you want to perform a SELECT statement while an INSERT is still executing, you can poll from a separate connection, opening and closing the connection each time. For example, if I want to insert lots of records and want to know, via a SELECT query, when the last record has been inserted, I have to open and close the connection inside a for or while loop.
# send a request to store data (the insert statement runs here and takes a long time)
# then poll with a select statement in a loop, using a separate connection
while True:
    cnx.open()
    result = <select statement>
    cnx.close()
    if result:
        break  # stop polling once the expected record shows up

Reporting with MySQL - Simplest Query taking too long

I have a MySQL table on an Amazon RDS instance with 250,000 rows. When I try to
SELECT * FROM tableName
without any conditions (just for testing; the normal query specifies the columns I need, but I need most of them), the query takes between 20 and 60 seconds to execute. This will be the base query for my report, and the report should run in under 60 seconds, so I think this will not work out (it times out the moment I add the joins). The report runs without any problems in our smaller test environments.
Could it be that the query is taking so long because MySQL is trying to lock the table and is waiting for all writes to finish? There might be quite a lot of writes on this table. I am running the query on a MySQL slave, since I do not want to lock up the production system with my queries.
I have no sense of how many rows are a lot for a relational DB. Are 250,000 rows with ~30 columns (varchar, date and integer types) a lot?
How can I speed up this query (hardware, software, query optimization, ...)?
Can I tell MySQL that I do not care that the data might be slightly inconsistent (it is a snapshot from a reporting database)?
Is there a chance that this query will run in under 60 seconds, or do I have to adjust my goals?
Remember that MySQL has to prepare your result set and transport it to your client. In your case, this could be 200 MB of data it has to shuttle across the connection, so 20 seconds is not bad at all. Most libraries, by default, wait for the entire result to be received before forwarding it to the application.
To speed it up, fetch only the columns you need, or do it in chunks with LIMIT. SELECT * is usually a sign that someone's being super lazy and not optimizing at all.
If your library supports streaming resultsets, use that, as then you can start getting data almost immediately. It'll allow you to iterate on rows as they come in without buffering the entire result.
A table with 250,000 rows is not too big for MySQL at all.
However, waiting for those rows to be returned to the application does take time. That is network time, and there are probably a lot of hops between you and Amazon.
Unless your report is really going to process all the data, check the performance of the database with a simpler query, such as:
select count(*) from table;
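If it helps to separate database time from transfer time, a rough sketch of doing that from the application side (placeholder URL and credentials, using the tableName from the question) is to time the cheap count(*) against the full select:
import java.sql.*;
public class TimingCheck {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://your-rds-host:3306/mydb"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            long t0 = System.currentTimeMillis();
            try (ResultSet rs = stmt.executeQuery("select count(*) from tableName")) {
                rs.next();
            }
            long t1 = System.currentTimeMillis();
            int rows = 0;
            try (ResultSet rs = stmt.executeQuery("select * from tableName")) {
                while (rs.next()) rows++; // forces the full result to be transferred
            }
            long t2 = System.currentTimeMillis();
            System.out.println("count(*): " + (t1 - t0) + " ms, full select of "
                    + rows + " rows: " + (t2 - t1) + " ms");
        }
    }
}
If count(*) is fast but the full select is slow, the bottleneck is most likely the size of the data being shipped, not the query itself.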
EDIT:
Your problem is unlikely to be due to the database. It is probably due to network traffic. As mentioned in another answer, streaming might solve the problem. You might also be able to play with the data formats to get the total size down to something more reasonable.
A last-resort step would be to save the data in a text file, compress the file, move it over, and uncompress it. Although this sounds like a lot of work, you might get 5x - 10x compression on the data, saving oodles of time on the transmission and still have a large improvement in performance with the rest of the processing.
I got updated specs from my client and was able to reduce the number of users returned to 250, which goes through (with a lot of JOINs) in 60 seconds.
So maybe the answer really is: try not to dump a whole table with a query; fetch only the exact data you need. The client has SQL access, and he will have to update his queries so that only relevant users are returned.
You should never really use * as a wildcard. Select only the fields you actually need, and then create a combined index on those fields.
If you have thousands of rows, another option is to implement pagination.
If the result data is used directly for a report, no one is going to look at more than 100 rows in a single shot anyway.

Dilemma about MySQL: using a condition to limit load on a DB

I have a table of about 800,000 records. It's basically a log which I query often.
I added a condition to query only the entries from the last month, in an attempt to reduce the load on the database.
My thinking is:
a) if the database only goes through that one month of entries and then returns them, it's good.
b) if the database goes through the whole table, checking the condition against every single record, it's actually worse than having no condition at all.
What is your opinion?
How would you go about reducing the load on a DB?
If the field containing the entry date is keyed/indexed, and is used by the DB software to optimize the query, that should reduce the set of rows examined to the rows matching that date range.
That said, it's commonly understood that you are better off optimizing queries, indexes, database server settings and hardware, in that order. Changing how you query your data can reduce the impact of a badly formulated query a millionfold, depending on the dataset.
If there are no obvious areas for speedup in how the query itself is formulated (joins done correctly or no joins needed, effective use of indexes), adding indexes to help your common queries would be a good next step.
If you want more information about how the database is going to execute your query, you can use the MySQL EXPLAIN command to find out. For example, it will tell you whether the query is able to use an index.
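As an illustrative sketch only, with a made-up log table, column and index name: adding an index on the entry-date column and then running EXPLAIN on the date-filtered query shows, in the type and key columns of the plan, whether MySQL does a range scan on the index or a full table scan.
import java.sql.*;
public class ExplainCheck {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/mydb"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // hypothetical log table with an 'entry_date' column;
            // one-time setup, fails if the index already exists
            stmt.execute("create index idx_log_entry_date on log_entries (entry_date)");
            // EXPLAIN returns a result set describing the query plan;
            // look at 'type' (e.g. range vs ALL) and 'key' to see whether the index is used
            try (ResultSet rs = stmt.executeQuery(
                    "explain select * from log_entries "
                    + "where entry_date >= date_sub(now(), interval 1 month)")) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        System.out.print(md.getColumnLabel(i) + "=" + rs.getString(i) + "  ");
                    }
                    System.out.println();
                }
            }
        }
    }
}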