Long-open database connection and SQL join query memory - MySQL

I'm currently working on a Java application which performs the following in a background thread:
Opens a database connection
Selects some rows (100,000+ rows)
Performs a long-running task for each row by calling ResultSet.next(), with a buffer size defined by resultSet.setFetchSize()
Finally, after everything is done, closes the connection
If the query does some sorting or joining, it will create a temp table and use some additional memory. My question is: if my database connection stays open for a long time (let's say a few hours) and fetches batch by batch slowly, will it cause performance trouble in the database due to memory usage (if the database is concurrently used by other threads as well), or are databases designed to handle these things effectively?
(In the context of both MySQL and Oracle)
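For reference, here is a minimal Java sketch of the pattern described above. The connection details, table and column names are placeholders, and note that MySQL Connector/J only streams row by row if the fetch size is Integer.MIN_VALUE or useCursorFetch=true is set, while Oracle's driver honours a positive fetch size directly.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BackgroundExporter {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {   // placeholder credentials
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, payload FROM big_table",                     // placeholder query
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                // Ask the driver to fetch rows in batches instead of buffering the whole result.
                stmt.setFetchSize(1000);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getLong("id"), rs.getString("payload"));  // long-running work per row
                    }
                }
            }
        } // the connection stays open for the whole loop, possibly hours
    }

    private static void process(long id, String payload) {
        // placeholder for the long-running task
    }
}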

From an Oracle perspective, opening a cursor and fetching from it periodically doesn't have much of an impact if it's left open... unless the underlying data that the cursor is querying against has changed since the query was first started.
If so, the Oracle database now has to do additional work to find the data as it was at the start of the query (because of read consistency), so it needs to query the data blocks (either on disk or from the buffer cache) and, in the event the data has changed, the undo tablespace.
If the undo tablespace is not sized appropriately and enough data has changed, you may find that your cursor fetches fail with an "ORA-01555: snapshot too old" exception.
In terms of memory usage, a cursor doesn't open a result set and store it somewhere for you; it's simply a set of instructions to the database on how to get the next row that gets executed when you do a fetch. What gets stored in memory is that set of instructions, which is relatively small when compared to the amount of data it can return!

This mechanism doesn't seem like a good idea.
Although both MySQL (InnoDB engine) and Oracle provide consistent reads for SELECT, such a long-running SELECT can lead to performance degradation from building consistent-read blocks and other work, and even ORA-01555 in Oracle.
I think you should query/export all the data first, then process the actual business logic row by row.
Querying all the data first will not reduce the memory usage, but it will reduce the continuous time during which memory and temporary sort segments/files are in use.
Alternatively, you should consider splitting the whole job into small pieces; that is better.

Related

Resources consumed by a simple SELECT query in MySql

There are a few large tables in one of a customer's databases (each table is ~50M rows and not too wide). The intent is to infrequently read these tables (completely). As there are no reasonable CDC indices present, the plan is to read the tables by querying them
SELECT * from large_table;
The reads will be performed using a JDBC driver. With the following fetch configuration present, the intent is to read the data approximately one record at a time (it may require a significant amount of time) so that the client code is never overwhelmed.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE); // tells MySQL Connector/J to stream rows one at a time instead of buffering the whole result set
I was going through the execution path of a query in High Performance MySQL; however, some questions seemed unanswered:
Without the temp tables being explicitly created and the query cache being made use of, "how" are the stream reads tracked on the server?
Is any temporary data created (in main memory or files on disk) whatsoever? If so, where is it created and how much?
If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? In case there are several such queries running on the server, are the earliest "Tracked" files purged in favor of queries submitted recently?
PS: I want to understand the effect of this approach on the MySQL server (not saying that there aren't better ways of reading the tables)
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details in JDBC on specifying it.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
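A hedged sketch of that keyset idea (the table and column names are invented): each chunk is a short, independent query that starts after the last PRIMARY KEY value seen, rather than using OFFSET.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class ChunkedReader {
    // Reads large_table in chunks of 10,000 rows by remembering where we left off
    // on the primary key, instead of using OFFSET.
    static void readInChunks(Connection conn) throws Exception {
        long lastId = 0;               // assumes a numeric PRIMARY KEY with values > 0
        final int chunkSize = 10_000;
        while (true) {
            int rows = 0;
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, col1 FROM large_table WHERE id > ? ORDER BY id LIMIT ?")) {
                stmt.setLong(1, lastId);
                stmt.setInt(2, chunkSize);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");    // remember where we left off
                        handle(rs.getString("col1")); // placeholder for per-row work
                        rows++;
                    }
                }
            }
            if (rows < chunkSize) break;              // last chunk reached
        }
    }

    static void handle(String col1) { /* placeholder */ }
}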
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
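A small hedged sketch of that snapshot behaviour using two JDBC connections; the table t(id, val) and the connection URL are made up, and the default InnoDB isolation level (REPEATABLE READ) is assumed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SnapshotDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/test"; // placeholder URL
        try (Connection reader = DriverManager.getConnection(url, "user", "pw");
             Connection writer = DriverManager.getConnection(url, "user", "pw")) {

            reader.setAutoCommit(false); // start an explicit transaction (default REPEATABLE READ)
            try (Statement r = reader.createStatement();
                 ResultSet rs = r.executeQuery("SELECT val FROM t WHERE id = 1")) {
                rs.next();
                System.out.println("reader sees: " + rs.getInt(1)); // snapshot established here
            }

            try (Statement w = writer.createStatement()) {
                w.executeUpdate("UPDATE t SET val = val + 1 WHERE id = 1"); // autocommit write
            }

            try (Statement r = reader.createStatement();
                 ResultSet rs = r.executeQuery("SELECT val FROM t WHERE id = 1")) {
                rs.next();
                // Still the old value: the reader's snapshot predates the writer's commit.
                System.out.println("reader still sees: " + rs.getInt(1));
            }
            reader.commit(); // after this, a new query would see the updated value
        }
    }
}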

Long running innodb query generate a big undo file in mariadb

I have a big query in PHP using MYSQLI_USE_RESULT so as not to load all the results into PHP memory.
If I used MYSQLI_STORE_RESULT, it would load the data for all results into memory, which takes multiple GB of RAM, instead of fetching row by row.
The query returns millions of rows and each row will generate an API request, so it will be running for days.
In the meantime, I have other MySQL queries that update/insert into the tables involved in the first query, and I think this causes the undo log to grow without stopping.
I set up innodb_undo_tablespaces=2 and innodb_undo_log_truncate = ON
so the undo log is separated from ibdata1, but the undo files are still big until I kill the queries that have been running for days.
I executed "SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;" before running the long-running query, hoping that it would keep the undo file from growing, but it didn't.
The other queries that are updating/inserting have autocommit.
In 1-2 days, the undo file is already 40 GB.
The question: how do I prevent this undo file from growing? I don't want to keep the previous version of the data while the query is running; it's not important if I get updated data instead of the data as it was at the time of the query.
Regardless of your transaction isolation level, a given query will always establish a fixed snapshot, which requires the data to be preserved in the state it was when the query started.
In other words, READ-COMMITTED or READ-UNCOMMITTED allow subsequent queries in the same transaction to see updated data, but a single query will never see a changing data set. Thus concurrent updates to data will force old record versions to be copied to the undo log, and those record versions will be preserved there until your long-running query is finished.
READ-UNCOMMITTED doesn't help any more than READ-COMMITTED. In fact, I've never needed to use READ-UNCOMMITTED for any reason. Allowing "dirty reads" of unfinished transactions breaks rules of ACID databases, and leads to anomalies.
The only way to avoid long-lasting growth of your undo log is to finish your query.
The simplest way to achieve this is to use multiple short-running queries, each fetching a subset of the result. Finish each query in a timely way.
Another solution would be to run the whole query for the millions of rows of results, and store the result somewhere that isn't constrained by InnoDB transaction isolation (a sketch of the plain-file option follows this list):
MyISAM table
Message queue
Plain file on disk
Cache like Memcached or Redis
PHP memory (but you said you aren't comfortable with this because of the size)
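For instance, here is a hedged sketch of the "plain file on disk" option, written as JDBC rather than PHP (table and column names are made up, and the Integer.MIN_VALUE streaming trick is specific to MySQL Connector/J): stream the rows out once, let the query finish, then do the slow API work from the file.

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.stream.Stream;

class ExportThenProcess {
    // Stream the result set to a TSV file so the SELECT (and the read view that
    // pins the undo log) finishes quickly; the days-long API work then reads the file.
    static void run(Connection conn) throws Exception {
        Path dump = Path.of("rows.tsv");
        try (Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);  // stream from MySQL instead of buffering everything
            try (ResultSet rs = stmt.executeQuery("SELECT id, url FROM work_items"); // made-up table
                 BufferedWriter out = Files.newBufferedWriter(dump)) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + "\t" + rs.getString("url"));
                    out.newLine();
                }
            }
        } // the query and its transaction end here; undo records can be purged again

        try (Stream<String> lines = Files.lines(dump)) {
            lines.forEach(ExportThenProcess::callApi); // slow per-row work, with no open query
        }
    }

    static void callApi(String line) { /* placeholder for the API request */ }
}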

MySQL cache of subset queries

I am attempting to make a query run on a large database in an acceptable time. I'm looking at optimizing the query itself (e.g. Clarification of join order for creation of temporary tables), which took me from not being able to complete the query at all (with a 20 hr cap) to completing it, but with a runtime that's still not acceptable.
In experimenting, I found the following strange behavior that I'd like to understand: I want to do the query over a time range of 2 years. If I try to run it like that directly, then it still will not complete within the 10 min I'm allowing for the test. If I reduce it to the first 6 months of the range, it will complete pretty quickly. If I then incrementally re-run the query by adding a couple of months to the range (i.e. run it for 8 months, then 10 months, up to the full 2 yrs), each successive attempt will complete and I can bootstrap my way up to being able to get the full two years that I want.
I suspected that this might be possible due to caching of results by the MySQL server, but that does not seem to match the documentation:
If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again.
http://dev.mysql.com/doc/refman/5.7/en/query-cache.html
The key word there seems to be "identical," and the apparent requirement that the queries be identical was reinforced by other reading that I did. (The docs even indicate that the comparison of the query is literal, to the point that logically equivalent queries written with "SELECT" vs. "select" would not match.) In my case, each subsequent query contains the full range of the previous query, but no two of them are identical.
Additionally, the tables are updated overnight. So at the end of the day yesterday we had the full, 2-yr query running in 19 sec when, presumably, it was cached since we had by that point obtained the full result at least once. Today we cannot make the query run anymore, which would seem to be consistent with the cache having been invalidated when the table was updated last night.
So the questions: Is there some special case that allows the server to cache in this case? If yes, where is that documented? If not, any suggestion on what else would lead to this behavior?
Yes, there is a cache that optimizes (general) access to the hard drive. It is actually a very important part of every storage-based database system, because reading data from (or writing e.g. temporary data to) the hard drive is usually the most relevant bottleneck for most queries.
For InnoDB, this is called the InnoDB Buffer Pool:
InnoDB maintains a storage area called the buffer pool for caching data and indexes in memory. Knowing how the InnoDB buffer pool works, and taking advantage of it to keep frequently accessed data in memory, is an important aspect of MySQL tuning. For information about how the InnoDB buffer pool works, see InnoDB Buffer Pool LRU Algorithm.
You can configure the various aspects of the InnoDB buffer pool to improve performance.
Ideally, you set the size of the buffer pool to as large a value as practical, leaving enough memory for other processes on the server to run without excessive paging. The larger the buffer pool, the more InnoDB acts like an in-memory database, reading data from disk once and then accessing the data from memory during subsequent reads. See Section 15.6.3.2, “Configuring InnoDB Buffer Pool Size”.
There can be (and have been) written books about the buffer pool, how it works and how to optimize it, so I will stop there and just leave you with this keyword and refer you to the documentation.
Basically, your subsequent reads add data to the cache that can be reused until it has been replaced by other data (which in your case has happened the next day). Since (for MySQL) this can be any read of the involved tables and doesn't have to be your maybe complicated query, it might make the "prefetching" easier for you.
Although the following comes with a disclaimer, because changing your configuration obviously can have a negative impact on your server: the default MySQL configuration is very (very) conservative, and e.g. the innodb_buffer_pool_size system setting is way too low for most servers younger than 15 years, so maybe have a look at your configuration (or let your system administrator check it).
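If you want a quick look at how your buffer pool is doing, something like the following sketch (connection details are placeholders; the variable and status names are the standard InnoDB ones) shows the configured size and how often reads had to go to disk:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BufferPoolCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "password"); // placeholder credentials
             Statement stmt = conn.createStatement()) {

            try (ResultSet rs = stmt.executeQuery(
                     "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")) {
                if (rs.next()) {
                    System.out.println("buffer pool size (bytes): " + rs.getString("Value"));
                }
            }

            // Innodb_buffer_pool_reads = reads that missed the cache and went to disk;
            // Innodb_buffer_pool_read_requests = all logical read requests.
            try (ResultSet rs = stmt.executeQuery(
                     "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
                }
            }
        }
    }
}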
We did some experimentation, including checking the effect of the settings noted in the answer by @Solarflare. In our case, we concluded that the apparent caching was real, but it had nothing to do with MySQL at all. It was instead caused by the Linux disk cache. We were able to verify this by manually flushing that cache before and after getting a result and comparing the times.

MySQL (MariaDB): How to figure out why PROCESSLIST MEMORY_USAGE is continuously growing unexpectedly

I have a database with a few tables in it. One of the tables has ~3000 rows, each with ~20 columns. Every 30 seconds one of the rows in the table is UPDATEd with new information. I'm having a problem where sometimes (infrequently) the memory being used by the process that is updating the rows starts increasing "indefinitely" (I stop the process before it stops growing, but I'm sure it stops at some upper limit). The database is not growing during this time. Only existing rows are being updated.
I'm looking for ideas on what could cause the memory usage to start going up, so that I can prevent it from happening. Since most of the time the memory usage stays the same despite running the same update process, I'm not sure what condition is triggering the failure state (growing memory usage), so I cannot recreate the failure on demand.
The table is using the Memory engine and I've seen the same failure using the InnoDB engine.
The MEMORY_USAGE I'm looking at is in the table returned by the query below. Are there other MySQL variables I can look at to get a better idea of what specifically is using up the memory?
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST
I found my bug. To anyone else who ends up here: remember to call mysql_free_result() (I had a case where I wasn't calling it).
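mysql_free_result() belongs to the C/PHP APIs; the JDBC analogue of the same mistake is never closing the ResultSet and Statement. A hedged sketch of a leak-free polling call (MEMORY_USAGE is a MariaDB-specific PROCESSLIST column):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class ProcesslistPoller {
    // try-with-resources closes the ResultSet and Statement deterministically,
    // so nothing accumulates across polling iterations.
    static void poll(Connection conn) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT USER, MEMORY_USAGE FROM INFORMATION_SCHEMA.PROCESSLIST");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("USER") + " " + rs.getLong("MEMORY_USAGE"));
            }
        } // result and statement are freed here on every call
    }
}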

When does a slow MySQL query on a given connection affect other connections?

I think I have a basic understanding of this, but am hoping that someone can give me more details as I am interested in learning more about database performance.
Let's say I have a very large database, with many millions of entries, and the database supports many connections. Doing simple queries on the database will be slow, as there's so much data. I'm trying to understand exactly when a query on a given connection starts to have a direct effect on the performance of queries running on other connections.
If one connection locks some elements, I understand that this will hold up queries running on the other connections that need those elements. For example, doing:
SELECT FOR UPDATE
will lock what you are selecting.
What happens when you do something simple like:
SELECT COUNT(*) FROM myTable
Let's say we have a table with a billion rows, so running the count is going to take some time (running on InnoDB). Will it affect queries running on other connections?
What if you select a large amount of data using SELECT and JOIN, like:
SELECT * FROM myTable1 JOIN myTable2 ON myTable1.id = myTable2.id;
Does having a join lock anything for other queries?
I'm finding it hard to know which queries will have a direct effect on the performance of queries running on other connections.
Thanks
There are different angles:
Row locking: this shouldn't happen if you tune your architecture, so you should forget about it
Real performance issues and bottlenecks; in our case, collateral effects.
About this second point, the problem is mainly divided into 3 areas:
Disk reads
Memory usage (buffer)
CPU usage.
About disk reads: the more data (in bytes) you retrieve, the busier the hard drive is going to be, slowing down any other activity using it. Reduce the size of the selected rows to avoid disk overhead.
About memory usage: MySQL manages an internal buffer that can get stuck in some situations. I don't know enough about it to give you a proper answer, but I know this is definitely something you should keep an eye on.
About CPU usage: basically the CPU will get busy when it
has to calculate (joins, preparing statements, arithmetic...)
has to do all the peripheral stuff: moving bytes from disk to memory, for instance.
Optimize your queries to reduce CPU overhead. (Sounds silly but, well, it always turns out to be the problem anyway...)
So, how do you know when there's a collateral effect? By profiling your hardware...
How to profile?
Absolute profiling: use SHOW ENGINE INNODB STATUS or SHOW PROFILE to get useful information about MySQL's main hard-drive, CPU and memory counters (a short sketch follows this list).
Relative profiling: use your favorite OS profiler. Under Windows, for instance, you can use the great perfmon.exe and watch the PRIVATE BYTES and VIRTUAL BYTES of the mysql process. I say relative because, after all, if a query is time-consuming on your computer, it might not be on a NASA-class system...
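A hedged JDBC sketch of the first option (SHOW PROFILE exists but is deprecated in recent MySQL versions in favour of performance_schema):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class QueryProfiler {
    // Rough per-query profiling: enable session profiling, run the query,
    // then read the per-stage timings for the most recent statement.
    static void profile(Connection conn, String query) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("SET profiling = 1");            // enable profiling for this session
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) { /* drain the result */ }
            }
            // Per-stage timings: sending data, sorting result, creating tmp table, ...
            try (ResultSet rs = stmt.executeQuery("SHOW PROFILE")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Status") + "\t" + rs.getString("Duration"));
                }
            }
        }
    }
}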
Hope it helps, regards.
This is a very general question, so giving a precise answer is difficult.
You can think of the database as a pool of shared resources, especially because the underlying hardware your database runs on has physical limits. Most often, the reason you see something like a select query causing a performance impact on other queries is that they're all competing for those underlying physical resources, like disk IO, RAM access or CPU time, and there isn't enough to go around.
So the actual results you will see depend heavily on your database's physical hardware and the configuration settings.
For instance in your select examples the variables might be: Is the data the query needs already in RAM? Can it look up the rows efficiently by an index? If it does have to do IO, how many other queries are asking to read data from disk? Are you using a secondary index and have to do multiple reads? Is the database doing read-ahead to buffer other pages? Is the query causing sequential or random io? Are any updates holding locks on the data? How much read IO can physical hardware support?
You would have to answer all those questions for all queries currently executing to know if they're going to affect the performance of other queries.
This is why DBAs exist. Busy databases are complex systems, and it's all about the interaction of a great many different operations, all with thousands of possible variables affecting them.
So what you generally do is optimize the things you can control as well as you know how (hardware, mysql configuration, schema and indexes) then start measuring the system as it runs to understand what is actually going on.
So in your case, I would say that it's infinitely more helpful to focus on simply optimizing your queries individually. The faster they execute, the fewer resources they are probably using and the less chance they will impact others. Then you learn to analyze the system. Just look at one thing that's slow and ask "why is this slow?" Then fix it. That's the optimization process.
However, in the first case you wrote, with SELECT ... FOR UPDATE, explicit locks can and will be big performance issues. Be careful with those.
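A hedged two-connection sketch of that FOR UPDATE blocking (the URL, credentials and the accounts table are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class ForUpdateDemo {
    // Two connections against a hypothetical table accounts(id, balance):
    // connection 1 holds a FOR UPDATE lock, so connection 2's UPDATE has to wait.
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/test";   // placeholder
        try (Connection c1 = DriverManager.getConnection(url, "user", "pw");
             Connection c2 = DriverManager.getConnection(url, "user", "pw")) {

            c1.setAutoCommit(false);
            try (Statement s1 = c1.createStatement()) {
                s1.executeQuery("SELECT balance FROM accounts WHERE id = 1 FOR UPDATE").close();
                // row 1 is now locked by connection 1 until commit/rollback

                Thread writer = new Thread(() -> {
                    try (Statement s2 = c2.createStatement()) {
                        long t0 = System.currentTimeMillis();
                        // blocks until c1 commits, or fails after innodb_lock_wait_timeout
                        s2.executeUpdate("UPDATE accounts SET balance = balance - 10 WHERE id = 1");
                        System.out.println("UPDATE waited " + (System.currentTimeMillis() - t0) + " ms");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                writer.start();

                Thread.sleep(5000);   // hold the lock for a while
                c1.commit();          // releasing the lock lets connection 2 proceed
                writer.join();
            }
        }
    }
}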
Read queries are only affected by the isolation levels of other queries. They themselves never block the table.
Isolation levels are essentially transactional safety modes. If another query that uses locking does not allow dirty reads, your reads will be held until the other query finishes writing or unlocks.
MVCC is a mechanism that allows databases to create a new version of the data when they need to update or delete, which means that when you start a read on the current version of the data, the data won't get tainted by future updates/deletes.
When you start a write on current data while that data is being read by another process, you're in fact writing the new version somewhere else and marking it as the newest version. In the end this means no blocking for the writing process (at least not because of the reading process).