I am running a mysql server on a mac pro, 64GB of ram, 6 cores. Table1 in my schema has 330 million rows. Table2 has 65,000 rows. (I also have several other tables with a combined total of about 1.5 billion rows, but they are not being used in the operation I am attempting, so I don't think they are relevant).
I am trying to do what I would have thought was a relatively simple update statement (see below) to bring some data from Table2 into Table1. However, I am having a terrible time with mysql blowing through my system ram, forcing me into swaps, and eventually freezing up the whole system so that mysql becomes unresponsive and I need to restart my computer. My update statement is as below:
UPDATE Table1, Table2
SET
Table1.Column1 = Table2.Column1,
Table1.Column2 = Table2.Column2,
Table1.Column3 = Table2.Column3,
Table1.Column4 = Table2.Column4
WHERE
(Table1.Column5 = Table2.Column5) AND
(Table1.Column6 = Table2.Column6) AND
(Table1.Column7 = Table2.Column7) AND
(Table1.id between 0 AND 5000000);
Ultimately, I want to perform this update for all 330 million rows in Table1. I decided to break it up into batches of 5 million lines each though because
(a) I was getting problems with exceeding lock size and
(b) I thought it might help with my problems of blowing through ram.
Here are some more relevant details about the situation:
I have created indexes for both Table1 and Table2 over the combination of Column5, Column6, Column7 (the columns whose values I am matching on).
Table1 has 50 columns and is about 60 GB total.
Table2 has 8 columns and is 3.5 MB total.
I know that some people might recommend foreign keys in this situation, rather than updating table1 with info from table2, but (a) I have plenty of disk space and don't really care about using it to maximum efficiency (b) none of the values in any of these tables will change over time and (c) I am most concerned about speed of queries run on table1, and if it takes this long to get info from table2 to table1, I certainly don't want to need to repeat the process for every query I run on table1.
In response to the problem of exceeding maximum lock table size, I have experimented with increasing innodb_buffer_pool_size. I've tried a number of values. Even at something as low as 8 GB (i.e. 1/8th of my computer's ram, and I'm running almost nothing else on it while doing this), I am still having this problem of the mysqld process using up basically all of the ram available on the system and then starting to pull ram allocation from the operating system (i.e. my kernel_task starts showing up as using 30GB of ram, whereas it usually uses around 2GB).
The problem with the maximum locks seems to have been largely resolved; I no longer get this error, though maybe that's just because now I blow through my memory and crash before I can get there.
I've experimented with smaller batch sizes (1 million rows, 100,000 rows). These seem to work maybe a bit better than the 5 million row batches, but they still generally have the same problems, maybe only a bit slower to develop. And, performance seems terrible - for instance, at the rate I was going on the 100,000 batch sizes, it would have taken about 7 days to perform this update.
The tables both use InnoDB
I generally set SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; although I don't know if it actually helps or not (I am the only user accessing this DB in any way, so I don't really care about locking and would do away with it entirely if I could)
I notice a lot of variability in the time it takes batches to run. For instance, on the 1 million row batches, I would observe times anywhere between 45 seconds and 20 minutes.
When I tried running something that just found the matching rows and then put only two column values for those into a new table, I got much more consistent times (about 2.5 minutes per million lines). Thus, it seems that my problems might somehow stem from the fact maybe that I'm updating values in the table that I am doing the matching on, even though the columns that I'm updating are different from those I am matching on.
The columns that I am matching on and updating just contain INT and CHAR types, none with more than 7 characters max.
I ran a CHECK TABLE diagnostic and it came back ok.
Overall, I am tremendously perplexed why this would be so difficult. I am new to mysql and databases in general. Since Table2 is so small, I could accomplish this same task, much faster I believe, in python using a dictionary lookup. I would have thought though that databases would be able to handle this better, since handling and updating big datasets is what they are designed for.
I ran some diagnostics on the queries using Mysql workbench and confirmed that there are NOT full table scans being performed.
It really seems something must be going wrong here though. If the system has 64 GB of ram, and that is more than the entire size of the two tables combined (though counting index size it is a bit more than 64 GB for the two tables), and if the operation is only being applied on 5 million out of 330 million rows at a time, it just doesn't make sense that it should blow out the ram.
Therefore, I am wondering:
Is the syntax of how I am writing this update statement somehow horribly bad and inefficient such that it would explain the horrible performance and problems?
Are there some kind of parameters beyond the innodb_buffer_pool_size that I should be configuring, either to put a firmer cap on the ram mysql uses or to get it to more effectively use resources?
Are there other sorts of diagnostics I should be running to try to detect problems with my tables, schema, etc.?
What is a "reasonable" amount of time to expect an update like this to take?
So, after consulting with several people knowledgeable about such matters, here are the solutions I came up with:
I brought my innodb_buffer_pool_size down to 4GB, i.e. 1/16th of my total system memory. This finally seemed to be enough to reliably stop MySQL from blowing through my 64GB of RAM.
I simplified my indexes so that they only contained exactly the columns I needed, and made sure that all indexes I was using were small enough to fit into RAM (with plenty of room to spare for other uses of RAM by MySQL as well).
I learned to accept that MySQL just doesn't seem to be built for particularly large data sets (or, at least not on a single machine, even if a relatively big machine like what I have). Thus, I accepted that manually breaking up my jobs into batches would often be necessary, since apparently the machinery of MySQL doesn't have what it takes to make the right decisions about how to break a job up on its own, in order to be conscientious about system resources like RAM.
Sometimes, when doing jobs along the lines of this, or in general, on my moderately large datasets, I'll use MySQL to do my updates and joins. Other times, I'll just break the data up into chunks and then do the joining or other such operations in another program, such as R (generally using a package like data.table that handles largish data relatively efficiently).
I was also advised that alternatively, I could use something like Pig of Hive on a Hadoop cluster, which should be able to better handle data of this size.
Related
I have several tables with ~15 million rows. When I create an idex on the id column and then I execute a simple query like SELECT * FROM my_table WHERE id = 1 I retrieve the data within one second. But then, after a few minutes, if I execute the query with a different id it takes over 15 seconds.
I'm sure it is not the query cache because I'm trying different ids all the time to make sure I'm not retrieving from the cache. Also, I used EXPLAIN to make sure the index it's being used.
The specs of the server are:
CPU: Intel Dual Xeon 5405 Harpertown 2.0Ghz Quad Core
RAM: 8GB
Hard drive 2: 146GB SAS (15k rpm)
Another thing I noticed is that if I execute REPAIR TABLE my_table the queries become within one second again. I assume something is being cached, either the table or the index. If so, is there any way to tell MySQL to keep it cached. Is it normal, given the specs of the server, to take around 13 seconds on an indexed table? The index is not unique and each query returns around 3000 rows.
NOTE: I'm using MyISAM and I know there won't be any write in these tables, all the queries will be to read data.
SOLVED: thank you for your answers, as many of you pointed out it was the key_buffer_size.I also reordered the tables using the same column as the index so the records are not scattered, now I'm executing the queries consistently under 1 second.
Please provide
SHOW CREATE TABLE
SHOW VARIABLES LIKE '%buffer%';
Likely causes:
key_buffer_size (when using MyISAM) is not 20% of RAM; or innodb_buffer_pool_size is not 70% of available RAM (when using InnoDB).
Another query (or group of queries) is coming in and "blowing out the cache" (key_buffer or buffer_pool). Look for such queries).
When using InnoDB, you don't have a PRIMARY KEY. (It is really important to have such.)
For 3000 rows to take 15 seconds to load, I deduce:
The cache for the table (not necessarily for the index) was blown out, and
The 3000 rows were scattered around the table (hence fetching one row does not help much in finding subsequent rows).
Memory allocation blog: http://mysql.rjweb.org/doc.php/memory
Is it normal, given the specs of the server, to take around 13 seconds on an indexed table?
The high variance in response time indicates that something is amiss. With only 8 GB of RAM and 15 million rows, you might not have enough RAM to keep the index in memory.
Is swap enabled on the server? This could explain the extreme jump in response time.
Investigate the memory situation with a tool like top, htop or glances.
There are 10 InnoDB partitioned tables. MySQL is configured with option innodb-file-per-table=1 (innodb-file per table/partition - for some reasons). Tables size is abount 40GB each. They contains statictics data.
During normal operation, the system can handle the load. The accumulated data is processed every N minutes. However, if for some reason, there was no treatment for more than 30 minutes (eg, maintenance of the system - it is rare, but once a year is necessary to make changes), begin to lock timeout.
I will not tell you how to come to such an architecture, but it is the best solution - way was long.
Đ•ach time, making changes requires more and more time. Today, for example, a simple ALTER TABLE took 2:45 hours. This is unacceptable.
So, as I said, processing the accumulated data requires a lot of resources and SELECT statements are beginning to return lock timeout errors. Of course, the tables in the query are not involved, and the work goes to the results of queries to them. Total size of these 10 tables is about 400GB, and a few dozen small tables, the total size of which is comparable to (and maybe not yet) to the size of an big table. Problems with small tables there.
My question is: how can I solve the problem with a lock timeout error? A server is not bad - 8 core xeon, 64 RAM. And this is only the database server. Of course, the entire system is not located on the same machine.
There is an only reason why I get this errors: on data transfrom process from big tables to small.
Any ideas?
I took over a project and have 2 MyISAM tables.
table1 with approx. 1M rows, and
table2 with approx. 100K rows.
In the project these tables are accessed often, and at first it seems ok.
After I installed the project on a Windows 8.1 for local development I found that every day, the first time I access the site, my query takes 14 seconds. A bit too much.
Afterwards is less than 0.1 second.
Now, since on dev this accumulated with another query runs into a timeout-exception for php, it got me concerned about whether it's recommended to do anything about it or not. On production it seems not to occur (or hard to reproduce).
I heard of things like warm cache or optimize query but don't know what is meant by that.
What do experts like you do in this case?
I had another question set up here trying to see whether I can optimize the query.
Changing to InnoDB doesn't seem to have an impact.
The "first" time you run a query, two things may or may not happen:
Lots of disk I/O may be done to fetch the index blocks and/or data blocks from disk. (If other queries happened to have fetched those blocks, the blocks may be cached already.) (14s vs 0.1s is more than I usually see for this cold/warm cache difference.)
If the "Query cache" was on, the first SELECT and its resultset were stored in the QC. The second call may have found it there and returned the result almost instantly. (Usually this is ~1ms, not the 100ms you mentioned.) The QC can be bypassed for a single query by saying SELECT SQL_NO_CACHE ....
Since it is annoying you daily, you may as well go through the exercise of trying to optimize the query. If the tables are growing daily, it may get slower and slower over time. Note that if production needs to be restarted for any reason, that query may timeout on it. So, yes, try to optimize it.
A million rows is beginning to be "big".
The characteristics of this indicate that you are I/O-bound only initially. So it does not indicate that key_buffer_size and innodb_buffer_pool_size are too low.
If you want to discuss the performance of a particular query, start a new thread and provide SHOW CREATE TABLE and EXPLAIN SELECT ....
The Situation:
I use a (php) cronjob to keep my database up-to-date. the affected table contains about 40,000 records. basically, the cronjob deletes all entries and inserts them afterwards (with different values ofc). I have to do it this way, because they really ALL change, because they are all interrelated.
The Problem:
Actually, everything works fine. The cronjob is doin' his job within 1.5 to 2 seconds (again, for about 40k inserts - i think this is adequate). MOSTLY.. But sometimes, the query takes up to 60, 90 or even 120 seconds!
I indexed my database. And I think query is good working, due to the fact it only needs 2 seconds mots of the time. I close the connection via mysql_close();
Do you have any ideas? If you need more information please tell me.
Thanks in advance.
Edit: Well, it seems like there was no problem with the inserts. it was a complex SELECT query, that made some trouble. Tho, thanks to everyone who answered!
[Sorry, apparently I haven't mastered the formatting yet]
From what I read, I can conclude that your cronjob is using bulk-insert statements. If you know when cronjob works, I suggest you to start a Database Engine Tuning Advisor session and see what other processes are running while the cronjob do its things. A bulk-insert has some restrictions with the number of fields and the number of rows at once. You could read the subtitles of this msdn http://technet.microsoft.com/en-us/library/ms188365.aspx
Performance Considerations
If the number of pages to be flushed in a
single batch exceeds an internal threshold, a full scan of the buffer
pool might occur to identify which pages to flush when the batch
commits. This full scan can hurt bulk-import performance. A likely
case of exceeding the internal threshold occurs when a large buffer
pool is combined with a slow I/O subsystem. To avoid buffer
overflows on large machines, either do not use the TABLOCK hint (which
will remove the bulk optimizations) or use a smaller batch size
(which preserves the bulk optimizations). Because computers vary, we
recommend that you test various batch sizes with your data load to
find out what works best for you.
I run into performance issues with my web application. Found out that bottleneck is db. App is running on LAMP server(VPS) with 4 CPU and 2GB RAM.
After insertion of new record into DB (table with around 100.000 records) select queries significantly slows down for a while (sometimes several for minutes). I thought that problem is reindexing, but there is practicly no activity at VPS after insert. There are plenty of memory left, no need for swapping. CPU is idle.
Truth is, selects are quite complex:
SELECT COUNT(A.id), B.title FROM B JOIN A .... WHERE ..lot of stuff..
Both A and B has about 100K records. A has many columns, B only few but it is tree structure represented by nested set. B doesnt change very often, but A does. WHERE conditions are mostly covered by indexes. There are usually about 10-30 rows in result set.
Are there any optimizations I could perform?
You might want to include your "lot of stuff"... you could be doing 'like' comparisons or joining on unindexed varchar columns :)
you'll also need to look at indexing columns that are used heavily.
First thing is: DO NOT trust any CPU/RAM etc. measurements inside a VPS - they can be wrong since they don't take into account what is going on on the machine (in other VPS)!
As for the performance:
Check the query plans for all your SQL statements... use a profiler on the app itself and see where the bottlenecks are...
Another point is to check the configuration of your MySQL DB... is there any replication going on (might cause a slowdon too) ? Does the DB have enough RAM ? Is the DB on a different machine/VPS or in the same VPS ?