MySQL count(*) super slow with concurrent queries

I have a script that tries to read all the rows from a table like this:
select count(*) from table where col1 = 'Y' or col1 is null;
col1 and col2 are not indexed, and this query usually takes ~20 seconds, but if someone is already running this query, it takes ages and gets blocked.
We have only around 100k rows in the table, and I tried it without the WHERE clause; it causes the same issue.
The table uses InnoDB, so it doesn't store the exact count, but I am curious whether there is any concurrency parameter I should look into. I am not sure if the absence of indexes on the table causes the issue, but it doesn't make sense to me.
Thanks!

If the columns are not indexed, MySQL has to read the entire table from disk to find your data. A single hard disk cannot handle concurrent read-intensive operations very well. You have to index.
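For example, an index on col1 lets the count be served from the index instead of a full table scan (a minimal sketch; the index name is made up, the table and column names are from the question):
ALTER TABLE `table` ADD INDEX idx_col1 (col1);
SELECT COUNT(*) FROM `table` WHERE col1 = 'Y' OR col1 IS NULL;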

It looks like your SELECT COUNT(*)... query is being serialized with other operations on your table. Unless you tell the MySQL server otherwise, your query will do its best to be very precise.
Try changing the transaction isolation level by issuing this command immediately before your query.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Setting this enables so-called dirty reads, which means you might not count everything in the table that changes during your operation. But that probably will not foul up your application too badly.
(Adding appropriate indexes is always a good idea, but not the cause of the problem you ask about.)

Related

How to select consistent data from multiple tables efficiently

I'm using MySQL 5.6. Let's say we have the following two tables:
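Roughly something like this (a sketch inferred from the columns used in the queries below; the real definitions may differ):
CREATE TABLE DataSet (
  dataset_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  md5sum     CHAR(32) NOT NULL,
  version    INT UNSIGNED NOT NULL
) ENGINE=InnoDB;
CREATE TABLE DataEntry (
  dataentry_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  dataset_id   BIGINT UNSIGNED NOT NULL,
  content      TEXT NOT NULL,
  KEY (dataset_id)
) ENGINE=InnoDB;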
Every DataSet has a huge number of child DataEntry records: 10,000, 100,000, or more. DataSet.md5sum and DataSet.version are updated in one transaction when child DataEntry records are inserted or deleted. A DataSet.md5sum is calculated over all of its children's DataEntry.content values.
In this situation, what's the most efficient way to fetch consistent data from those two tables?
If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERT / UPDATEs:
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000 -- I think the result of this query may be inconsistent with the md5sum fetched by the former query
I think I can get consistent data with one query as follows:
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000
But it produces a redundant result set filled with 10,000 or 100,000 duplicated md5sums, so I guess it's not efficient (EDIT: my concerns are high network bandwidth and memory consumption).
I think using a pessimistic read/write lock (SELECT ... LOCK IN SHARE MODE / FOR UPDATE) would be another option, but it seems like overkill. Are there any other, better approaches?
The join will ensure that the data returned is not affected by any updates that would have occurred between the two separate selects, since they are being executed as a single query.
When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
When you join the tables, you will get a "duplicate md5sum and version" because you are pulling the matching record for each item in the DataEntry table. It is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but depending upon the frequency of inserts/updates, without a transaction, you run the very slight risk of getting data that may be slightly off.
I would just go with the join. You can run EXPLAIN on your query from within MySQL, look at how the query is executed, and see any differences between the two approaches based upon your data and whether you have any indexes, etc.
Perhaps it would be more beneficial to run these groups of records into a staging table of sorts. Before processing, you could call a pre-processor function that takes a "snapshot" of the data about to be processed, putting a copy into a staging table. Then you could select just the version and md5sum alone, and then all of the records, as two different selects. Since these are copied into a separate staging table, you won't have to worry about immediate updates corrupting your session of processing. You could set up timed jobs to do this or have it as an on-demand call. Again though, this would be something you would need to research as the best approach given the hardware/network setup you are working with, and any job scheduling software you have available to you.
Use this pattern:
START TRANSACTION;
SELECT ... FOR UPDATE; -- this locks the row
...
UPDATE ...
COMMIT;
(and check for errors after every statement, including COMMIT.)
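Applied to the tables in this question, a minimal sketch might look like the following; since the goal here is a consistent read, the locking SELECT on the parent row stands in for the UPDATE step and blocks writers that modify md5sum/version together with the children:
START TRANSACTION;
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000 FOR UPDATE;
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000;
COMMIT;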
"100000" is not "huge", but "BIGINT" is. Recomment INT UNSIGNED instead.
For an MD5, make sure you are not using utf8: CHAR(32) CHARACTER SET ascii. This goes for any other hex strings.
Or, use BINARY(16) for half the space. Then use UNHEX(md5...) when inserting, and HEX(...) when fetching.
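A sketch of that layout, assuming the md5sum column is declared BINARY(16) (values made up):
INSERT INTO DataSet (dataset_id, md5sum, version)
VALUES (1000, UNHEX(MD5('some content')), 1);
SELECT HEX(md5sum) AS md5sum, version FROM DataSet WHERE dataset_id = 1000;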
You are concerned about bandwidth, etc. Please describe your client (PHP? Java? ...). Please explain how much (100K rows?) needs to be fetched to re-do the MD5.
Note that there is an MD5() function in MySQL. If each of your items had an MD5, you could take the MD5 of the concatenation of those -- and do it entirely in the server; no bandwidth needed. (Be sure to increase group_concat_max_len.)
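A sketch of that server-side calculation, assuming the dataset's MD5 is defined over the children's content concatenated in dataentry_id order:
SET SESSION group_concat_max_len = 64 * 1024 * 1024;
SELECT MD5(GROUP_CONCAT(content ORDER BY dataentry_id SEPARATOR ''))
FROM DataEntry WHERE dataset_id = 1000;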

Prepared Statements, MyISAM with 100 million records and Caching on MySQL

I have 10 large read-only tables. I make a lot of queries that have the same format with different parameters, for example:
SELECT COUNT(*) FROM table1 WHERE x='SOME VARIABLE';
SELECT COUNT(*) FROM table1 WHERE x='SOME OTHER VARIABLE';
...
SELECT COUNT(*) FROM table2 WHERE x='' AND y='';
SELECT COUNT(*) FROM table2 WHERE x='' AND y='';
...
SELECT COUNT(*) FROM table3 WHERE x='' AND y='' AND z='';
The variables used are different for each query, so I almost never execute the same query twice. Is it correct that query caching and row caching on the MySQL side would be wasteful and should be disabled? Table caching seems like it would be a good thing.
On the client side, I am using prepared statements, which I assume is good. If I enable prepared statement caching (via Slick), though, won't that hurt my performance since the parameters are so variable? Is there anything else I can do to optimize my performance?
Should auto-commit be off since I'm only doing selects and will never need to rollback?
Given that you are using the MyISAM engine and have tables with hundreds of millions of active rows, I would worry less about how you use the cache (given the low complexity of your queries, this is most likely the smallest problem) and focus more on the proper organization of the data within the database:
Prepared statements are totally OK. It may be helpful not to prepare the statement over and over again; instead, just reuse the existing prepared statement (some environments even allow storing prepared statements on the client side) with a new set of parameter values. However, this mainly saves the time spent parsing the statement, and as the complexity of your query is quite low, it can be assumed that this won't be the biggest time consumer.
Key caching (also called key buffering), however, is - as the name already suggests - key for your game! Most MySQL configurations suffer greatly from wrong values in that area, as the buffers are way too small. In a nutshell, key caching makes sure that the references to the data (for instance, in your indexes) can be accessed in main memory. If they are not in memory, they need to be retrieved from disk, which is slow. To see if your key cache is efficient, you should watch the key hit ratio when your system is under load. This is explained in detail at https://dba.stackexchange.com/questions/58182/tuning-key-reads-in-mysql.
If the caches become large or are displaced frequently due to the usage of other tables, it may be helpful to create dedicated key caches for your tables (see the sketch below). For details, see https://dev.mysql.com/doc/refman/5.5/en/cache-index.html
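A sketch of a dedicated key cache (the cache name, size, and table names are made up):
SET GLOBAL hot_cache.key_buffer_size = 256 * 1024 * 1024;
CACHE INDEX table1, table2 IN hot_cache;
LOAD INDEX INTO CACHE table1, table2;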
If you always access large portions of your table via the same attributes, it may make sense to change the ordering of the data storage on the disk by using ALTER TABLE ... ORDER BY expr1, expr2, .... For details on this approach see also https://dev.mysql.com/doc/refman/5.5/en/optimizing-queries-myisam.html
Avoid using variable-length columns such as VARCHAR, BLOB, or TEXT. They might help save some space, but comparing their values in particular can become time-consuming. Note, however, that even a single column of such a type will make MySQL switch the table to dynamic row format.
Run ANALYZE TABLE after huge data changes to keep the statistics up to date. If you have deleted huge areas, it might help to run OPTIMIZE TABLE, which makes sure there are no large gaps left that need to be skipped when reading.
Use INSERT DELAYED to write changes asynchronously if you do not need the reply. This will greatly improve your performance if there are other SELECT statements running at the same point in time.
Alternatively, if you need the reply, you may use INSERT LOW_PRIORITY. Then the execution of concurrent SELECTs is preferred over your INSERT. This may help ease the pain of the fact that MyISAM only supports table-level locking.
You may try to provide index hints to your queries, especially if there are multiple indexes on your table that overlap each other. You should try to use the index that has the smallest width but still covers the most attributes.
However, please note that in your case the impact should be quite small: you are not ordering, grouping, or joining, so the query optimizer should already be very good at finding the best one. Simply check by using EXPLAIN on your SELECT statement to see whether the choice of index is reasonable (see the sketch below).
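A sketch of both, with made-up values and a made-up index name:
EXPLAIN SELECT COUNT(*) FROM table2 WHERE x = 'a' AND y = 'b';
SELECT COUNT(*) FROM table2 USE INDEX (idx_x_y) WHERE x = 'a' AND y = 'b';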
In short: prepared statements are totally OK, key caching is key, and there are some other things you can do to help MySQL get along with the whole bulk of data.

InnoDB deadlock with lock modes S and X

In my application, I have two queries that occur from time to time (from different processes), that cause a deadlock.
Query #1
UPDATE tblA, tblB SET tblA.varcharfield=tblB.varcharfield WHERE tblA.varcharfield IS NULL AND [a few other conditions];
Query #2
INSERT INTO tmp_tbl SELECT * FROM tblA WHERE [various conditions];
Both of these queries take a significant time, as these tables have millions of rows. When query #2 is running, it seems that tblA is locked in mode S. It seems that query #1 requires an X lock. Since this is incompatible with an S lock, query #1 waits for up to 30 seconds, at which point I get a deadlock:
Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
Based on what I've read in the documentation, I think I have a couple options:
Set an index on tblA.varcharfield. Unfortunately, I think that this would require a very large index to store the field of varchar(512). (See edit below... this didn't work.)
Disable locking with SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED. I don't understand the implications of this and am worried about corrupt data. I don't use explicit transactions in my application currently, but I might at some point in the future.
Split my time-consuming queries into small pieces so that they can queue and run in MySQL without reaching the 30-second timeout. This wouldn't really fix the heart of the issue, and I am concerned that when my database servers get busy that the problem will occur again.
Simply retrying queries over and over again... not an option I am hoping for.
How should I proceed? Are there alternate methods I should consider?
EDIT: I have tried setting an index on varcharfield, but the table is still locking. I suspect that the locking happens when the UPDATE portion is actually executing. Are there other suggestions to get around this problem?
If we assume that indexing varcharfield takes a lot of disk space and adding a new column will not hit you hard, I can suggest the following approach (see the sketch after this list):
create a new field with datatype TINYINT;
index it;
this field will store 0 if varcharfield is NULL and 1 otherwise;
rewrite the first query to do the update relying on the new field; in this case it will not cause entire-table locking.
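A sketch of this approach (column and index names are made up; the remaining conditions from the original UPDATE, including the join between tblA and tblB, go where indicated):
ALTER TABLE tblA
  ADD COLUMN has_varchar TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_has_varchar (has_varchar);
UPDATE tblA SET has_varchar = 1 WHERE varcharfield IS NOT NULL;
UPDATE tblA, tblB
SET tblA.varcharfield = tblB.varcharfield, tblA.has_varchar = 1
WHERE tblA.has_varchar = 0 /* AND the other conditions, including the tblA/tblB join condition */;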
Hope it helps.
You can index only a prefix of the varchar column; it will still work and will require less space. Just specify the index length:
CREATE INDEX someindex ON sometable (varcharcolumn(32))
I was able to solve the issue by adding explicit LOCK TABLES statements around both queries. This turned out to be a better solution, since each query affects so many records and both of these are background processes. They now wait on each other.
http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html
While this is an okay solution for me, it obviously isn't the answer for everyone. Locking with WRITE means that you cannot READ. Only a READ lock will allow others to READ.
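A sketch of the explicit-locking approach (not the exact statements used; conditions abbreviated as in the question):
LOCK TABLES tblA WRITE, tblB READ;
UPDATE tblA, tblB SET tblA.varcharfield = tblB.varcharfield
WHERE tblA.varcharfield IS NULL /* AND a few other conditions */;
UNLOCK TABLES;
LOCK TABLES tmp_tbl WRITE, tblA READ;
INSERT INTO tmp_tbl SELECT * FROM tblA WHERE 1 /* various conditions */;
UNLOCK TABLES;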

Dummies guide to locking in innodb

The typical documentation on locking in InnoDB is way too confusing. I think it would be of great value to have a "dummies' guide to InnoDB locking".
I will start, and I will gather all responses as a wiki:
The column needs to be indexed before row-level locking applies.
EXAMPLE: DELETE FROM tablename WHERE column1 = 10; will lock up the whole table unless column1 is indexed.
Here are my notes from working with MySQL support on a recent, strange locking issue (version 5.1.37):
All rows and index entries traversed to get to the rows being changed will be locked. It's covered at:
http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html
"A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned. ... If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked, which in turn blocks all inserts by other users to the table."
That is a MAJOR headache if true.
It is. A workaround that is often helpful is to do:
UPDATE whichevertable SET whatever = something WHERE primarykey IN (SELECT primarykey FROM whichevertable WHERE constraints ORDER BY primarykey);
The inner select doesn't need to take locks, and the update will then have less work to do for the updating. The ORDER BY clause ensures that the update is done in primary key order to match InnoDB's physical order, the fastest way to do it.
Where large numbers of rows are involved, as in your case, it can be better to store the select result in a temporary table with a flag column added. Then select from the temporary table where the flag is not set to get each batch. Run updates with a limit of, say, 1000 or 10000 and set the flag for the batch after the update. The limits will keep the amount of locking to a tolerable level, while the select work will only have to be done once. Commit after each batch to release the locks (a sketch follows below).
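A rough sketch of that batching approach (all names are made up; replace the WHERE clause with the real constraints):
CREATE TABLE work_queue (
  pk   INT UNSIGNED NOT NULL PRIMARY KEY,
  done TINYINT NOT NULL DEFAULT 0,
  KEY (done)
) ENGINE=InnoDB;
INSERT INTO work_queue (pk)
SELECT primarykey FROM whichevertable WHERE some_condition = 1;
-- repeat until no rows with done = 0 remain; each batch holds locks only briefly
START TRANSACTION;
UPDATE work_queue SET done = 2 WHERE done = 0 ORDER BY pk LIMIT 1000;
UPDATE whichevertable t JOIN work_queue q ON q.pk = t.primarykey
SET t.whatever = 'something'
WHERE q.done = 2;
UPDATE work_queue SET done = 1 WHERE done = 2;
COMMIT;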
You can also speed this work up by doing a select sum of an unindexed column before doing each batch of updates. This will load the data pages into the buffer pool without taking locks. Then the locking will last for a shorter timespan because there won't be any disk reads.
This isn't always practical but when it is it can be very helpful. If you can't do it in batches you can at least try the select first to preload the data, if it's small enough to fit into the buffer pool.
If possible use the READ COMMITTED transaction isolation mode. See:
http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html
To get that reduced locking requires the use of row-based binary logging (rather than the default statement-based binary logging).
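A sketch of the settings involved (session scope shown; the global or my.cnf equivalents work too, and changing binlog_format requires the appropriate privileges):
SET SESSION binlog_format = 'ROW';
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;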
Two known issues:
Subqueries can sometimes be less than ideally optimised. In this case it was an undesirable dependent subquery; the suggestion I made to use a subquery turned out to be unhelpful compared to the alternative because of that.
Deletes and updates do not have the same range of query plans as SELECT statements, so sometimes it's hard to properly optimise them without measuring the results to work out exactly what they are doing.
Both of these are gradually improving. This bug is one example where we've just improved the optimisations available for an update, though the changes are significant and it's still going through QA to be sure it doesn't have any great adverse effects:
http://bugs.mysql.com/bug.php?id=36569

What does MySQL do if you attempt to update a table that is being queried?

I have a very slow query that I need to run on a MySQL database from time to time.
I've discovered that attempts to update the table that is being queried are blocked until the query has finished.
I guess this makes sense, as otherwise the results of the query might be inconsistent, but it's not ideal for me, as the query is of much lower importance than the update.
So my question really has two parts:
Out of curiosity, what exactly does MySQL do in this situation? Does it lock the table for the duration of the query? Or try to lock it before the update?
Is there a way to make the slow query not blocking? I guess the options might be:
Kill the query when an update is needed.
Run the query on a copy of the table as it was just before the update took place
Just let the query go wrong.
Anyone have any thoughts on this?
It sounds like you are using a MyISAM table, which uses table level locking. In this case, the SELECT will set a shared lock on the table. The UPDATE then will try to request an exclusive lock and block and wait until the SELECT is done. Once it is done, the UPDATE will run like normal.
MyISAM Locking
If you switched to InnoDB, then your SELECT will set no locks by default. There is no need to change transaction isolation levels as others have recommended (repeatable read is default for InnoDB and no locks will be set for your SELECT). The UPDATE will be able to run at the same time. The multi-versioning that InnoDB uses is very similar to how Oracle handles the situation. The only time that SELECTs will set locks is if you are running in the serializable transaction isolation level, you have a FOR UPDATE/LOCK IN SHARE MODE option to the query, or it is part of some sort of write statement (such as INSERT...SELECT) and you are using statement based binary logging.
InnoDB Locking
For the purposes of the select statement, you should probably issue a:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
command on the connection, which causes the subsequent select statements to operate without locking.
Don't use SELECT ... FOR UPDATE, as that definitely locks the table rows affected by the select statement.
The full list of MySQL transaction isolation levels is in the docs.
First of all, you need to know what engine you're using (MyISAM or InnoDB).
This is clearly a transaction problem.
Take a look at section 13.4.6, SET TRANSACTION Syntax, in the MySQL manual.
UPDATE LOW_PRIORITY ... may be helpful - the MySQL docs aren't clear whether this would let the user requesting the update continue, with the update happening when it can (which is what I think happens), or whether the user has to wait (which would be worse than at present ...), and I can't remember.
What table types are you using? If you are on MyISAM, switching to InnoDB (if you can - it has no full-text indexing) opens up more options for this sort of thing, as it supports transactional features and row-level locking.
I don't know MySQL, but it sounds like a transaction problem.
You should be able to set the transaction type to dirty read for your select query.
That won't necessarily give you correct results, but it shouldn't be blocked.
Better would be to make the first query go faster. Do some analysis and check whether you can speed it up with correct indexing and so on.