Select only one table row on high parallel connections - mysql

I'm looking for a way to select one table row exclusively for one thread. I've written a crawler that runs about 50 parallel processes. Every process has to take one row out of a table and process it.
CREATE TABLE `crawler_queue` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`class_id` tinyint(3) unsigned NOT NULL,
`server_id` tinyint(3) unsigned NOT NULL,
`proc_id` mediumint(8) unsigned NOT NULL,
`prio` tinyint(3) unsigned NOT NULL,
`inserted` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `proc_id` (`proc_id`),
KEY `app_id` (`app_id`),
KEY `crawler` (`class_id`,`prio`,`proc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
Now my processes do the following:
start DB transaction
do a select like SELECT * FROM crawler_queue WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE
then update this row with UPDATE crawler_queue SET server_id=1,proc_id=1376 WHERE id=23892
commit transaction
This should ensure that no other process can grab a row that is already being processed. Doing an EXPLAIN on the select shows
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE crawler_queue ref proc_id,crawler proc_id 3 const 617609 Using where
But the processes seem to cause too high parallelism, because sometimes I can see two types of errors/warnings in my log (every 5 minutes or so):
mysqli::query(): (HY000/1205): Lock wait timeout exceeded; try restarting transaction (in /var/www/db.php line 81)
mysqli::query(): (40001/1213): Deadlock found when trying to get lock; try restarting transaction (in /var/www/db.php line 81)
My question is: can anybody point me in the right direction to minimize these locking problems? (in production state, the parallelism would be 3-4 times higher than now, so I assume, that there would be much more locking problems)
I modified the SELECT to use the crawler index via the hint USE INDEX(crawler). My only remaining problem is lock wait timeouts (the deadlocks disappeared).
EXPLAIN with USE INDEX() shows now (no. of rows is higher, because table contains more data now):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE crawler_queue ref proc_id,crawler crawler 5 const,const,const 5472426 Using where

Your EXPLAIN report shows that you're using only the single-column index proc_id, and the query has to examine over 600K rows. It would probably be better if the optimizer chose the crawler index.
InnoDB may be locking all 600K rows, not just the rows that match the full condition in your WHERE clause. InnoDB locks all the examined rows to make sure concurrent changes don't get written to the binlog in the wrong order.
The solution is to use an index to narrow the range of examined rows. This will probably help you not only to find the rows more quickly, but also to avoid locking large ranges of rows. The crawler index should help here, but it's not immediately clear why it's not using that index.
You might have to run ANALYZE TABLE to update InnoDB's table statistics, so that the optimizer knows about the crawler index before it considers that index in its plan. ANALYZE TABLE is an inexpensive operation.
The other option is to use an index hint:
SELECT * FROM crawler_queue USE INDEX(crawler) ...
This tells the optimizer to use that index and not to consider other indexes for this query. I prefer to avoid index hints, because the optimizer is usually able to make good decisions on its own, and using the hint in code means I may be forcing the optimizer not to consider an index I create in the future, which it would otherwise choose.
With more explanation, it's now clear you're using your RDBMS as a FIFO. This is not an efficient use of an RDBMS. There are message queue technologies for this purpose.
See also:
https://stackoverflow.com/a/13851231/20860
The Database as Queue Anti-Pattern.

From what I can tell, the problem you're facing is that two threads are vying for the same row in the table and they can't both have it. But there isn't any elegant way for the database to say "no, you can't have that one, find another row", and thus you get errors. This is called resource contention.
When you're doing highly parallel work like this one of the easiest ways to reduce contention-based problems is to completely eliminate the contention by inventing a way for all the threads to know which rows they're supposed to work on ahead of time. Then they can lock without having to contend for the resources and your database doesn't have to resolve the contention.
How best to do this? Usually people pick some kind of thread-id scheme and use modulo arithmetic to determine which threads get which rows. If you have 10 threads, then thread 0 gets rows 0, 10, 20, 30, etc. Thread 1 gets rows 1, 11, 21, 31, etc.
In general if you have NUM_THREADS then each of your threads would pick the ids which are THREAD_ID + i*NUM_THREADS from the database and work on those.
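As a minimal sketch of the modulo scheme described above (the function name `ids_for_thread` is made up for illustration and is not part of the original answer), each worker keeps only the ids whose remainder modulo the thread count equals its own thread id. In SQL the same filter would be a condition such as `id % NUM_THREADS = THREAD_ID`.

```python
def ids_for_thread(thread_id, num_threads, row_ids):
    """Return the row ids assigned to one worker under the modulo scheme.

    thread_id must be in the range 0 .. num_threads - 1.
    """
    if not 0 <= thread_id < num_threads:
        raise ValueError("thread_id must be in 0 .. num_threads - 1")
    # A row belongs to this worker when id % num_threads == thread_id,
    # which is the same set as THREAD_ID + i * NUM_THREADS.
    return [row_id for row_id in row_ids if row_id % num_threads == thread_id]


if __name__ == "__main__":
    print(ids_for_thread(1, 10, range(40)))
```

Because no two workers ever share an id, they never compete for the same row lock.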
We have introduced a problem in that threads may stall or die, and you could end up with rows in the database which never get touched. There are several solutions to that problem, one of which is to run a "cleanup" once most/all of your threads have finished where all the threads grab piecemeal whatever rows they can and crawl them until there are no un-crawled URLs left. You could get more sophisticated and have a few cleanup threads constantly running, or have each thread occasionally perform cleanup duties, etc.

A better solution would be to do the update and skip the select entirely. Then you can use last_insert_id() to pick up the updated item. This should allow you to skip explicit locking completely, while performing the update at the same time. Once the record is updated, you can start processing it, since it will never be selected again by the same query: it no longer matches all of the initial conditions.
I think this should help you alleviate the problems related to locking and should allow you to run as many processes as you want in parallel.
PS: Just to clarify, I am talking about update ... limit 1 to make sure you only update one row.
EDIT: The solution pointed out below is the correct one.
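One way this might look in practice is sketched below. It assumes a Python DB-API cursor with the `%s` parameter style (as used by common MySQL drivers) and relies on MySQL's `LAST_INSERT_ID(expr)` form to remember the id of the row that was updated. The function name `claim_one_row` and the exact statement are illustrative assumptions; the answer above does not spell out the SQL.

```python
# Claim one unprocessed row in a single statement, without SELECT ... FOR UPDATE.
# "id = LAST_INSERT_ID(id)" leaves id unchanged but stores it for this connection.
CLAIM_SQL = (
    "UPDATE crawler_queue "
    "SET server_id = %s, proc_id = %s, id = LAST_INSERT_ID(id) "
    "WHERE class_id = %s AND prio = %s AND proc_id = 0 "
    "ORDER BY id LIMIT 1"
)


def claim_one_row(cursor, server_id, proc_id, class_id, prio):
    """Return the id of the claimed row, or None if the queue had no match."""
    cursor.execute(CLAIM_SQL, (server_id, proc_id, class_id, prio))
    if cursor.rowcount == 0:
        # Nothing was updated, so LAST_INSERT_ID() would be stale; do not read it.
        return None
    cursor.execute("SELECT LAST_INSERT_ID()")
    return cursor.fetchone()[0]
```

The row count check matters: when no row matches, `LAST_INSERT_ID()` still returns whatever value the connection stored earlier.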

Related

Mysql : performance of update with primary key and extra wheres, will it be slower?

Let's say I have a table as follows
CREATE TABLE `Foo` (
`id` int(10) unsigned NOT NULL,
`bar1` int(10) unsigned NOT NULL,
`bar2` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
);
And I have two queries:
update Foo set bar1=10 where id=5000;
update Foo set bar1=10 where id=5000 and bar1=0;
My guess is that the second query will not run slower than the first query, but I need confirmation from someone who knows for certain.
(The reason I want to do the second is that when multiple clients select from the table first and then update it simultaneously, only one client will be able to update successfully.)
1. Find the row. The Optimizer will look at the possible indexes (just the PK) and decide to start with id=5000. There is at most one such row.
2. (for the second case) verify that bar1=0. If not, the query is finished.
3. Check to see if there is anything to change -- is bar1 already set to 10? If so, finish.
4. Do the work of updating -- this involves saving the existing row in case of a ROLLBACK, tentatively storing the new value, etc, etc. -- This step is likely to be the most costly step.
Step 2 is the only difference -- and it is a quite small step. It is not worth worrying about when it comes to performance.
On the other hand, Step 2 means that the two Updates are different -- What should happen if bar1=4567? The first Update would change it, but the second won't.
Your final comment implies that maybe you should be using transactions to keep one client from stepping on another. Perhaps the code should be more like:
BEGIN;
SELECT ... WHERE id = 5000 FOR UPDATE;
-- decide what to do (which might include ROLLBACK and exit)
UPDATE Foo SET bar1=10 WHERE id = 5000;
COMMIT;
Bottom Line: Use transactions, not extra code, to deal with concurrency.
Caveat: A transaction should be "fast" (less than a few seconds). If you need a "long" transaction (eg, a 'shopping cart' that could take minutes to finish), a different mechanism is needed. If you need a long transaction, start a new question explaining the situation. (The current question is discussing the performance of a single Update.)

Benchmark MySQL with batch insert on multiple threads within same table

I want to compare write-intensive workloads on the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench to benchmark. My requirements are:
Multiple threads write concurrently to the same table.
Batch inserts are supported (each insert transaction inserts a bulk of records).
I checked all the pre-made sysbench tests and I don't see any that satisfy my requirements.
oltp_write_only: supports multiple threads that write to the same table. But this test doesn't have a bulk insert option.
bulk_insert: supports multiple threads, but each thread writes to a different table.
Are there any pre-made sysbench tests that satisfy my requirements? If not, can I find custom Lua scripts somewhere that already do this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
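The size estimate above can be checked with a few lines of arithmetic. This is only a sketch using decimal units (1 GB = 10^9 bytes); the helper names are made up for illustration, and the 400 bytes/row figure is the answer's own guess, not a measurement.

```python
GB = 10**9  # decimal gigabyte, matching the round numbers used in the text


def bytes_per_row(total_gb, rows):
    """Average on-disk bytes per row for a table of the given size."""
    return total_gb * GB / rows


def table_size_gb(row_bytes, rows):
    """Projected table size in GB for a given average row size."""
    return row_bytes * rows / GB


if __name__ == "__main__":
    rows = 250_000_000
    print(bytes_per_row(250, rows))   # current average row size in bytes
    print(table_size_gb(400, rows))   # projected size with compressed JSON
```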
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: If you promote user_id to PK, then this is worth considering: Use a file sort to sort the table by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)

Test insert query using mysqlslap

First, I am new to mysqlslap
I want to test an insert query using mysqlslap on my existing database. The table I want to test has a primary key and a composite unique key.
So, how do I run a concurrent performance test on this table using mysqlslap?
The test should not fail with a MySQL duplicate key error.
Below is skeleton for my table:
CREATE TABLE data (
id bigint(20) NOT NULL,
column1 bigint(20) DEFAULT NULL,
column2 varchar(255) NOT NULL DEFAULT '0',
datacolumn1 VARCHAR(255) NOT NULL DEFAULT '',
datacolumn2 VARCHAR(2048) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY profiles_UNIQUE (column1,column2),
INDEX id_idx (id),
INDEX unq_id_idx (column1, column2) USING BTREE
) ENGINE=innodb DEFAULT CHARSET=latin1;
Please help me on this
There are several problems with benchmarking INSERTs. The speed will change as you insert more and more, but not in an easily predictable way.
An Insert performs (roughly) this way:
Check for duplicate key. You have two unique keys (the PK and a UNIQUE). Each BTree will be drilled down to check for a dup. Assuming no dup...
The row will be inserted in the data (a BTree keyed by the PK)
A "row" will be inserted into each Unique's BTree. In your case, there is a BTree effectively ordered by (column1, column2) and containing (id).
Stuff is put into the "Change Buffer" for each non-unique index.
If you had an AUTO_INCREMENT or a UUID or ..., there will be more discussion.
The Change Buffer is effectively a "delayed write" to non-unique indexes. This delay has to be dealt with eventually. That is, at some time, things will slow down if a background process fails to keep up with the changes. That is, if you insert 1 million rows, you may not hit this slowdown; if you insert 10 million rows, you may hit it.
Another variable: VARCHAR(2048) (and other TEXT and BLOB columns) may or may not be stored "off record". This depends on the size of the row, the size of that column, and "row format". A big string may take an extra disk hit, thereby slowing down the benchmark, probably by a noticeable amount. That is, if you benchmark with only small strings and certain row formats, you will get a faster insert time than otherwise.
And you need to understand how the benchmark program runs -- versus how your application will run:
Insert rows one at a time in a single thread -- each being a transaction.
Insert rows one at a time in a single thread -- lots batched into a transaction.
Insert 100 rows at a time in a single thread in a single transaction.
LOAD DATA.
Multiple threads with each of the above.
Different transaction isolation settings.
Etc.
(I am not a fan of benchmarks because of how many flaws they have.) The 'best' benchmark for comparing hardware or limited schema/app changes: Capture the "general log" from a running application; capture the database at the start of that; time the re-applying of that log.
Designing a table/insert for 50K inserted rows/sec
Minimize indexes. In your case, all you need is PRIMARY KEY(col1, col2); toss the rest; toss id. Please explain what col1 and col2 are; there may be more tips here.
Get rid of the table. Seriously, consider summarizing the 50K rows every second and storing only the summarization. If it is practical, this will greatly speed things up. Or maybe summarize a minute's worth.
Batch insert rows in some way. The details here depend on whether you have one or many clients doing the inserts, whether you need to massage the data as it comes in, etc. More discussion: http://mysql.rjweb.org/doc.php/staging_table
What is in those strings? Can/should they be 'normalized'?
Let's discuss the math. Will you be loading about 10 petabytes per year? Do you have that much disk space? What will you do with the data? How long will it take to read even a small part of that data? Or will it be a "write only" database??
More math. 50K rows * 0.5KB = 25MB written to disk per second. What device do you have? Can it handle, say, 2x that? (With your original schema, it would be more like 60MB/s because of all the indexes.)
After comments
OK, so more like 3TB before you toss the data and start over (in 2 hours)? For that, I would suggest PARTITION BY RANGE and use some time function that gives you 5 minutes in each partition. This will give you a reasonable number of partitions (about 25) and the DROP PARTITION will be dropping only about 100GB, which might not overwhelm the filesystem. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
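The write-rate and partition figures in this answer can be reproduced with a short sketch. It assumes decimal units (1 MB = 1000 KB, 1 TB = 1000 GB) and an even spread of data over time; the function names are hypothetical.

```python
def write_rate_mb_per_sec(rows_per_sec, kb_per_row):
    """Raw data written per second, in MB, ignoring index overhead."""
    return rows_per_sec * kb_per_row / 1000


def partition_plan(retention_minutes, partition_minutes, total_gb):
    """Return (number of partitions, GB per partition) for time-based ranges."""
    count = retention_minutes // partition_minutes
    return count, total_gb / count


if __name__ == "__main__":
    print(write_rate_mb_per_sec(50_000, 0.5))   # 50K rows/sec at 0.5KB each
    print(partition_plan(120, 5, 3000))         # 2 hours, 5-minute partitions, 3TB
```

With these inputs the plan comes out at 24 partitions of 125GB each, in line with the "about 25" partitions and "about 100GB" per DROP PARTITION quoted above.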
As for the strings... You suggest 25KB, yet the declarations don't allow for that much???

Does InnoDB lock the whole table for a delete which uses part of a composed key?

I have a MySQL table (call it 'my_table') with a composed primary key with 4 columns (call them 'a', 'b', 'c' and 'd').
At least one time I encountered a deadlock on parallel asynchronous EJB calls calling 'DELETE FROM my_table where a=? and b=?' with different values, so I started to look into how InnoDB table locking works.
I've found no clear documentation on how table locking works with composed keys. Is the whole table locked by the delete, despite the fact that there's no overlap among the actual rows being deleted?
Do I need to do a select to recover the values for c and d and delete batches using the whole primary key?
This is in the context of a complex application which works with 4 different databases. Only MySQL seems to have this issue.
InnoDB never locks the entire table for DML statements. (Unless the DML is hitting all rows.)
There are other locks for DDL statements, such as when ALTER TABLE is modifying/adding columns/indexes/etc. (Some of these have been greatly sped up in MySQL 8.0.)
There is nothing special about a composite key wrt locking.
There is a thing called a "gap lock". For various reasons, the "gap" between two values in the index will be locked. This prevents potential conflicts such as inserting the same new value that does not yet exist, and there is a uniqueness constraint.
Since the PRIMARY KEY is a unique key, you may have hit something like that.
If practical, do SHOW ENGINE INNODB STATUS; to see whether the lock is "gap" or not.
Another thing that can happen is that a lock can start out being weak, then escalate to "eXclusive". This can lead to a deadlock.
Do I need to do a select to recover the values for c and d and delete batches using the whole primary key?
I think you need to explain more precisely what you are doing. Provide the query. Provide SHOW CREATE TABLE.
InnoDB's lock handling is possibly unique to MySQL. It has some quirks. Sometimes it is a bit greedy about what it locks; to compensate, it is possibly faster than the competition.
In any case, check for deadlocks (and timeouts) and deal with them. The hope is that these problems are rare enough that dealing with them is not too much of a performance burden.
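A hedged sketch of what "deal with them" might look like on the client side: rerun the whole transaction when MySQL reports error 1213 (deadlock) or 1205 (lock wait timeout). The helper names are invented, and the way the error code is read (an `errno` attribute, else the first exception argument) is an assumption about the driver in use.

```python
import time

# MySQL error codes for "lock wait timeout exceeded" and "deadlock found".
RETRYABLE_ERRNOS = {1205, 1213}


def error_code(exc):
    """Best-effort extraction of a numeric error code from a driver exception."""
    code = getattr(exc, "errno", None)
    if code is None and exc.args:
        code = exc.args[0]
    return code if isinstance(code, int) else None


def run_with_retry(transaction, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call transaction() and retry it on deadlock or lock wait timeout.

    transaction must run the whole BEGIN ... COMMIT unit, because the server
    has already rolled back (part of) the failed attempt.
    """
    for attempt in range(attempts):
        try:
            return transaction()
        except Exception as exc:
            last_try = attempt == attempts - 1
            if error_code(exc) not in RETRYABLE_ERRNOS or last_try:
                raise
            sleep(base_delay * (2 ** attempt))  # simple exponential backoff
```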
DELETE FROM my_table where a=? and b=? means that potentially a large number of rows are being deleted. That means that the undo log and MVCC need to do a lot of work. Hence, I recommend trying not to delete (or update) more than 1K rows at a time.
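The 1K-row advice could be applied roughly as follows. This is a sketch, assuming a DB-API cursor with `%s` placeholders and a `commit` callable supplied by the caller; `delete_in_batches` is a hypothetical helper, not something from the question.

```python
def delete_in_batches(cursor, commit, a, b, batch_size=1000):
    """Delete all rows matching (a, b) in chunks, committing after each chunk.

    Returns the total number of rows deleted.
    """
    # DELETE ... LIMIT caps how many rows one statement locks and logs.
    sql = f"DELETE FROM my_table WHERE a = %s AND b = %s LIMIT {int(batch_size)}"
    total = 0
    while True:
        cursor.execute(sql, (a, b))
        deleted = cursor.rowcount
        commit()  # release locks and undo space before the next chunk
        total += deleted
        if deleted < batch_size:
            return total
```

Each chunk is its own short transaction, so a concurrent session waits for at most one chunk rather than for the whole delete.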

MySQL MEMORY Engine Performance - Concurrent Insert vs Update

Thanks in advance for your help!
Question
While executing a large volume of concurrent writes to a simple table using the MySQL Memory Storage Engine, is there an objective performance difference between having all the writes be (A) updates to a very small table (say 100 rows) vs (B) inserts? By performance difference, I'm thinking speed/locking, but if there are other significant variables, please say so. Since the answer to this less specific question is often "it depends", I've written scenarios (A) & (B) below to provide context and define the details, in the hope of allowing for an objective answer.
Example Scenarios
I've written scenarios (A) & (B) below to help illustrate and provide context. You can assume an excess of RAM & CPU, MySQL 5.7 if it matters, the scenarios are simplified, and I'm using the Memory engine to remove disk I/O from the equation (and I'm aware it uses table-level locking). Thanks again for your help!
~ Scenario A ~
1) I've got a memory table with ~100 rows like this:
CREATE TABLE cache (
campaign_id MEDIUMINT UNSIGNED NOT NULL,
sum_clicks SMALLINT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (campaign_id)
) Engine=MEMORY DEFAULT CHARSET=latin1;
2) And ~1k worker threads populating said table like so:
UPDATE cache SET sum_clicks = sum_clicks + x WHERE campaign_id = y;
3) And finally, a job that runs every ~hour which does:
CREATE TABLE IF NOT EXISTS next_cache LIKE cache;
INSERT INTO next_cache (campaign_id) SELECT id FROM campaigns;
RENAME TABLE cache TO old_cache, next_cache TO cache;
SELECT * FROM old_cache...into somewhere else;
TRUNCATE old_cache;
RENAME TABLE old_cache TO next_cache; -- for next time
~ Scenario B ~
1) I've got a memory table like this:
CREATE TABLE cache (
campaign_id MEDIUMINT UNSIGNED NOT NULL,
sum_clicks SMALLINT UNSIGNED NOT NULL DEFAULT 0
) Engine=MEMORY DEFAULT CHARSET=latin1;
2) And ~1k worker threads populating said table like so:
INSERT INTO cache VALUES (y,x);
3) And finally, a job that runs every ~hour which does:
(~same thing as scenario A's 3rd step)
Post Script
For those searching Stack Overflow for this, I found these Stack Overflow questions & answers helpful, especially if you are open to using storage engines beyond the MEMORY engine: concurrent-insert-with-mysql and
insert-vs-update-mysql-7-million-rows
With 1K worker threads hitting this table, they will seriously stumble over each other. Note that MEMORY uses table locking. You are likely to be better off with an InnoDB table.
Regardless of the Engine, do 'batch' INSERTs/UPDATEs whenever practical. That is, insert/update multiple rows in a single statement and/or in a single transaction.
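To illustrate the batching advice, here is a small sketch that turns many rows into one multi-row INSERT statement with `%s` placeholders. The helper name `build_batch_insert` is an assumption for illustration; the table and column names come from the question's `cache` table and are assumed to be trusted identifiers, not user input.

```python
def build_batch_insert(table, columns, rows):
    """Build one multi-row INSERT and its flat parameter list.

    table and columns must be trusted identifiers; only values are parameterized.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([one_row] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params


if __name__ == "__main__":
    sql, params = build_batch_insert(
        "cache", ["campaign_id", "sum_clicks"], [(1, 5), (2, 7)]
    )
    print(sql)
    print(params)
```

A worker could collect clicks for a short interval and send them as one statement instead of one statement per click.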
Tips on high-speed ingestion -- My 'staging' table is very similar to your 'cache', though used for a different purpose.