MySQL: How to achieve row-level transaction locking instead of table locking - mysql

Here's the use case:
I have a table with a bunch of unique codes which are either available or not available. As part of a transaction, I want to select a code that is available from the table, then update that row later in the transaction. Since this can happen concurrently across many sessions, I would ideally like to select a random record and use row-level locking, so that other transactions aren't blocked by the query that is selecting a row from the table.
I am using InnoDB for the storage engine, and my query looks something like this:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
However, rather than locking just one row from the table, it ends up locking the whole table. Can anyone give me some pointers on how to make it so that this query doesn't lock the whole table but just the row?
Update
Addendum: I was able to achieve row-level locking by specifying an explicit key in my select rather than doing the rand(). When my queries look like this:
Query 1:
select * from tbl_codes where available = 1 and id=5 limit 1 for update
Query 2:
select * from tbl_codes where available = 1 and id=10 limit 1 for update
However, that doesn't really help solve the problem.
Addendum 2: Final Solution I went with
Given that rand() has some issues in MySQL, the strategy I chose is:
I select 50 code IDs where available = 1, then shuffle the array in the application layer to add a level of randomness to the order.
select id from tbl_codes where available = 1 limit 50
I start popping codes from my shuffled array in a loop until I am able to select one with a lock
select * from tbl_codes where available = 1 and id = :id for update
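One iteration of that loop might look roughly like this (just a sketch; :id is a placeholder bound by the application from the shuffled batch, and I'm assuming the code is marked as used within the same transaction once it is successfully locked):
-- inside an open transaction, for each candidate :id from the shuffled batch:
select * from tbl_codes where available = 1 and id = :id for update;
-- if a row came back, it is now locked by this session; claim it and stop looping
update tbl_codes set available = 0 where id = :id;
commit;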

It may be useful to look at how this query is actually executed by MySQL:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
This will read all rows that match the WHERE condition, generate a random number using rand() into a virtual column for each row, sort all rows (in a temporary table) based on that virtual column, and then return rows to the client from the sorted set until the LIMIT is reached (in this case just one). The FOR UPDATE affects the locking done by the entire statement while it is executing, and as such the clause is applied as rows are read within InnoDB, not as they are returned to the client.
Putting aside the obvious performance implications of the above (it's terrible), you're never going to get reasonable locking behavior from it.
Short answer:
Select the row you want, using RAND() or any other strategy you like, in order to find the PRIMARY KEY value of that row. E.g.: SELECT id FROM tbl_codes WHERE available = 1 ORDER BY rand() LIMIT 1
Lock the row you want using its PRIMARY KEY only. E.g.: SELECT * FROM tbl_codes WHERE id = N FOR UPDATE
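Put together in a single transaction, that might look like this (a sketch only; 42 stands for whatever id the first SELECT happened to return):
START TRANSACTION;
SELECT id FROM tbl_codes WHERE available = 1 ORDER BY rand() LIMIT 1;  -- say this returns 42
SELECT * FROM tbl_codes WHERE id = 42 FOR UPDATE;                      -- locks only that row
UPDATE tbl_codes SET available = 0 WHERE id = 42;                      -- mark it used
COMMIT;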
Hopefully that helps.

Even if it does not map exactly to your question, the problem is somewhat discussed here: http://akinas.com/pages/en/blog/mysql_random_row/
The problem with this method is that it is very slow. The reason for
it being so slow is that MySQL creates a temporary table with all the
result rows and assigns each one of them a random sorting index. The
results are then sorted and returned.
The article does not deal with locks. However, maybe MySQL locks all the rows having available = 1 and does not release them until the end of the transaction!
That article proposes some solutions, none of which seems to be good for you, except this one, which is, unfortunately, very hacky, and I didn't verify its correctness.
SELECT * FROM table
WHERE id >= (SELECT FLOOR(MAX(id) * RAND()) FROM table)
ORDER BY id LIMIT 1;
This is the best I can do for you, since I'm not well versed in MySQL internals. Moreover, the article is pretty old.

Related

Fastest way of randomising result with large dataset in mysql

I want to return rows ordered randomly from a table with a large number of rows to be scanned.
Tried:
1) select * from table order by rand() limit 1
2) select * from table where id in (select id from table order by rand() limit 1)
2 is faster than 1 but still too slow on a table with many rows.
Update:
The query is used in a real-time app. Inserts, selects and updates are roughly 10/sec, so caching will not be the ideal solution. The number of rows required in this specific case is 1, but I'm also looking for a general solution where the query is fast and the number of rows required is > 1.
The fastest way is using a prepared statement in MySQL and LIMIT:
select @offset:=floor(rand()*total_rows_in_table);
PREPARE STMT FROM 'select id from table limit ?,1';
EXECUTE STMT USING @offset;
total_rows_in_table = the total number of rows in the table.
It is much faster compared to the two above.
Limitation: fetching more than 1 row is not truly random.
Generate a random set of IDs before doing the query (you can also get MAX(id) very quickly if you need it). Then do the query as id IN (your, list). This will use the index to look only at the IDs you requested, so it will be very fast.
Limitation: if some of your randomly chosen IDs don't exist, the query will return fewer results, so you'll need to repeat these operations in a loop until you have enough results.
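For illustration, the flow might look like this (a sketch; "table" is the question's placeholder name, and the id values are made up — they would be generated at random by the application after reading MAX(id)):
SELECT MAX(id) FROM table;                        -- upper bound for the random id generator
SELECT * FROM table WHERE id IN (17, 4083, 921);  -- ids picked at random by the application
-- if fewer rows than needed come back (missing ids), generate more ids and repeat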
If you can run two queries in the same "call" you can do something like this. Sadly, this assumes there are no deleted records in your database... if there were, some queries would not return anything.
I tested with some local records and the fastest I could do was this... that said, I tested it on a table with no deleted rows.
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SELECT *
FROM yourtable
WHERE id = @randy;
Another solution that came from modifying a little the answer to this question, and from your own solution:
Using variables as OFFSET in SELECT statments inside mysql's stored functions
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SET @q1 = CONCAT('SELECT * FROM yourtable LIMIT 1 OFFSET ', @randy);
PREPARE stmt1 FROM @q1;
EXECUTE stmt1;
I imagine a table with, say, a million entries. You want to pick a row randomly, so you generate one random number per row, i.e. a million random numbers, and then seek the row with the minimum generated number. There are two tasks involved:
generating all those numbers
finding the minimum number
and then accessing the record of course.
If you wanted more than one row, the DBMS could sort all records and then return n of them, but hopefully it would instead apply some partial-sort operation that only detects the n minimum numbers. Quite a task anyway.
There is no thorough way to circumvent this, I guess. If you want random access, this is the way to go.
If you are ready to live with a less random result, however, I'd suggest making ID buckets. Imagine ID buckets 000000-099999, 100000-199999, ... Then randomly choose one bucket and pick your random rows from it. Admittedly, this doesn't look very random, and you would get either only old or only new records with such buckets; but it illustrates the technique.
Instead of creating the buckets by value, you'd create them with a modulo function. id % 1000 would give you 1000 buckets. The first with IDs xxx000, the second with IDs xxx001. This would solve the new/old records thing and get the buckets balanced. As IDs are a mere technical thing, it doesn't matter at all that the drawn IDs look so similar. And even if that bothers you, then don't make 1000 buckets, but say 997.
Now create a computed column:
alter table mytable add column bucket int generated always as (id % 997) stored;
Add an index:
create index idx on mytable(bucket);
And query the data:
select *
from mytable
where bucket = floor(rand() * 997)
order by rand()
limit 10;
Only about 0.1% of the table gets into the sorting here, so this should be rather fast. But I suppose that only pays off with a very large table and a high number of buckets.
Disadvantages of the technique:
It can happen that you don't get as many rows as you want and you'd have to query again then.
You must choose the modulo number wisely. If there are just two thousand records in the table, you wouldn't make 1000 buckets of course, but maybe 100 and never demand more than, say, ten rows at a time.
If the table grows and grows, a once chosen number may no longer be optimal and you might want to alter it.
Rextester link: http://rextester.com/VDPIU7354
UPDATE: It just dawned on me that the buckets would be really random if the generated column were based not on a modulo of the ID but on a RAND() value instead:
alter table mytable add column bucket int generated always as (floor(rand() * 1000)) stored;
but MySQL throws an error "Expression of generated column 'bucket' contains a disallowed function". This doesn't seem to make sense, as a non-deterministic function should be okay with the STORED option, but at least in version 5.7.12 this doesn't work. Maybe in some later version?
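If the generated-column route stays blocked, the same effect can be had with a plain column populated once and indexed; a minimal sketch (new rows would then need their bucket set by the application or a BEFORE INSERT trigger):
alter table mytable add column bucket int;
update mytable set bucket = floor(rand() * 1000);  -- one random bucket per existing row
create index idx_bucket on mytable(bucket);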

SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-thread program (in Java) that will need to select random rows in a database, in order to update them. This is working well but I started to encounter some performance issue regarding my SELECT request.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row, just above the randomly generated id number. This works pretty well (an average of less than 100ms per request on 150k rows). But after my program processes a row, the FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my request, doing this:
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request there is no more problem of skipping rows, but performance decreases drastically (an average of 1s per request on 150k rows).
I could simply execute the fast one when I still have a lot of rows to process, and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > R.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in ids. And it is easy to create distributions where certain rows are almost never chosen... say the four ids are 1, 2, 3, 1000000. Your method will essentially never get 1000000. The above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query.
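That index might be created like this (the index name is an assumption, and Table stands for the question's real table name):
CREATE INDEX idx_table_fk_id ON Table (FOREIGNKEY_ID, ID);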
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, then the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
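For completeness, the column, index, and trigger described above might look roughly like this (a sketch; the names are assumptions and the table is called t as in the query above):
ALTER TABLE t ADD COLUMN random DOUBLE;
UPDATE t SET random = RAND();              -- initialize existing rows
CREATE INDEX idx_t_random ON t (random);

DELIMITER //
CREATE TRIGGER t_random_before_insert BEFORE INSERT ON t FOR EACH ROW
BEGIN
    SET NEW.random = RAND();               -- keep new rows randomized
END //
DELIMITER ;

SET @random = RAND();                      -- pick the anchor point before running the query above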
Your IDs probably contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (< 1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If a prepared statement can be used, then this should work:
SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.

Is it a bad idea to store row count and number of row to speed up pagination?

My website has more than 20,000,000 entries; entries have categories (FK) and tags (M2M). Even for a query like SELECT id FROM table ORDER BY id LIMIT 1000000, 10, MySQL needs to scan 1,000,010 rows, which is unacceptably slow (and PKs, indexes, joins, etc. don't help much here; it's still 1,000,010 rows). So I am trying to speed up pagination by storing the row count and row number with triggers like this:
DELIMITER //
CREATE TRIGGER trigger_name
BEFORE INSERT
ON entry_table FOR EACH ROW
BEGIN
UPDATE category_table SET row_count = (@rc := row_count + 1)
WHERE id = NEW.category_id;
SET NEW.row_number_in_category = @rc;
END //
And then I can simply:
SELECT *
FROM entry_table
WHERE row_number_in_category > 10
ORDER BY row_number_in_category
LIMIT 10
(now only 10 rows are scanned, so selects are blazing fast; inserts are slower, but they are rare compared to selects, so that is OK)
Is it a bad approach and are there any good alternatives?
Although I like the solution in the question, it may present some issues if data in the entry_table is changed, perhaps deleted or assigned to different categories over time.
It also limits the ways in which the data can be sorted, the method assumes that data is only sorted by the insert order. Covering multiple sort methods requires additional triggers and summary data.
One alternative way of paginating is to pass in an offset on the field you are sorting/paginating by, instead of an offset in the LIMIT clause.
Instead of this:
SELECT id FROM table ORDER BY id LIMIT 1000000, 10
Do this - assuming in this scenario that the last result viewed had an id of 1000000.
SELECT id FROM table WHERE id > 1000000 ORDER BY id LIMIT 0, 10
By tracking the offset of the pagination, this can be passed to subsequent queries for data and avoids the database sorting rows that are not ever going to be part of the end result.
If you really only want 10 rows out of 20 million, you could go further and guess that the next 10 matching rows will occur in the next 1000 overall results, perhaps with some logic to repeat the query with a larger allowance if that is not the case.
SELECT id FROM table WHERE id BETWEEN 1000000 AND 1001000 ORDER BY id LIMIT 0, 10
This should be significantly faster because the sort will probably be able to limit the result in a single pass.
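If the guessed window comes back with fewer rows than needed, the application can widen it and retry; a rough sketch of that retry (the bounds are examples, not values from the original post):
-- first attempt: a 1,000-id window after the last seen id
SELECT id FROM table WHERE id BETWEEN 1000000 AND 1001000 ORDER BY id LIMIT 0, 10;
-- if fewer than 10 rows came back, widen the window (say, 10x) and try again
SELECT id FROM table WHERE id BETWEEN 1000000 AND 1011000 ORDER BY id LIMIT 0, 10;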

MySQL RAND() optimization with LIMIT option

I have 50,000 rows in a table and I am running the following query, but I heard it is a bad idea. How do I make it work in a better way?
mysql> SELECT t_dnis,account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 ORDER BY RAND() LIMIT 1;
+------------+------------+
| t_dnis | account_id |
+------------+------------+
| 5623157085 | 1127 |
+------------+------------+
Is there any other way I can make this query faster, or should I use other options?
I am not a DBA, so sorry if this question has been asked before :(
Note: currently we are not seeing a performance issue, but we are growing, so there could be an impact in the future; I just want to know the pros and cons before we are out of the woods.
This query:
SELECT t_dnis, account_id
FROM mytable
WHERE o_dnis = '15623157085' AND enabled = 1
ORDER BY RAND()
LIMIT 1;
is not sorting 50,000 rows. It is sorting the number of rows that match the WHERE clause. As you state in the comments, this is in the low double digits. On a handful of rows, the use of ORDER BY rand() should not have much impact on performance.
You do want an index. The best index would be mytable(o_dnis, enabled, t_dnis, account_id). This is a covering index for the query, so the original data pages do not need to be accessed.
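A sketch of that covering index (the index name is an assumption):
CREATE INDEX mytable_covering_ndx ON mytable (o_dnis, enabled, t_dnis, account_id);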
Under most circumstances, I would expect the ORDER BY to be fine up to at least a few hundred rows, if not several thousand. Of course, this depends on lots of factors, such as your response-time requirements, the hardware you are running on, and how many concurrent queries are running. My guess is that your current data/configuration does not pose a performance problem, and there is ample room for growth in the data without an issue arising.
Unless you are running on very slow hardware, you should not experience problems sorting (much?) less than 50,000 rows. So if you still ask the question, this makes me suspect that your problem does not lie in the RAND().
For example one possible cause of slowness could be not having a proper index - in this case you can go for a covering index:
CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis, t_dnis, account_id);
or the basic
CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis);
At this point you should already have good performances.
Otherwise you can run the query twice, either by counting the rows or just priming a cache. Which to choose depends on the data structure and how many rows are returned; usually, the COUNT option is the safest bet.
SELECT COUNT(1) AS n FROM mytable WHERE ...
which gives you n, allowing you to generate a random number k between 0 and n-1, followed by
SELECT ... FROM mytable WHERE ... LIMIT k, 1
which ought to be really fast. Again, the index will help you speeding up the counting operation.
In some cases (MySQL only) you could perhaps do better with
SELECT SQL_CACHE SQL_CALC_FOUND_ROWS ... FROM mytable WHERE ...
using the FOUND_ROWS() function to recover n, then run the second query, which should take advantage of the cache. It's best if you experiment first, though. And changes in the table demographics might cause performance to fall.
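One possible reading of that suggestion, as a sketch only (note that the query cache and SQL_CALC_FOUND_ROWS are deprecated or removed in MySQL 8.0, so treat this as legacy):
SELECT SQL_CALC_FOUND_ROWS t_dnis, account_id
FROM mytable
WHERE o_dnis = '15623157085' AND enabled = 1
LIMIT 1;
SELECT FOUND_ROWS();  -- n, used to pick the random offset k for the LIMIT k, 1 query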
The problem with ORDER BY RAND() LIMIT 1 is that MySQL will give each row a random value and sort on that, performing a full table scan, and then drop all the results but one.
This is especially bad on a table with a lot of rows, doing a query like
SELECT * FROM foo ORDER BY RAND() LIMIT 1
However in your case the query is already filtering on o_dnis and enabled. If there are only a limited number of rows that match (like a few hundred), doing an ORDER BY RAND() shouldn't cause a performance issue.
The alternative requires two queries: one to count and the other to fetch.
In pseudocode:
count = query("SELECT COUNT(*) FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1").value
offset = random(0, count - 1)
result = query("SELECT t_dnis, account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 LIMIT 1 OFFSET " + offset).row
Note: For the pseudo code to perform well, there needs to be a (multi-column) index on o_dnis, enabled.
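That index might be created like this (the index name is an assumption):
CREATE INDEX mytable_dnis_enabled_ndx ON mytable (o_dnis, enabled);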

MySQL update with low CPU cost

I am implementing a leaderboard in my application, and I want to update it every so often.
For that I created two leaderboard tables, each of which looks like this:
user_id, score, rank
and this is my update query:
select score from leaderboard order by score for update;
select(@rankCounter := 0);
update leaderboard set rank = (select(@rankCounter := @rankCounter + 1)) order by score desc;
I am using my active table for queries, and every so often I switch the active table.
The update currently takes around 3 minutes (on my machine) to update 4M rows.
I wish to reduce the amount of CPU it takes, and I don't care that the update will take longer.
How can I do that?
I recommend you try adding an index ... ON leaderboard (score), to avoid a sort operation. I also recommend you remove the unnecessary SELECT from the UPDATE statement (I don't know whether that has any performance implications, but the SELECT keyword is not necessary in that context).
A sort operation is certainly going to use some CPU. It's not clear to me whether that SELECT inside the UPDATE statement is ignored by the optimizer, or whether the plan is somehow different with that (unnecessary?) SELECT in there. (What is the purpose of including the SELECT keyword in that context?)
Also, it is not necessary to return the score value from every row to obtain locks on all the rows in the leaderboard table. The ORDER BY on that SELECT statement could also be consuming CPU cycles (if there is no index with score as the leading column). The unnecessary preparation of a 4M-row resultset is also consuming CPU cycles.
It's not clear why it's necessary to obtain locks on all those rows in the table with a SELECT ... FOR UPDATE, when the UPDATE statement itself will obtain the necessary locks. (The SELECT ... FOR UPDATE statement will only obtain locks in the context of a BEGIN TRANSACTION, or if autocommit is disabled. I'm assuming here that leaderboard is an InnoDB table.)
MySQL may be able to make use of an index to avoid a sort operation:
CREATE INDEX leaderboard_IX1 ON leaderboard (score) ;
And this should be sufficient to update the rank column:
SET @rankCounter := 0;
UPDATE leaderboard
SET rank = (@rankCounter := @rankCounter + 1)
ORDER BY score DESC;
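Putting it together with the two-table swap described in the question might look like this (a sketch; the standby table name is an assumption, and RENAME TABLE performs the swap atomically):
-- rebuild ranks on the standby copy, then swap it in
SET @rankCounter := 0;
UPDATE leaderboard_standby
SET rank = (@rankCounter := @rankCounter + 1)
ORDER BY score DESC;

RENAME TABLE leaderboard TO leaderboard_old,
             leaderboard_standby TO leaderboard,
             leaderboard_old TO leaderboard_standby;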