Hello again internet nerds!
Technically I have solved this problem, but I want to know if there is a better route I should take...
I have a large table (~4M rows) that is a collection of data, segmented by an int column `chip`. There are 6 segments of data, so chip IDs 1 through 6.
For each of these 6 segments, I need to assign a sequential order integer, as it represents the exact location of the data on that segment.
My solution is (was) this:
# set the iterator
set @i := 0;
# number the rows for one chip
update `table` set `order` = (@i := @i + 1) where chip = 1;
This works. But it is so slow it sometimes triggers a timeout error. I need to run it 6 times, and it may be triggered on occasion by our application whenever necessary. Maybe I just need to allot more time in my MySQL settings to account for the slow query, or is there an optimal, simpler solution for this?
Thanks for the advice.
Edit:
I've found a solution that works accurately and takes ~50 seconds to complete.
I'm now pairing an ordered select statement with the update, joining against a derived table that numbers its rows in a column. This also handles the out-of-order updates I was experiencing.
See:
set @count := 0;
update
    `table` as target,
    (select
        (@count := @count + 1) as row_num,
        t.*
    from `table` as t
    where chip = 1
    order by t.id asc) as table_with_iterative
set target.`order` = table_with_iterative.row_num
where target.id = table_with_iterative.id;
I think that, if possible, you should assign the sequence numbers as you are inserting the rows into the table for the very first time. Strictly speaking, an SQL database does not promise that rows will initially appear inside the table in the order of the INSERT statements.
One possibility that occurs to me is to use an auto-increment field in this table, because this will express the order in which the rows were added. But the field-values for any given chip, although ascending, might not be consecutive and undoubtedly would not start at 1.
If you did need such a field, I think I'd define a separate field (default value: NULL) to hold it. Then, a very simple stored procedure could query the rows for a given chip (ORDER BY auto-increment number) and assign a 1-based consecutive sequential number to that separate field. (A slightly more elaborate query could identify all the lists that don't have those numbers yet, and number all of them at once.)
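A rough sketch of such a stored procedure, assuming a table named `readings` with an AUTO_INCREMENT `id`, a `chip` column, and a nullable `seq` column to receive the 1-based sequence (all of these names are placeholders, not taken from the question):

```sql
-- Sketch only: table and column names are illustrative.
DELIMITER //
CREATE PROCEDURE number_chip(IN p_chip INT)
BEGIN
    SET @n := 0;
    -- A single-table UPDATE may carry an ORDER BY in MySQL,
    -- so rows are numbered in auto-increment order.
    UPDATE readings
    SET seq = (@n := @n + 1)
    WHERE chip = p_chip
    ORDER BY id;
END //
DELIMITER ;

CALL number_chip(1);
```

Calling it once per chip value (1 through 6) numbers each segment independently, each starting at 1.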
I want to return rows in random order from a table with a large number of rows to be scanned.
Tried:
1) select * from `table` order by rand() limit 1
2) select * from `table` where id in (select id from `table` order by rand() limit 1)
2 is faster than 1, but still too slow on a table with many rows.
Update:
The query is used in a real-time app. Inserts, selects and updates each run at roughly 10/sec, so caching will not be the ideal solution. The number of rows required in this specific case is 1, but I'm also looking for a general solution where the query is fast and the number of rows required is > 1.
The fastest way is to use a prepared statement in MySQL together with LIMIT:
select @offset := floor(rand() * total_rows_in_table);
PREPARE STMT FROM 'select id from `table` limit ?,1';
EXECUTE STMT USING @offset;
Here total_rows_in_table is the total number of rows in the table.
It is much faster compared to the two above.
Limitation: fetching more than one row is not truly random.
Generate a random set of IDs before doing the query (you can also get MAX(id) very quickly if you need it). Then do the query as id IN (your, list). This will use the index to look only at the IDs you requested, so it will be very fast.
Limitation: if some of your randomly chosen IDs don't exist, the query will return fewer results, so you'll need to repeat these operations in a loop until you have enough results.
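A minimal sketch of that approach, assuming an AUTO_INCREMENT primary key `id` on `yourtable`; the random ID list itself would be generated in the application:

```sql
-- Step 1: get the upper bound for random ID generation (index-only, fast).
SELECT MAX(id) FROM yourtable;

-- Step 2: the application picks, say, 5 random integers in [1, MAX(id)]
-- (the values below are made up) and fetches them by primary key:
SELECT * FROM yourtable WHERE id IN (73, 911, 2344, 4087, 10212);
-- If gaps in the ID space swallow some of them, fewer than 5 rows come
-- back; generate more IDs and repeat until enough rows are collected.
```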
If you can run two queries in the same "call", you can do something like this. Sadly, it assumes there are no deleted records in your database; if there were, some queries would not return anything.
I tested with some local records, and this was the fastest I could do; that said, I tested it on a table with no deleted rows.
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SELECT *
FROM yourtable
WHERE id = @randy;
Another solution came from modifying the answer to this question a little, and from your own solution:
Using variables as OFFSET in SELECT statments inside mysql's stored functions
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SET @q1 = CONCAT('SELECT * FROM yourtable LIMIT 1 OFFSET ', @randy);
PREPARE stmt1 FROM @q1;
EXECUTE stmt1;
I imagine a table with, say, a million entries. You want to pick a row randomly, so you generate one random number per row, i.e. a million random numbers, and then seek the row with the minimum generated number. There are two tasks involved:
generating all those numbers
finding the minimum number
and then accessing the record of course.
If you wanted more than one row, the DBMS could sort all records and then return n of them, but hopefully it would rather apply some partial-sort operation where it only finds the n minimum numbers. Quite a task anyway.
There is no way to fully circumvent this, I guess. If you want truly random access, this is the way to go.
If you were ready to live with a less random result, however, I'd suggest making ID buckets. Imagine ID buckets 000000-099999, 100000-199999, ... Then randomly choose one bucket and pick your random rows from it. Well, admittedly, this doesn't look very random, and you would get either only old or only new records with such buckets; but it illustrates the technique.
Instead of creating the buckets by value, you'd create them with a modulo function. id % 1000 would give you 1000 buckets. The first with IDs xxx000, the second with IDs xxx001. This would solve the new/old records thing and get the buckets balanced. As IDs are a mere technical thing, it doesn't matter at all that the drawn IDs look so similar. And even if that bothers you, then don't make 1000 buckets, but say 997.
Now create a computed column:
alter table mytable add column bucket int generated always as (id % 997) stored;
Add an index:
create index idx on mytable(bucket);
And query the data:
select *
from mytable
where bucket = floor(rand() * 997)
order by rand()
limit 10;
Only about 0.1% of the table gets into the sorting here, so this should be rather fast. But I suppose it only pays off with a very large table and a high number of buckets.
Disadvantages of the technique:
It can happen that you don't get as many rows as you want and you'd have to query again then.
You must choose the modulo number wisely. If there are just two thousand records in the table, you wouldn't make 1000 buckets of course, but maybe 100 and never demand more than, say, ten rows at a time.
If the table grows and grows, a once chosen number may no longer be optimal and you might want to alter it.
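Changing your mind about the bucket count later amounts to redefining the generated column, e.g. (a sketch; dropping the column also drops its index, so the index is recreated afterwards):

```sql
-- Sketch: move from 997 to 1997 buckets on the grown table.
ALTER TABLE mytable DROP COLUMN bucket;
ALTER TABLE mytable
    ADD COLUMN bucket INT GENERATED ALWAYS AS (id % 1997) STORED;
CREATE INDEX idx ON mytable (bucket);
```

On a large table this rewrite is itself expensive, so it's a maintenance-window operation, not something to do casually.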
Rextester link: http://rextester.com/VDPIU7354
UPDATE: It just dawned on me that the buckets would be really random if the generated column were based not on a modulo of the ID, but on a RAND() value instead:
alter table mytable add column bucket int generated always as (floor(rand() * 1000)) stored;
but MySQL throws an error "Expression of generated column 'bucket' contains a disallowed function". This doesn't seem to make sense, as a non-deterministic function should be okay with the STORED option, but at least in version 5.7.12 this doesn't work. Maybe in some later version?
I have a table (rather ugly design, but anyway) which consists only of strings. The worst part is that there is a script which adds records from time to time. Records will never be deleted.
I believe that MySQL stores records in a random-access file, and that I could get the last (or any other) record using C or the like, since I know the maximum record length and can find the EOF.
When I do something like "SELECT * FROM table" in MySQL, I get all the records in the right order, because MySQL reads this file from beginning to end. I need only the last one(s).
Is there a way to get the LAST record (or records) using MySQL query only, without ORDER BY?
Well, I suppose I've found a solution here, so my current query is
SELECT
    @i := @i + 1 AS iterator,
    t.*
FROM
    `table` t,
    (SELECT @i := 0) i
ORDER BY
    iterator DESC
LIMIT 5
If there's a better solution, please let me know!
The order is not guaranteed unless you use an ORDER BY. It just happens that the records you're getting back are sorted the way you need them.
This is where keys (the primary key, for example) matter.
You can modify your table by adding a primary key column with an auto_increment value.
Then you can query
select * from your_table where id =(select max(id) from your_table);
and get the last inserted row.
I am implementing a leaderboard in my application, and I want to update it every so often.
For that I created two tables of leader boards, that each one looks like this:
user_id, score, rank
and this is my update query:
select score from leaderboard order by score for update;
select (@rankCounter := 0);
update leaderboard set rank = (select (@rankCounter := @rankCounter + 1)) order by score desc;
I use the active table for queries, and every so often I switch the active table.
The update currently takes around 3 minutes (on my machine) to update 4M rows.
I wish to reduce the amount of CPU it takes, and I don't care if the update takes longer.
How can I do that?
I recommend you try adding an index ... ON leaderboard (score), to avoid a sort operation. I also recommend you remove the unnecessary SELECT from the UPDATE statement (I don't know whether that has any performance implications, but the SELECT keyword is not necessary in that context).
A sort operation is certainly going to use some CPU. It's not clear to me whether that SELECT inside the UPDATE statement is ignored by the optimizer, or whether the plan is somehow different with that (unnecessary?) SELECT in there. (What is the purpose of including the SELECT keyword in that context?)
Also, it is not necessary to return the score value from every row to obtain locks on all the rows in the leaderboard table. The ORDER BY on that SELECT statement could also be consuming CPU cycles (if there is no index with score as the leading column). The unnecessary preparation of a 4M-row resultset is also consuming CPU cycles.
It's not clear why it's necessary to obtain locks on all those rows in the table with a SELECT ... FOR UPDATE, when the UPDATE statement itself will obtain the necessary locks. (The SELECT ... FOR UPDATE statement will only obtain locks in the context of a transaction, or if autocommit is disabled; I'm assuming here that leaderboard is an InnoDB table.)
MySQL may be able to make use of an index to avoid a sort operation:
CREATE INDEX leaderboard_IX1 ON leaderboard (score) ;
And this should be sufficient to update the rank column:
SET @rankCounter := 0;
UPDATE leaderboard
SET rank = @rankCounter := @rankCounter + 1
ORDER BY score DESC ;
Here's the use case:
I have a table with a bunch of unique codes which are either available or not. As part of a transaction, I want to select an available code from the table and then update that row later in the same transaction. Since this can happen concurrently across many sessions, I ideally want to select a random record and use row-level locking, so that other transactions aren't blocked by the query that is selecting a row.
I am using InnoDB for the storage engine, and my query looks something like this:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
However, rather than locking just one row from the table, it ends up locking the whole table. Can anyone give me some pointers on how to make it so that this query doesn't lock the whole table but just the row?
Update
Addendum: I was able to achieve row-level locking by specifying an explicit key in my select rather than doing the rand(). When my queries look like this:
Query 1:
select * from tbl_codes where available = 1 and id=5 limit 1 for update
Query 2:
select * from tbl_codes where available = 1 and id=10 limit 1 for update
However, that doesn't really help solve the problem.
Addendum 2: Final Solution I went with
Given that rand() has some issues in MySQL, the strategy I chose is:
I select 50 code IDs where available = 1, then shuffle the array in the application layer to add a level of randomness to the order.
select id from tbl_codes where available = 1 limit 50
I pop codes from my shuffled array in a loop until I am able to select one with a lock:
select * from tbl_codes where available = 1 and id = :id for update
It may be useful to look at how this query is actually executed by MySQL:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
This will read and sort all rows that match the WHERE condition, generate a random number using rand() into a virtual column for each row, sort all rows (in a temporary table) based on that virtual column, and then return rows to the client from the sorted set until the LIMIT is reached (in this case just one). The FOR UPDATE affects locking done by the entire statement while it is executing, and as such the clause is applied as rows are read within InnoDB, not as they are returned to the client.
Putting aside the obvious performance implications of the above (it's terrible), you're never going to get reasonable locking behavior from it.
Short answer:
Select the row you want, using RAND() or any other strategy you like, in order to find the PRIMARY KEY value of that row. E.g.: SELECT id FROM tbl_codes WHERE available = 1 ORDER BY rand() LIMIT 1
Lock the row you want using its PRIMARY KEY only. E.g.: SELECT * FROM tbl_codes WHERE id = N FOR UPDATE
Hopefully that helps.
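The two steps combined might look like this (a sketch only; it assumes InnoDB and re-checks available when locking, since another session may have taken the row between the two selects):

```sql
START TRANSACTION;

-- Step 1: pick a candidate without locking anything.
SELECT id INTO @pick
FROM tbl_codes
WHERE available = 1
ORDER BY RAND()
LIMIT 1;

-- Step 2: lock only that row by primary key, re-checking availability.
SELECT * FROM tbl_codes
WHERE id = @pick AND available = 1
FOR UPDATE;

-- If the locked re-check returned the row, mark it used and commit;
-- otherwise roll back and retry with a new pick.
UPDATE tbl_codes SET available = 0 WHERE id = @pick;
COMMIT;
```

Step 1 still pays the ORDER BY RAND() cost, but it holds no locks while doing so, which is the point.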
Even if it doesn't map exactly to your question, the problem is somewhat discussed here: http://akinas.com/pages/en/blog/mysql_random_row/
The problem with this method is that it is very slow. The reason for
it being so slow is that MySQL creates a temporary table with all the
result rows and assigns each one of them a random sorting index. The
results are then sorted and returned.
The article does not deal with locks. However, maybe MySQL locks all the rows having available = 1 and does not release them until the end of the transaction!
The article proposes some solutions; none of them seems good for you, except this one, which is unfortunately very hacky, and whose correctness I haven't probed.
SELECT * FROM `table`
WHERE id >= (SELECT FLOOR(MAX(id) * RAND()) FROM `table`)
ORDER BY id LIMIT 1;
This is the best I can do for you, since I'm not well versed in MySQL internals. Moreover, the article is pretty old.
I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought: since the PK is auto_increment, I could just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work, I would first need to find out which IDs have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey   Status   StatusPkey
1      A        1
2      A        2
3      B        1
4      B        2
5      C        1
...    C        ...
n      C        m   (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that runs on an interval, or you could do it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
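A sketch of the interval-style rebuild, assuming a data table `items (id, status)` (names are placeholders); the user-variable numbering relies on classic MySQL left-to-right expression evaluation, so treat it as illustrative rather than guaranteed:

```sql
-- Rebuild the secondary table, numbering rows 1..n within each status.
DROP TABLE IF EXISTS item_status_keys;
SET @prev := NULL, @n := 0;
CREATE TABLE item_status_keys AS
SELECT id AS pkey,
       status,
       @n := IF(status = @prev, @n + 1, 1) AS status_pkey,
       @prev := status AS prev_status   -- bookkeeping column
FROM (SELECT id, status FROM items ORDER BY status, id) s;
```

The derived table forces the ordering before the variables are evaluated, which is the usual workaround for variable/ORDER BY interactions.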
Check out this article by Jan Kneschke... It does a great job of explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns a random integer from 0 up to one less than the row count. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
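Wiring the two steps together server-side takes a prepared statement, since LIMIT/OFFSET only accept literals or ? placeholders (a sketch; `status = 'rare'` stands in for your 5% condition):

```sql
SET @off := (SELECT FLOOR(RAND() * COUNT(*))
             FROM MyTable
             WHERE status = 'rare');
PREPARE stmt FROM
    'SELECT * FROM MyTable WHERE status = ''rare'' LIMIT 1 OFFSET ?';
EXECUTE stmt USING @off;
DEALLOCATE PREPARE stmt;
```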
Not every problem must be solved in a single SQL query.