I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This works well, but I have started to run into performance issues with my SELECT request.
I tried multiple solutions before finding this website :
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects a single row whose id is above the generated random number. This works pretty well (an average of less than 100ms per request on 150k rows). But once my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated number, and I won't be able to process them.
So I tried to adapt my request, doing this :
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request, no more rows are skipped, but performance drops drastically (an average of 1s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process, and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in the ids. It is easy to create distributions where some ids are never chosen . . . say that the four ids are 1, 2, 3, 1000. Your method will never get 1000. The above will almost always get it.
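To see how strongly gaps skew the MAX(id) version, here is a small Python sketch (an illustration outside the database, not part of the query). It enumerates every value that FLOOR(MAX(id) * RAND()) can take for the ids 1, 2, 3, 1000 and records which id the "smallest id above the random number" rule returns. The function name is made up for this example.

```python
from bisect import bisect_right
from collections import Counter

def next_id_distribution(ids):
    """Count how often each id is returned by 'smallest id > r',
    for every r that FLOOR(MAX(id) * RAND()) can produce (0 .. max-1)."""
    ordered = sorted(ids)
    hits = Counter()
    for r in range(ordered[-1]):
        # Index of the first id strictly greater than r.
        hits[ordered[bisect_right(ordered, r)]] += 1
    return hits

dist = next_id_distribution([1, 2, 3, 1000])
print(dist[1000], dist[1], dist[2], dist[3])  # 997 1 1 1
```

Id 1000 is returned for 997 of the 1000 possible random values, so the pick is valid but far from uniform when the gaps are large.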
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed the query.
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
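The two-stage idea can be sketched in Python as follows. This is only a model of the logic, assuming the rows are available as a list of ids; the function name and the None return for an empty sample are choices made for this illustration.

```python
import random

def sample_then_pick(ids, keep_probability, rng):
    """Mimic WHERE RAND() < p ... ORDER BY RAND() LIMIT 1."""
    # Stage 1: keep each row independently with probability p (the WHERE clause).
    survivors = [i for i in ids if rng.random() < keep_probability]
    if not survivors:
        return None  # the query would return no row; the caller has to retry
    # Stage 2: one uniform pick among the survivors (ORDER BY RAND() LIMIT 1).
    return rng.choice(survivors)

rng = random.Random(42)
print(sample_then_pick(range(1, 150_001), 1.0 / 1000, rng))
```

With 150k eligible rows and a probability of 1/1000, about 150 rows survive the first stage. When only a few eligible rows remain, the first stage will often keep nothing, so the probability probably needs to grow as the pool shrinks.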
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
from t
where random >= @random
order by random
limit 10
) union all
(select t.*
from t
where random < @random
order by random desc
limit 10
)
) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, then the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
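As a rough model of the indexed lookup described above, the Python sketch below treats the index as a sorted list of (random_value, row_id) pairs and collects up to `window` neighbours on each side of the probe value before picking one. All names here are hypothetical, and the list merely stands in for the index on the random column.

```python
import random
from bisect import bisect_left

def pick_near(sorted_keys, probe, window, rng):
    """sorted_keys: ascending list of (random_value, row_id) pairs.
    Returns one row_id chosen among up to `window` entries on each side of probe."""
    # First position whose random_value is >= probe.
    pos = bisect_left(sorted_keys, (probe,))
    after = sorted_keys[pos:pos + window]            # random >= probe, ascending
    before = sorted_keys[max(0, pos - window):pos]   # nearest entries with random < probe
    candidates = before + after
    return rng.choice(candidates)[1] if candidates else None

rng = random.Random(7)
index = sorted((rng.random(), row_id) for row_id in range(1, 1001))
print(pick_near(index, rng.random(), 10, rng))
```

Both slices cost a binary search plus a fixed number of entries, which is why the running time should stay roughly flat as the table grows.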
Your ID's are probably gonna contain gaps. Anything that works with COUNT(*) is not going to be able to find all the ID's.
A table with records with ID's 1,2,3,10,11,12,13 has only 7 records. Doing a random with COUNT(*) will often result in a miss as records 4,5 and 6 do not exist, and it will then pick the nearest ID which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the ID's of the table first. Even for 150k rows, loading a set of integer id's will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the ID's. To get the rest of the data, you can select records one at a time or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
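A minimal sketch of the shuffle-and-batch idea, written in Python for brevity (random.shuffle plays the role of Java's Collections.shuffle). It assumes the ids have already been loaded into memory; the function name and the batch size of 10 are illustrative.

```python
import random

def shuffled_batches(ids, batch_size, rng):
    """Shuffle the ids once, then yield them in batches for 'WHERE id IN (...)'."""
    order = list(ids)
    rng.shuffle(order)  # uniform random permutation of all ids
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

batches = list(shuffled_batches(range(1, 26), 10, random.Random(1)))
print([len(b) for b in batches])  # [10, 10, 5]
```

Every id appears in exactly one batch, so no row is skipped and none is processed twice, regardless of gaps in the id sequence.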
If prepared statements can be used, then this should work:
SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.
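The offset approach picks a position among the eligible rows rather than an id value, so gaps in the ids do not matter. A small Python model of that logic, with a hypothetical function name and the eligible rows represented as a list:

```python
import math
import random

def pick_by_offset(eligible_ids, rng):
    """Mimic: skip = FLOOR(RAND() * COUNT(*)), then SELECT ... LIMIT skip, 1."""
    if not eligible_ids:
        return None  # nothing left to process
    skip = math.floor(rng.random() * len(eligible_ids))
    return eligible_ids[skip]  # the row found at that offset

rng = random.Random(3)
ids = [1, 2, 3, 1000]
seen = {pick_by_offset(ids, rng) for _ in range(200)}
print(sorted(seen))
```

Each eligible row is equally likely to be chosen. As far as I know, the server still has to step over the skipped rows to reach the offset, so large offsets are not free.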
I have an Eloquent query that is currently taking about 700ms to run and it will only increase as I add more websites to the user account. I'm trying to see what the best way to optimize it is so that it can run faster.
I really don't want to save the "results" of my calculations and then just fetch those in a smaller query later because they could update at any moment and that would mean they would not be accurate 100% of the time. Although I am pretty sure that would speed up the query, I don't want to sacrifice accuracy over performance.
This is essentially the raw query that runs:
select *
from
( SELECT `positions`.*,
@rank := IF(@group = keyword_id, @rank+1,
1) as rank_e0686ae02a55b8ad75aec0c7aaec0a21,
@group := keyword_id as group_e0686ae02a55b8ad75aec0c7aaec0a21
from
( SELECT @rank:=0, @group:=0 ) as vars,
positions
order by `keyword_id` asc, `created_at` desc
) as positions
where `rank_e0686ae02a55b8ad75aec0c7aaec0a21` <= '2'
and `positions`.`keyword_id` in ('hundreds of IDs listed here')
The query is generated using the solution mentioned here with regards to getting N number of relations per record.
I've tried running a simpler query without the N number of relations per record, and it actually ends up being even slower because it's fetching much more data. So the problem I think is that there are too many IDs that are trying to be matched up in the IN method of the query.
In my controller I have:
$user = auth()->user();
$websites = $user->websitesAndKeywords();
In my User model:
public function websitesAndKeywords() {
$user = auth()->user();
$websites = $user->websites()->orderBy('url')->get();
$websites->load('keywords', 'keywords.latestPositions');
return $websites;
}
I would appreciate any help anyone could provide in helping me speed this thing up.
EDIT: So I think I figured it out. The problem is the IN clause that Laravel uses every time eager loading is used to load relations. So I need to find a way to do a JOIN instead of eager loading.
Essentially need to convert this:
$websites->load('keywords', 'keywords.latestPositions');
Into:
$websites->load(['keywords' => function($query)
{
$query->join('positions', 'keywords.id', '=', 'positions.keyword_id');
}]);
That doesn't work, so I'm not sure what's the best way to do a JOIN on a current collection. Ideally I would also only fetch the latest N positions too and not all data.
Here are indexes on positions table:
And here is what explain returns for the query:
You need, if you don't have it yet, an index positions(keyword_id), or maybe positions(keyword_id,created_at), depending on your data, depending on if you want to keep using "lazy evaluate", and depending on if you want to use the trigger-solution.
And you have to, as Rick suggested, move your keyword_id in... condition into the inner query, as mysql will not be able to push it into the subquery on its own: the optimizer doesn't understand that IF(@group = keyword_id, @rank+1, 1) will not need the other keywords to work properly.
This should give results for tables with several million rows (if you don't want to retrieve them all in IN) in less than 700ms, and might be improved by removing the "lazy evaluate" as Rick also suggested (so you do less table lookups for columns not included in your index), depending on your data.
If you still have troubles, you could however actually precalculate the data without loss of accuracy by using triggers. It will add a (most likely small) overhead to your inserts/updates, so if you insert/update a lot and only query once in a while, you might not want to do it.
For this, you should really use the index positions (keyword_id,created_at).
Add another table keywordrank with the columns keyword_id, rank, primarykeyofpositionstable, primary key keyword_id and rank. You need another table, since in a trigger, mysql can't update other rows in the table you are updating.
Create a trigger that will update these ranks on every insert to your positions-table:
delimiter $$
create trigger tr_positions_after_insert_updateranks after insert on positions
for each row
begin
delete from keywordrank where keyword_id = NEW.keyword_id;
insert into keywordrank (keyword_id, rank, primarykeyofpositionstable)
select NEW.keyword_id, ranks.rank, ranks.position_pk
from
(select NEW.keyword_id,
@rank := @rank+1 as rank,
`positions`.primarykeyofpositionstable as position_pk
from
(SELECT @rank:=0, @group:=0 ) as vars,
positions
where `positions`.keyword_id = NEW.keyword_id
order by `keyword_id` asc, `created_at` desc
) as ranks
where ranks.rank <= 2;
end$$
delimiter ;
If you want to be able to update or delete the entries (or to be safe if you do it at one time, so it might be a good idea anyway), add the same as an update/delete-trigger, just do it for both old.keyword_id and new.keyword_id - and you might want to put the code into a procedure to reuse it then. E.g. create a procedure fctname(kwid int), put the whole trigger code in it but replace all NEW.keyword_id with kwid and then just call fctname(new.keyword_id) for insert, fctname(new.keyword_id) and fctname(old.keyword_id) for update and fctname(old.keyword_id) for delete.
You need to init that table one time (and if you e.g. decide you may need more ranks or another order), you can use any version of your code, e.g.
delete from keywordrank;
insert into keywordrank (keyword_id, rank, primarykeyofpositionstable)
select ranks.keyword_id, ranks.rank, ranks.position_pk
from
( SELECT `positions`.primarykeyofpositionstable as position_pk,
@rank := IF(@group = keyword_id, @rank+1,
1) as rank,
@group := keyword_id as keyword_id
from
( SELECT @rank:=0, @group:=0 ) as vars,
positions
order by `keyword_id` asc, `created_at` desc
) as ranks
where ranks.rank <= 2;
You can put both the trigger(s) and the init in your migration files (without the delimiter).
You then can just use a join to get your desired rows.
Update The code without trigger, using index on (keyword_id, created_at). You can calculate the inner query completely from the index and then only look up the found ids in the tabledata. It depends on the number of rows in your result (in relation to your whole table), how much of an effect removing the lazy evaluate will have.
select positions.*, poslist.rank, poslist.`group`
from positions
join
( SELECT `positions`.id,
@rank := IF(@group = keyword_id, @rank+1,
1) as rank,
@group := keyword_id as `group`
from
( SELECT @rank:=0, @group:=0 ) as vars,
positions
where `positions`.`keyword_id` in ('hundreds of IDs listed here')
order by `keyword_id` asc, `created_at` desc
) as poslist
on positions.id = poslist.id
where poslist.rank <= 2;
Check explain if it actually uses the correct index (keyword_id, created_at). If that is not fast enough, you should try the trigger solution. (Or add the new explain-output and show profile-output to let us have a deeper look.)
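For readers unfamiliar with the user-variable ranking used in these queries, here is the same logic as a small Python sketch: sort by keyword and then by date descending, restart a counter whenever the keyword changes, and keep only rows whose counter is at most N. The dictionary keys mirror the column names from the question; the function name is made up.

```python
from operator import itemgetter

def top_n_per_group(rows, n):
    """rows: dicts with 'keyword_id' and 'created_at'. Returns the newest n per keyword."""
    # Same order as: ORDER BY keyword_id ASC, created_at DESC
    ordered = sorted(rows, key=itemgetter("created_at"), reverse=True)
    ordered.sort(key=itemgetter("keyword_id"))  # stable sort keeps the date order
    result, group, rank = [], None, 0
    for row in ordered:
        # Restart the counter when the keyword changes, otherwise increment it.
        rank = rank + 1 if row["keyword_id"] == group else 1
        group = row["keyword_id"]
        if rank <= n:
            result.append({**row, "rank": rank})
    return result

rows = [
    {"keyword_id": 1, "created_at": 1},
    {"keyword_id": 1, "created_at": 3},
    {"keyword_id": 1, "created_at": 2},
    {"keyword_id": 2, "created_at": 5},
]
print(top_n_per_group(rows, 2))
```

The counter only works because the rows arrive sorted by group, which is why an index that delivers rows in (keyword_id, created_at) order matters so much for the SQL version.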
The code for finding the "top 2 in each grouping" is the best I have ever seen. It is essentially the same as what I have in my blog on such.
However, there are two other things that we may be able to improve on.
Move keyword_id in... from the outer query to the inner. I assume you have an index starting with keyword_id?
"Lazy evaluate". That is, instead of doing SELECT positions.*, ..., do only SELECT id, ... where id is the PRIMARY KEY of positions. Then, in the outer query, JOIN back to positions to get the rest of the columns. Without seeing SHOW CREATE TABLE and knowing what percentage of the table is in the IN list, I can't be sure that this will help much.
The problem
I'm looking at the ranking use case in MySQL but I still haven't settled on a definite "best solution" for it. I have a table like this:
CREATE TABLE mytable (
item_id int unsigned NOT NULL,
# some other data fields,
item_score int unsigned NOT NULL,
PRIMARY KEY (item_id),
KEY item_score (item_score)
) ENGINE=MyISAM;
with a few million records in it, and the most common write operation is to update item_score with a new value. Given an item_id and/or its score, I need to get its ranking, and I currently know two ways to accomplish that.
COUNT() items with higher scores
SELECT COUNT(*) FROM mytable WHERE item_score > $foo;
assign row numbers
SET @rownum := 0;
SELECT rank FROM (
SELECT @rownum := @rownum + 1 AS rank, item_id
FROM mytable ORDER BY item_score DESC ) AS result
WHERE item_id = $foo;
which one?
Do they perform the same or behave differently? If so, why are they different and which one should I choose?
any better idea?
Is there any better / faster approach? The only thing I can come up with is a separate table/memcache/NoSQL/whatever to store pre-calculated rankings, but I still have to sort & read out mytable every time I update it. That makes me think it would be a good approach only if the number of "read rank" queries is (much?) greater than the number of updates; on the other hand, it should be less useful as the number of "read rank" queries approaches the number of update queries.
Since you have indexes on your table, the only queries that make sense are
-- findByScore
SELECT COUNT(*) FROM mytable WHERE item_score > :item_score;
-- findById
SELECT COUNT(*) FROM mytable WHERE item_score > (select item_score from mytable where item_id = :item_id);
For findById, since you only need the rank of one item id, it is not much different from its join counterpart performance-wise.
If you need the rank of many items, then using a join is better.
Using "assign row numbers" cannot compete here because it won't make use of indexes (in your query not at all, and even if we improved that it would still not be as good).
Also, there may be some hidden traps in using "assign row numbers": if there are multiple items with the same score, it will give you the rank of the last one.
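The difference between the two ranking methods on tied scores can be shown with a short Python sketch. The data and function names are invented for illustration, and Python's stable sort stands in for whatever order the database happens to return tied rows in.

```python
def rank_by_count(scores, score):
    """COUNT(*) of items with a strictly higher score (0 means top)."""
    return sum(1 for s in scores if s > score)

def rank_by_row_number(items, item_id):
    """items: list of (item_id, score). Row number after ORDER BY score DESC."""
    ordered = sorted(items, key=lambda it: it[1], reverse=True)
    for rownum, (iid, _) in enumerate(ordered, start=1):
        if iid == item_id:
            return rownum
    return None

items = [("a", 50), ("b", 40), ("c", 40), ("d", 10)]
scores = [s for _, s in items]
print(rank_by_count(scores, 40))        # 1: one item scores higher
print(rank_by_row_number(items, "b"))   # 2
print(rank_by_row_number(items, "c"))   # 3, although c ties with b
```

The COUNT approach gives tied items the same rank, while row numbering gives each of them a different one that depends on the order in which the ties happen to be sorted.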
Unrelated: And please use PDO if possible to be safe from sql injections.
I have two tables: "servers" and "stats"
servers has a column called "id" that auto-increments.
stats has a column called "server" that corresponds to a row in the servers table, a column called "time" that represents the time it was added, and a column called "votes" that I would like to get the average of.
I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a "greatest-n-per-group" question.
This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:
SELECT servers.*,
IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
(SELECT server,
votes
FROM stats
GROUP BY server
ORDER BY time DESC LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id
Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.
Thanks for this great post.
alter table stats add index(server, time)
set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server:= server as dummy
from stats force index(server)
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
edit 1
I just noticed that the above query gives the oldest 24 records for each group.
set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server:= server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
which will give the average of the 24 newest entries for each group
Edit2
@DrAgonmoray
you can try the inner query part first and see if it returns the newest 24 records for each group. In my mysql 5.5, it works correctly.
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server:= server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25
This is another approach.
This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, AVG(t.votes) AS avg_votes
FROM ( SELECT CASE WHEN u.server = @last_server
THEN @i := @i + 1
ELSE @i := 1
END AS i
, @last_server := u.server AS `server`
, u.votes AS votes
FROM (SELECT @i := 0, @last_server := NULL) i
JOIN ( SELECT v.server, v.votes
FROM stats v
ORDER BY v.server DESC, v.time DESC
) u
) t
WHERE t.i <= 24
GROUP BY t.server
) s
ON s.server = r.id
What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u.)
With the sorted result set, we assign row numbers 1,2,3, etc. to each row for each server. (Inline view aliased as t.)
With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes column for the "latest" 24 rows for each server. (Inline view aliased as s.)
As a final step, we join that to the servers table, to return the requested resultset.
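The steps just described (sort per server by time descending, number the rows, keep the first 24, average) can be mirrored in a few lines of Python. This is a sketch of the logic only, assuming the stats rows fit in memory as (server, time, votes) tuples; the function name is hypothetical.

```python
from collections import defaultdict

def latest_n_average(stats, n=24):
    """stats: iterable of (server, time, votes). Returns {server: avg of newest n votes}."""
    by_server = defaultdict(list)
    for server, time, votes in stats:
        by_server[server].append((time, votes))
    averages = {}
    for server, rows in by_server.items():
        rows.sort(key=lambda r: r[0], reverse=True)        # newest first
        votes = [v for _, v in rows[:n] if v is not None]  # AVG() skips NULL
        averages[server] = sum(votes) / len(votes) if votes else None
    return averages

stats = [(1, t, t) for t in range(1, 31)] + [(2, 1, 10)]
print(latest_n_average(stats))  # {1: 18.5, 2: 10.0}
```

Servers with no stats rows do not appear in the result; the LEFT JOIN with IFNULL in the query is what turns those into 0.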
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.
To improve performance, there are several approaches we could take.
The simplest might be to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (e.g. rows with time values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted to determine the "latest" 24 rows.
Also, with an index on stats(server,time), it's also possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.
We could also consider implementing an index on the stats table on (server,"reverse_time"). Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value: a "reverse time" expression that is ascending for descending values of time (for example, -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND,'1970-01-01',my_datetime)).
Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest rows" won't be deleted from the stats table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats table, we check if the time on the new row is later than the earliest time stored for the server in the shadow table. If it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.
And, yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, and run a separate query against the stats table to get the average votes for the latest 24 rows, and gather all of those results together. (That approach might really be more of a workaround to avoid a sort on a huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)
The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.
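The shadow-table idea mentioned above boils down to "keep the N newest rows per server, replacing the earliest one when a later row arrives". A minimal in-memory Python sketch of that bookkeeping follows, using a min-heap per server. The class and method names are invented, votes are assumed to be numeric, and only inserts are handled (no deletes), in line with the assumption stated for the trigger.

```python
import heapq
from collections import defaultdict

class LatestRows:
    """Keeps at most `limit` newest (time, votes) rows per server."""

    def __init__(self, limit=24):
        self.limit = limit
        self.rows = defaultdict(list)  # server -> min-heap ordered by time

    def insert(self, server, time, votes):
        heap = self.rows[server]
        if len(heap) < self.limit:
            heapq.heappush(heap, (time, votes))
        elif time > heap[0][0]:
            # New row is later than the earliest stored row: replace it.
            heapq.heapreplace(heap, (time, votes))

    def average(self, server):
        heap = self.rows.get(server)
        return sum(v for _, v in heap) / len(heap) if heap else 0

lr = LatestRows(limit=3)
for t in [5, 1, 4, 2, 3, 6]:
    lr.insert("s", t, t * 10)
print(lr.average("s"))  # 50.0, the average over times 4, 5 and 6
```

Reading the average then touches at most `limit` rows per server, so no sort over the full stats table is needed at query time.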
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even specifying the FORCE INDEX FOR ORDER BY (stats_ix1) hint.
If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, t.tot/NULLIF(t.cnt,0) AS avg_votes
FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
, @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
, @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0) ,@tot := 0),@tot := IFNULL(v.votes,0) ) AS tot
, @last_server := v.server AS SERVER
-- , v.time
-- , v.votes
-- , @tot/NULLIF(@cnt,0) AS avg_sofar
FROM (SELECT @last_server := NULL, @num:= 0, @cnt := 0, @tot := 0) u
JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
ORDER BY v.server DESC, v.time DESC
) t
WHERE t.num = 24
) s
ON s.server = r.id
With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, with an index on (server,time), MySQL used the index if I included an index hint; with the FORCE INDEX FOR ORDER BY (stats_ix1) hint, MySQL avoided a filesort as well. (But since my table had less than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren't needed, but they are for debugging.
The way that query stands, it needs at least 24 rows in stats for each server, in order to return an average. (That may be acceptable.) But I was thinking that in general, we could return a running total, total so far (tot) and a running count (cnt).
(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)
To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.
Try this solution, with the top-n-per-group technique in the INNER JOIN subselect credited to Bill Karwin and his post about it here.
SELECT
a.*,
AVG(b.votes) AS avgvotes
FROM
servers a
INNER JOIN
(
SELECT
aa.server,
aa.votes
FROM
stats aa
LEFT JOIN stats bb ON
aa.server = bb.server AND
aa.time < bb.time
GROUP BY
aa.server, aa.time
HAVING
COUNT(*) < 24
) b ON a.id = b.server
GROUP BY
a.id
I'm developing a scoreboard of sorts. The table structure is ID, UID, points with UID being linked to a users account.
Now, I have this working somewhat, but I need one specific thing for this query to be pretty much perfect. To pick a user based on rank.
I'll show you my SQL.
SELECT *, @rownum := @rownum + 1 AS `rank` FROM
(SELECT * FROM `points_table` `p`
ORDER BY `p`.`points` DESC
LIMIT 1)
`user_rank`,
(SELECT @rownum := 0) `r`, `accounts_table` `a`, `points_table` `p`
WHERE `a`.`ID` = `p`.`UID`
It's simple to have it pick people out by UID, but that's no good. I need this to pull the user by their rank (which is a, um, fake field ^_^' created on the fly). This is a bit too complex for me as my SQL knowledge is enough for simple queries, I have never delved into alias' or nested queries, so you'll have to explain fairly simply so I can get a grasp.
I think there are two problems here. From what I can gather, you want to do a join on two tables, order them by points and then return the nth record.
I've put together an UNTESTED query. The inner query does a join on the two tables and the outer query specifies that only a specific row is returned.
This example returns the 4th row.
SELECT * FROM
(SELECT *, @rownum := @rownum + 1 AS rank
FROM `points_table` `p`
JOIN `accounts_table` `a` ON a.ID = p.UID,
(SELECT @rownum:=0) r
ORDER BY `p`.`points` DESC) mytable
WHERE rank = 4
Hopefully this works for you!
I've made a change to the answer which should hopefully resolve that problem. Incidentally, whether you use a php or mysql to get the rank, you are still putting a heavy strain on resources. Before mysql can calculate the rank it must create a table of every user and then order them. So you are just moving the work from one area to another. As the number of users increases, so too will the query execution time regardless of your solution. MySQL will probably take slightly longer to perform calculations which is why PHP is probably a more ideal solution. But I also know from experience, that sometimes extraneous details prevent you from having a completely elegant solution. Hope the altered code works.