Alright, here's a simple enough question about indices and subqueries. I'm using MariaDB 5.5.36 with MyISAM. Here's my table structure for leaderboard; it contains about 50 million rows across ~2000 levels.
CREATE TABLE leaderboard (
    userid INT,
    levelid INT,
    score INT,
    INDEX (userid, levelid),
    INDEX (levelid, score)
) ENGINE=MyISAM;
This query, meant to return the rank of each of a given user's scores within its level, runs very slowly...
SELECT levelid, (
    SELECT COUNT(*) + 1
    FROM leaderboard
    WHERE score > l.score AND levelid = l.levelid
) AS rank
FROM leaderboard AS l
WHERE userid = 12345;
I've tried a self-join/GROUP BY approach as well, which runs in half the time of the above but is still unacceptably slow:
SELECT x.levelid, COUNT(y.score) AS rank
FROM leaderboard AS x
LEFT JOIN leaderboard AS y ON x.levelid = y.levelid AND y.score > x.score
WHERE x.userid = {0}
GROUP BY x.levelid;
... while this alternative runs about 100x faster (pseudocode: looping over the results in the application outside the DB, or in a stored procedure, and running the subquery separately ~2000 times with constants):
results = execute("SELECT levelid, score
                   FROM leaderboard
                   WHERE userid = 12345");
for each row in results:
    execute("SELECT COUNT(*) + 1
             FROM leaderboard
             WHERE score > %d AND levelid = %d".printf(row.score, row.levelid));
EXPLAIN tells me that the subquery in the slow example has a key_len of 4 bytes (just levelid), while the fast version uses 8 (levelid, score). Interesting side note: if "score > l.score" is replaced with "score = l.score", it switches to using all 8 bytes, but obviously that doesn't give me the answer I'm looking for.
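For reference, this is roughly the probe I'm comparing against; the constants 42 and 1000 are hypothetical stand-ins for l.levelid and l.score:

EXPLAIN
SELECT COUNT(*) + 1
FROM leaderboard
WHERE levelid = 42 AND score > 1000;
-- with both values bound to constants, key_len shows 8 (levelid + score);
-- with the correlated l.score it drops to 4 (levelid only)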
Is there something I'm not understanding about how the index fundamentally works? Is there a better way to write this ranking query? Would it be more efficient to add a rank column to my table and update it every time a highscore is achieved (that could mean updating up to 400k rows for one single score achievement)?
Related
I have a table USERS with only one column, USER_ID. There are more than 200M IDs; they are not consecutive and are not ordered. The table has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google BigQuery, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) What is the row number for a particular USER_ID (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It runs fast but returns row=1, because the counting is applied only to the rows in the filtered result set, not to the whole ordered table.
SELECT USER_ID, @row := @row + 1 AS row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) AS t WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to give a result. Why? Isn't it supposed to be fast if there's an index?
In BigQuery I couldn't find a way to do this, because LIMIT init, num_rows doesn't even exist there.
I could order the table into a new one and add a column called RANK that numbers the USER_IDs, with an INDEX on it. But it would be a mess to maintain whenever I add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2), your question was: "Why? Isn't it supposed to be fast if it has an index?" Yes, it will use the index. But then it has to count up to row 150,000,000,000 using an index scan; that, by the way, is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It will take time, and more time still if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
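For instance, here is a minimal sketch of that idea, using a hypothetical USERS_RANKED table (it has to be rebuilt, or carefully maintained, whenever USER_IDs are added or removed):

-- materialize the ordering as an explicit row number
CREATE TABLE USERS_RANKED (
    row_num BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    USER_ID BIGINT NOT NULL,
    UNIQUE KEY (USER_ID)
);

INSERT INTO USERS_RANKED (USER_ID)
SELECT USER_ID FROM USERS ORDER BY USER_ID;

-- (1) row number for a given USER_ID
SELECT row_num FROM USERS_RANKED WHERE USER_ID = 100001366260516;

-- (2) USER_ID sitting at a given row
SELECT USER_ID FROM USERS_RANKED WHERE row_num = 150000000;

Both lookups then become single index seeks instead of scans.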
I want to have randomized rows after a query, but using ORDER BY RAND() is just exhausting on a table that has 120k+ rows. I have found a small solution that outputs a given number of rows, but it works by picking a random starting index and returning the rows after it. It is pretty fast, but it only returns consecutive rows following that random index. The code goes like:
SELECT *
FROM lieky AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(col_0)
FROM lieky)) AS id)
AS r2
WHERE r1.col_0 >= r2.id
ORDER BY r1.col_0 ASC
LIMIT 100
and I found it here: http://jan.kneschke.de/projects/mysql/order-by-rand/
Is there something that would help me?
I am trying to get randomized data into pagination, so that when the user queries the database, he always gets the rows in a random order.
Thanks for the help.
It should be noted that
(SELECT (RAND() * (SELECT MAX(col_0) FROM lieky)) AS id)
can return a value at or near MAX(col_0), in which case the filter WHERE r1.col_0 >= r2.id leaves you with only one row (or very few).
I think a good solution would be something like this:
add two columns, groupId int and seed int, and an index (groupId, seed)
every X seconds (maybe every hour, or every day) run a script that recalculates those two columns (see the sketch after this list)
when the user opens your row list for the first time (or when you want to re-randomize the items), save a random groupId in the user's session; groupId can be anything from 0 to (select max(groupId) from lieky)
to show rows, use a query like select * from lieky where groupId = %saved groupId% order by seed limit x, 100; it should be very fast
The recalc script will be rather slow, so it's a good idea to run it at night.
You can update seed using:
update lieky set seed = rand()*1000000
Then set groupId = 0 for the first N rows, groupId = 1 for the following N rows, and so on.
N is the maximum number of rows you might show to a user: (max_page) * (per_page_count).
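A rough sketch of that recalc script, assuming N = 1000 and the two columns added above (the chunk size and column names are placeholders):

-- re-shuffle the seeds, then number the rows in seed order
-- and chunk them into groups of N = 1000
UPDATE lieky SET seed = FLOOR(RAND() * 1000000);

SET @n := -1;
UPDATE lieky
SET groupId = FLOOR((@n := @n + 1) / 1000)
ORDER BY seed;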
I have two tables: "servers" and "stats"
servers has a column called "id" that auto-increments.
stats has a column called "server" that corresponds to a row in the servers table, a column called "time" that represents the time it was added, and a column called "votes" that I would like to get the average of.
I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a "greatest-n-per-group" question.
This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:
SELECT servers.*,
IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
(SELECT server,
votes
FROM stats
GROUP BY server
ORDER BY time DESC LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id
Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.
Thanks for this great post.
ALTER TABLE stats ADD INDEX (server, time);
set @num := 0, @server := '';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
    select server,
           time, votes,
           @num := if(@server = server, @num + 1, 1) as row_number,
           @server := server as dummy
    from stats force index(server)
    group by server, time
    having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
Edit 1
I just noticed that the above query gives the oldest 24 records for each group.
set @num := 0, @server := '';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
    select server,
           time, votes,
           @num := if(@server = server, @num + 1, 1) as row_number,
           @server := server as dummy
    from (select * from stats order by server, time desc) as t
    group by server, time
    having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
which will give the average of the 24 newest entries for each group.
Edit 2
@DrAgonmoray
You can try the inner query part first and see if it returns the newest 24 records for each group. In my MySQL 5.5 it works correctly.
select server,
       time, votes,
       @num := if(@server = server, @num + 1, 1) as row_number,
       @server := server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25
This is another approach.
This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)
SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT
  JOIN ( SELECT t.server
              , AVG(t.votes) AS avg_votes
           FROM ( SELECT CASE WHEN u.server = @last_server
                              THEN @i := @i + 1
                              ELSE @i := 1
                          END AS i
                       , @last_server := u.server AS `server`
                       , u.votes AS votes
                    FROM (SELECT @i := 0, @last_server := NULL) i
                    JOIN ( SELECT v.server, v.votes
                             FROM stats v
                            ORDER BY v.server DESC, v.time DESC
                         ) u
                ) t
          WHERE t.i <= 24
          GROUP BY t.server
       ) s
    ON s.server = r.id
What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u.)
With the sorted result set, we assign row numbers 1, 2, 3, etc. to each row for each server. (Inline view aliased as t.)
With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes column for the "latest" 24 rows for each server. (Inline view aliased as s.)
As a final step, we join that to the servers table, to return the requested resultset.
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.
To improve performance, there are several approaches we could take.
The simplest might be to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (e.g. rows with time values more than 2 days old, or more than 2 weeks old). That would significantly reduce the number of rows that need to be sorted to determine the "latest" 24 rows; a sketch follows.
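For example (a sketch using a 2-day window), the innermost inline view that feeds the sort could be restricted like this:

-- only rows from the last 2 days get sorted and numbered; a server whose
-- latest 24 rows are all older than that would drop out of the average
SELECT v.server, v.votes
  FROM stats v
 WHERE v.time >= NOW() - INTERVAL 2 DAY
 ORDER BY v.server DESC, v.time DESC;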
Also, with an index on stats(server,time), it's also possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.
We could also consider implementing an index on the stats table on (server, "reverse_time"). Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value: a "reverse time" expression that is ascending for descending values of time (for example, -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND,'1970-01-01',my_datetime)).
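A sketch of that (the rtime column and index name are made up; without generated columns in this MySQL version, the value has to be maintained by the application or a trigger):

ALTER TABLE stats
  ADD COLUMN rtime INT,
  ADD INDEX stats_ix_rtime (server, rtime);

-- backfill: ascending rtime corresponds to descending time
UPDATE stats SET rtime = -1 * UNIX_TIMESTAMP(time);

An ascending range scan on (server, rtime) then returns each server's rows newest-first without a filesort.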
Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest" rows won't be deleted from the stats table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats table, we check whether the time on the new row is later than the earliest time stored for that server in the shadow table; if it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.
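A sketch of that shadow-table idea (the stats_latest table and the trigger name are made up; it assumes rows are only ever inserted into stats and that (server, time) is unique):

CREATE TABLE stats_latest (
  server INT NOT NULL,
  time   DATETIME NOT NULL,
  votes  INT,
  PRIMARY KEY (server, time)
);

DELIMITER //
CREATE TRIGGER stats_latest_ai AFTER INSERT ON stats
FOR EACH ROW
BEGIN
  INSERT INTO stats_latest (server, time, votes)
  VALUES (NEW.server, NEW.time, NEW.votes);
  -- trim back to the 24 newest rows for this server
  DELETE FROM stats_latest
   WHERE server = NEW.server
     AND time < ( SELECT cutoff
                    FROM ( SELECT time AS cutoff
                             FROM stats_latest
                            WHERE server = NEW.server
                            ORDER BY time DESC
                            LIMIT 1 OFFSET 23
                         ) x );
END//
DELIMITER ;

The averaging query can then read stats_latest directly, which never holds more than 24 rows per server.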
And, yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, run a separate query against the stats table to get the average votes for the latest 24 rows, and gather all of those results together. (That approach might really be more of a workaround to avoid a sort on a huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)
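A rough sketch of that procedural workaround (procedure, cursor, and temp-table names are all made up):

DELIMITER //
CREATE PROCEDURE server_avg_votes()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_id INT;
  DECLARE cur CURSOR FOR SELECT id FROM servers;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  DROP TEMPORARY TABLE IF EXISTS tmp_avg;
  CREATE TEMPORARY TABLE tmp_avg (server INT PRIMARY KEY, avg_votes DECIMAL(10,4));

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_id;
    IF done THEN LEAVE read_loop; END IF;
    -- per-server probe; with an index on (server, time) this is a short range scan
    INSERT INTO tmp_avg
    SELECT v_id, IFNULL(AVG(t.votes), 0)
      FROM ( SELECT votes
               FROM stats
              WHERE server = v_id
              ORDER BY time DESC
              LIMIT 24 ) t;
  END LOOP;
  CLOSE cur;

  SELECT r.*, a.avg_votes
    FROM servers r
    JOIN tmp_avg a ON a.server = r.id;
END//
DELIMITER ;

CALL server_avg_votes() then returns the same shape of resultset as the query above.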
The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even when specifying the FORCE INDEX FOR ORDER BY (stats_ix1) hint.
If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)
SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT
  JOIN ( SELECT t.server
              , t.tot/NULLIF(t.cnt,0) AS avg_votes
           FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
                       , @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
                       , @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0) ,@tot := 0),@tot := IFNULL(v.votes,0) ) AS tot
                       , @last_server := v.server AS SERVER
                    -- , v.time
                    -- , v.votes
                    -- , @tot/NULLIF(@cnt,0) AS avg_sofar
                    FROM (SELECT @last_server := NULL, @num := 0, @cnt := 0, @tot := 0) u
                    JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
                   ORDER BY v.server DESC, v.time DESC
                ) t
          WHERE t.num = 24
       ) s
    ON s.server = r.id
With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, with just an index on (server,time), MySQL used the index if I included an index hint; with the FORCE INDEX FOR ORDER BY (stats_ix1) hint, MySQL avoided a filesort as well. (But since my table had fewer than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren't needed, but they are for debugging.
The way that query stands, it needs at least 24 rows in stats for each server in order to return an average. (That may be acceptable.) But I was thinking that, in general, we could return a running total (tot) and a running count (cnt).
(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)
To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.
Try this solution, with the top-n-per-group technique in the INNER JOIN subselect credited to Bill Karwin and his post about it here.
SELECT
a.*,
AVG(b.votes) AS avgvotes
FROM
servers a
INNER JOIN
(
SELECT
aa.server,
aa.votes
FROM
stats aa
LEFT JOIN stats bb ON
aa.server = bb.server AND
aa.time < bb.time
GROUP BY
aa.server, aa.time
HAVING
COUNT(*) < 24
) b ON a.id = b.server
GROUP BY
a.id
I have a website where people are saving highscores in games. I've been trying to assign a rank to a player's highscore by finding the number of other highscores that are higher. So if the query finds 5 people with higher highscores, the player's highscore rank will be 6.
The thing is that more than one score (for the same game) can be recorded in the database for every user. This is why I'm using GROUP BY user when I want to display only a player's best score in a game (not all of his scores in that game).
Here's the code I am using now. It doesn't work: the query always seems to return 2 (as if it were always finding exactly one score higher than the player's highscore). If I remove the GROUP BY user from the temp subquery, it returns a half-correct value, since it counts every score (not just each player's best) from every player in the given game.
$count3 = mysql_result(mysql_query("SELECT COUNT(*) + 1 as Num FROM (SELECT * FROM ava_highscores WHERE game = $id AND leaderboard = $get_leaderboard[leaderboard_id] AND score > '$highscore2[score]') temp GROUP BY user"), 0);
When you use GROUP BY then COUNT returns a count of rows per group rather than a single count of all rows in the result set. Use COUNT(DISTINCT ...) instead. Also you don't actually need the inner select. You can write it all as a single query:
SELECT COUNT(DISTINCT `user`) + 1 AS Num
FROM ava_highscores
WHERE game = '3'
AND leaderboard = '5'
AND score > '1000'
Notes
Make sure that your score column is a numeric type (not a varchar type) so that the comparison works correctly.
Adding an index on (game, leaderboard, score) will make the query more efficient. The order of the columns in the index is also important.
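A sketch of both notes (the index name and the exact column type are assumptions):

-- score should be numeric, not varchar, so that > compares numbers
ALTER TABLE ava_highscores MODIFY score INT NOT NULL;

-- two equality columns first, the range column (score) last
ALTER TABLE ava_highscores
  ADD INDEX idx_game_leaderboard_score (game, leaderboard, score);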
I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet about how the selection of the rows to show should be as compact as possible, thus: just the ids; producing the complete result should then fetch all the data you want for only the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statement, or if, against hope, it doesn't improve anything, it might be worthwhile to break this single statement into multiple statements and capture the ids into a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate indexing table if your table already has a primary key index. If so, then you can order by this primary key and then use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID, so that it can simply read the index and doesn't have to then locate all the data (reducing IO overhead). If you need some of the other columns, then perhaps you could add these to the index so that they are read along with the primary key (which will most likely be held in memory and therefore not require a disc lookup), although this will not be appropriate for all cases so you will have to have a play.
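A sketch of that combination (keyset paging plus a covering index; the columns name and created_at, the index name, and the LIMIT 100 page size are all just examples):

-- covering index: the query below can be answered from the index alone
ALTER TABLE myBigTable ADD INDEX idx_paging (id, name, created_at);

-- :OFFSET is the last id seen on the previous page
SELECT id, name, created_at
  FROM myBigTable
 WHERE id > :OFFSET
 ORDER BY id ASC
 LIMIT 100;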
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. There were two parts to the fix. First, I had to use an inner select in my FROM clause that did the limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the from part of my query (the opening of the builder chain was not in my snippet; DB::query() below is an assumed entry point):

$query = DB::query()   // assumed entry point; the original snippet began mid-chain
    ->select(
        'titles.id',
        'title_eisbns_concat.eisbns_concat',
        'titles.pub_symbol',
        'titles.title',
        'titles.subtitle',
        'titles.contributor1',
        'titles.publisher',
        'titles.epub_date',
        'titles.ebook_price',
        'publisher_licenses.id as pub_license_id',
        'license_types.shortname',
        $coversQuery
    )
    ->from($subQuery)
    ->leftJoin('titles', 't.id', '=', 'titles.id')
    ->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
    ->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
    ->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
    ->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id');
The first time I created this query I had used OFFSET and LIMIT in MySQL. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySQL hasn't sped up OFFSET, but BETWEEN seems to reel it back in.
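In plain SQL, the shape of the final statement is roughly this (a sketch: only the deferred-join core is shown, with made-up id bounds, and the other joins from above left out):

SELECT titles.*
  FROM ( SELECT id
           FROM titles
          WHERE id BETWEEN 10000 AND 10999
          ORDER BY title ) AS t
  LEFT JOIN titles ON titles.id = t.id;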