MySQL - Average most recent columns in other table

I have two tables: "servers" and "stats"
servers has a column called "id" that auto-increments.
stats has a column called "server" that corresponds to a row in the servers table, a column called "time" that represents the time it was added, and a column called "votes" that I would like to get the average of.
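For reference, a minimal sketch of the schema as described (the column types are assumptions, not from the original post):

CREATE TABLE servers (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
  -- other server columns omitted
);

CREATE TABLE stats (
  server INT NOT NULL,     -- references servers.id
  time   DATETIME NOT NULL, -- when the sample was added
  votes  INT                -- the value to average
);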
I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a "greatest-n-per-group" question.
This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:
SELECT servers.*,
IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
(SELECT server,
votes
FROM stats
GROUP BY server
ORDER BY time DESC LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id
Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.

Thanks for this great post.
alter table stats add index (server, time);
set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from stats force index(server)
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
Edit 1
I just noticed that the above query gives the oldest 24 records for each group.
set @num:=0, @server:='';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
which will give the average of the 24 newest entries for each group.
Edit 2
@DrAgonmoray
You can try the inner query part first and see if it returns the newest 24 records for each group. In my MySQL 5.5, it works correctly.
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25

This is another approach.
This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, AVG(t.votes) AS avg_votes
FROM ( SELECT CASE WHEN u.server = @last_server
THEN @i := @i + 1
ELSE @i := 1
END AS i
, @last_server := u.server AS `server`
, u.votes AS votes
FROM (SELECT @i := 0, @last_server := NULL) i
JOIN ( SELECT v.server, v.votes
FROM stats v
ORDER BY v.server DESC, v.time DESC
) u
) t
WHERE t.i <= 24
GROUP BY t.server
) s
ON s.server = r.id
What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u.)
With the sorted result set, we assign a row numbers 1,2,3, etc. to each row for each server. (Inline view aliased as t.)
With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes column for the "latest" 24 rows for each server. (Inline view aliased as s.)
As a final step, we join that to the servers table, to return the requested resultset.
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.
To improve performance, there are several approaches we could take.
The simplest might be to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (e.g. rows with time values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted to determine the "latest" 24 rows.
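For example, a hedged sketch of such a predicate on the rows that feed the sort (the two-week horizon is an assumption):

SELECT v.server, v.votes
FROM stats v
WHERE v.time >= NOW() - INTERVAL 14 DAY  -- assumed cutoff; tune to the data
ORDER BY v.server DESC, v.time DESC

A server whose latest 24 rows do not all fall inside the horizon would then be averaged over fewer rows.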
Also, with an index on stats(server,time), it's possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.
We could also consider implementing an index on the stats table on (server, "reverse_time"). Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value: a "reverse time" expression that is ascending for descending values of time (for example, -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND, '1970-01-01', my_datetime)).
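A hedged sketch of that reverse-time index (this assumes MySQL 5.7+ generated columns and a DATETIME time column, and the names are made up; on older versions, rtime would be a plain column maintained by the application or a trigger):

ALTER TABLE stats
  ADD COLUMN rtime BIGINT AS (-1 * TIMESTAMPDIFF(SECOND, '1970-01-01', time)) STORED,
  ADD INDEX stats_rev_ix (server, rtime);

-- The "latest 24" rows for one server become an ascending range scan:
SELECT votes FROM stats WHERE server = 42 ORDER BY rtime LIMIT 24;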
Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest rows" won't be deleted from the stats table. We could maintain that table with a trigger. Basically, whenever a row is inserted into the stats table, we check whether the time on the new row is later than the earliest time stored for that server in the shadow table; if it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.
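A hedged sketch of that trigger approach (the shadow-table name is made up, and it assumes rows are never deleted from stats and that (server, time) is unique):

CREATE TABLE stats_latest LIKE stats;

DELIMITER $$
CREATE TRIGGER stats_latest_ai AFTER INSERT ON stats
FOR EACH ROW
BEGIN
  INSERT INTO stats_latest (server, time, votes)
  VALUES (NEW.server, NEW.time, NEW.votes);
  -- Trim back to the 24 newest rows for this server; the extra derived
  -- table (x) works around MySQL's restriction on referencing the
  -- DELETE target table in a subquery.
  DELETE FROM stats_latest
   WHERE server = NEW.server
     AND time < ( SELECT t24 FROM
                  ( SELECT time AS t24
                      FROM stats_latest
                     WHERE server = NEW.server
                     ORDER BY time DESC
                     LIMIT 1 OFFSET 23 ) x );
END$$
DELIMITER ;

The report query then collapses to a simple aggregate: SELECT server, AVG(votes) FROM stats_latest GROUP BY server.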
And, yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, run a separate query against the stats table to get the average votes for the latest 24 rows, and gather all of those results together. (That approach might really be more of a workaround to avoid a sort on a huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)
The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even specifying the FORCE INDEX FOR ORDER BY (stats_ix1) hint.
If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, t.tot/NULLIF(t.cnt,0) AS avg_votes
FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
, @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
, @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0) ,@tot := 0),@tot := IFNULL(v.votes,0) ) AS tot
, @last_server := v.server AS SERVER
-- , v.time
-- , v.votes
-- , @tot/NULLIF(@cnt,0) AS avg_sofar
FROM (SELECT @last_server := NULL, @num := 0, @cnt := 0, @tot := 0) u
JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
ORDER BY v.server DESC, v.time DESC
) t
WHERE t.num = 24
) s
ON s.server = r.id
With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, with just an index on (server,time), MySQL used the index when I included the FORCE INDEX FOR ORDER BY (stats_ix1) hint, and avoided a filesort as well. (But since my table had less than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren't needed, but they are for debugging.
The way that query stands, it needs at least 24 rows in stats for each server in order to return an average. (That may be acceptable.) But I was thinking that, in general, we could return a running total (tot) and a running count (cnt).
(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)
To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.

Try this solution, with the top-n-per-group technique in the INNER JOIN subselect credited to Bill Karwin and his post about it here.
SELECT
a.*,
AVG(b.votes) AS avgvotes
FROM
servers a
INNER JOIN
(
SELECT
aa.server,
aa.votes
FROM
stats aa
LEFT JOIN stats bb ON
aa.server = bb.server AND
aa.time < bb.time
GROUP BY
aa.server, aa.time
HAVING
COUNT(*) < 24
) b ON a.id = b.server
GROUP BY
a.id
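The inner subselect uses a self-join to count, for each stats row, how many rows for the same server are newer (aa.time < bb.time); keeping only the rows for which that count is below 24 leaves the 24 most recent rows per server, which the outer query then joins to servers and averages.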

Related

MySQL select AVG, ORDER BY, GROUP BY & LIMIT

The below statement does not work, but I can't seem to figure out why.
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the averages of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR.
If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL doesn't provide support for analytic functions. It is possible to emulate one in MySQL, but the syntax is a bit involved, and it is dependent on behavior that is not guaranteed.
As an example:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT @row_ := IF(@prev_row_type = t.row_type, @row_ + 1, 1) AS row_
, @prev_row_type := t.row_type AS row_type
, t.delay_in_seconds
FROM A_table t
CROSS
JOIN (SELECT @prev_row_type := NULL, @row_ := NULL) i
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP
BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What it's effectively doing is assigning a row number to each row. The ORDER BY sorts the rows in descending sequence by created_at; what we want is for the most recent row to be assigned a value of 1, the next most recent 2, etc. This numbering of rows is repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type, created_at, delay_in_seconds) to avoid an expensive "Using filesort" operation. We need at least the first two columns for that; including delay_in_seconds makes it a covering index (the query can be satisfied entirely from the index).
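For instance, a sketch of that index (the index name is an assumption):

ALTER TABLE A_table
  ADD INDEX a_table_ix1 (row_type, created_at, delay_in_seconds);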
The outer query then runs against the resultset returned from the inline view query (a "derived table"). The predicate in the WHERE clause filters out all rows that were assigned a row number greater than 1000; the rest is a straightforward GROUP BY and an AVG aggregate.
A LIMIT clause is entirely unnecessary. It may be possible to incorporate some additional predicates for a further performance gain... like, what if we wanted the most recent 1000 rows, but only those with created_at within the past 30 or 90 days?
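A hedged sketch of that variant (the 90-day horizon is an assumption); the predicate goes inside the inline view, so far fewer rows need to be sorted and numbered:

SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT @row_ := IF(@prev_row_type = t.row_type, @row_ + 1, 1) AS row_
, @prev_row_type := t.row_type AS row_type
, t.delay_in_seconds
FROM A_table t
CROSS
JOIN (SELECT @prev_row_type := NULL, @row_ := NULL) i
WHERE t.created_at >= NOW() - INTERVAL 90 DAY  -- assumed horizon
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP BY s.row_type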
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another approach. It's possible to use a "join" operation (of A_table with A_table) to get a row number assigned, by getting a COUNT of the number of rows that are "more recent" than each row. With large sets, however, that can produce a humongous intermediate result if we aren't careful to limit it.
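A hedged sketch of that self-join approach (it assumes created_at is unique within each row_type; duplicate timestamps would inflate the counts):

SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT a.row_type
, a.created_at
, a.delay_in_seconds
FROM A_table a
LEFT
JOIN A_table b
ON b.row_type = a.row_type
AND b.created_at > a.created_at
GROUP BY a.row_type, a.created_at, a.delay_in_seconds
HAVING COUNT(b.row_type) < 1000  -- fewer than 1000 rows are newer
) s
GROUP BY s.row_type

Without some predicate on created_at, the join can produce on the order of n-squared intermediate rows per row_type, which is the "humongous intermediate result" mentioned above.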
Write the ORDER BY clause at the end of the statement:
SELECT AVG(delay_in_seconds) from A_TABLE GROUP BY row_type ORDER by created_at DESC limit 1000;
See the MySQL dev site for details.

MySQL index being ignored for subquery range

Alright, here's a simple enough question about indices and subqueries. I'm using MariaDB 5.5.36 + MyISAM, here's my table structure for leaderboard. It contains about 50 million rows across ~2000 levels.
CREATE TABLE leaderboard (
  userid  INT,
  levelid INT,
  score   INT,
  INDEX (userid, levelid),
  INDEX (levelid, score)
);
This query, meant to return the rank of each score in a level for a given user, runs very slowly...
SELECT levelid, (
SELECT COUNT(*) + 1
FROM leaderboard
WHERE score > l.score AND levelid = l.levelid
) AS rank
FROM leaderboard AS l
WHERE userid = 12345;
I've tried using a self-join group approach as well, which runs in half the time of the above but is still unacceptably slow:
SELECT x.levelid, COUNT(y.score) AS rank
FROM leaderboard AS x
LEFT JOIN leaderboard AS y ON x.levelid = y.levelid AND y.score > x.score
WHERE x.userid = {0}
GROUP BY x.levelid;
... while this alternative runs about 100x faster (pseudocode, looping over the results in an application outside the DB or in a stored procedure or something and running the subquery separately 2000 times with a constant):
results = execute("SELECT levelid, score
FROM leaderboard
WHERE userid = 12345");
for each row in results:
execute("SELECT COUNT(*) + 1
FROM leaderboard
WHERE score > %d AND levelid = %d
".printf(row.score, row.levelid));
EXPLAIN tells me that the subquery in the slow example has a key_len of 4 bytes (just levelid) while the fast version uses 8 (levelid, score). Interesting side note, if "score > l.score" is replaced with "score = l.score" it switches to using all 8, but obviously that doesn't give me the answer I'm looking for.
Is there something I'm not understanding about how the index fundamentally works? Is there a better way to write this ranking query? Would it be more efficient to add a rank column to my table and update it every time a highscore is achieved (that could mean updating up to 400k rows for one single score achievement)?

How to query a table with over 200 million rows?

I have a table USERS with only one column, USER_ID. There are more than 200M of these IDs; they are not consecutive and are not ordered. The table has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google Big Query, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) Which is the row number for a particular USER_ID (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It goes fast, but it returns row=1 because the row counting applies only to the filtered result set.
SELECT USER_ID, @row := @row + 1 AS row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) t WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes in giving a result. Why? Isn't it supposed to be fast if it has an index?
In Big Query, I didn't find a way, because LIMIT init, num_rows doesn't exist.
I could order the table in a new one, and add a column called RANK that orders the USER_ID, with an INDEX on it. But it will be a mess if I want to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2): your question was "Why? Isn't it supposed to be fast if it has an index?" Yes, it will use the index. Then it has to count up to row 150,000,000,000 using an index scan. Hmmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It will take time, and more time if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
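A hedged sketch of that idea (the column name is made up, and the numbering has to be redone whenever rows are added or removed):

ALTER TABLE USERS ADD COLUMN row_rank BIGINT, ADD INDEX (row_rank);

SET @r := 0;
UPDATE USERS SET row_rank = (@r := @r + 1) ORDER BY USER_ID;

-- (1) row number for a given USER_ID becomes an index lookup:
SELECT row_rank FROM USERS WHERE USER_ID = 100001366260516;
-- (2) USER_ID at a given row likewise:
SELECT USER_ID FROM USERS WHERE row_rank = 150000000;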

mysql randomizing result and optimization

I want to have randomized rows after a query, but using ORDER BY RAND() is just exhausting on a table that has 120k+ rows. I found a small solution that returns a given number of rows, but it works as if it starts from a random index and then returns the rows after it. It is pretty fast, but it only returns rows that follow some random position. The code goes like:
SELECT *
FROM lieky AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(col_0)
FROM lieky)) AS id)
AS r2
WHERE r1.col_0 >= r2.id
ORDER BY r1.col_0 ASC
LIMIT 100
I found it here: http://jan.kneschke.de/projects/mysql/order-by-rand/
Is there something that would help me?
I am trying to get randomized data into pagination, so when the user queries the database, they will always get the rows in a random order.
Thanks for the help.
It should be noted that
(SELECT (RAND() * (SELECT MAX(col_0) FROM lieky)) AS id)
can return MAX(col_0), so you'll get only 1 row (because of WHERE r1.col_0 >= r2.id).
I think a good solution would be something like:
add two columns, groupId int and seed int; add an index (groupId, seed)
every x seconds (maybe every hour, or every day) run a script that recalculates these columns (see below)
when the user opens your row list for the first time (or when you want to re-randomize the items), save a random groupId in the user's session; groupId can be from 0 to (select max(groupId) from lieky)
to show rows, use a query like: (select * from lieky where groupId=%saved groupId% order by seed limit x,100) — it should be very fast
About the recalc script: it will be rather slow (so it's a good idea to run it at night).
You can update the seed using:
update lieky set seed = rand()*1000000
Then set groupId=0 for the first N rows, groupId=1 for the following N rows, and so on.
N is the maximum number of rows you might show to a user: (max_page)*(per_page_count). A sketch of the whole script follows.
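A hedged sketch of that recalculation (N = (max_page)*(per_page_count); 1000 is assumed here):

update lieky set seed = floor(rand()*1000000);

set @i := -1;
update lieky
set groupId = (@i := @i + 1) DIV 1000  -- assumed N = 1000 rows per group
order by seed;

A user's page then reads, for example:

select * from lieky where groupId = 3 order by seed limit 0, 100;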

Referencing a MySQL sub-select in another query

I have a basic MySQL table, terms, comprised of an id and term field.
I want to create an alphabetically sorted dictionary index (in the literal sense) that would list 10 terms above the selected term, and 20 below it. An example of this can be found at http://www.urbandictionary.com/define.php?term=GD2&defid=3561357 where in the left column you see the current term highlighted, a number of terms above it, and some below, all sorted alphabetically.
As we all know, MySQL doesn't support a ROW_NUMBER() or a similar function so we end up resorting to user variables and sub-selects. I also cannot create a View with user defined variables because MySQL doesn't allow that. Here's what I managed to come up with (and it works):
SET @row_num := 0;
SELECT
@term_index := ordered.row_number
FROM
(
SELECT
@row_num := @row_num + 1 AS row_number, terms.*
FROM
terms
ORDER BY
term ASC
) AS ordered
WHERE
ordered.term = 'example term';
SET @row_num := 0;
SELECT *
FROM
(
SELECT
@row_num := @row_num + 1 AS row_number, terms.*
FROM
terms
ORDER BY
term ASC
) AS ordered
WHERE
row_number BETWEEN @term_index - 10 AND @term_index + 20
The first SELECT simply finds out the row number of our target term across the entire alphabetically sorted terms table. The second SELECT uses that information to get 10 terms above it and 20 terms below it.
I wonder if there's a way to avoid running the sub-select in the second SELECT query and instead just reference the first one (aliased as ordered). Is there a more efficient way of accomplishing this without having to manually create a temporary table? What am I doing wrong here?
Update:
See this article in my blog for performance details:
MySQL: selecting rows before and after filtered one
If your term is indexed, you can just run:
SELECT *
FROM (
SELECT *
FROM terms
WHERE term <= @myterm
ORDER BY
term DESC
LIMIT 10
) q
UNION ALL
SELECT *
FROM (
SELECT *
FROM terms
WHERE term > @myterm
ORDER BY
term
LIMIT 20
) q
ORDER BY
term
, which will be more efficient: each half of the UNION is a single index range scan that stops after at most 10 (or 20) rows, instead of numbering every row in the table.