Referencing a MySQL sub-select in another query - mysql

I have a basic MySQL table, terms, comprised of an id and term field.
I want to create an alphabetically sorted dictionary index (in the literal sense) that would list 10 terms above the selected term and 20 below it. An example of this can be found at http://www.urbandictionary.com/define.php?term=GD2&defid=3561357 where, in the left column, you can see the current term highlighted with a number of terms above it and some below, all sorted alphabetically.
As we all know, MySQL doesn't support a ROW_NUMBER() or a similar function so we end up resorting to user variables and sub-selects. I also cannot create a View with user defined variables because MySQL doesn't allow that. Here's what I managed to come up with (and it works):
SET @row_num := 0;
SELECT
@term_index := ordered.row_number
FROM
(
SELECT
@row_num := @row_num + 1 AS row_number, terms.*
FROM
terms
ORDER BY
term ASC
) AS ordered
WHERE
ordered.term = 'example term';
SET @row_num := 0;
SELECT *
FROM
(
SELECT
@row_num := @row_num + 1 AS row_number, terms.*
FROM
terms
ORDER BY
term ASC
) AS ordered
WHERE
row_number BETWEEN @term_index - 10 AND @term_index + 20;
The first SELECT simply finds out the row number of our target term across the entire alphabetically sorted terms table. The second SELECT uses that information to get 10 terms above it and 20 terms below it.
I wonder if there's a way to avoid running the sub-select in the second SELECT query and instead just reference the first one, aliased ordered. Is there a more efficient way of accomplishing this without having to resort to manually creating a temporary table? What am I doing wrong here?

Update:
See this article in my blog for performance details:
MySQL: selecting rows before and after filtered one
If your term is indexed, you can just run:
SELECT *
FROM (
SELECT *
FROM terms
WHERE term <= @myterm
ORDER BY
term DESC
LIMIT 10
) q
UNION ALL
SELECT *
FROM (
SELECT *
FROM terms
WHERE term > @myterm
ORDER BY
term
LIMIT 20
) q
ORDER BY
term
This will be more efficient.
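Not part of the original answer, but for reference: on MySQL 8.0 and later the same neighbourhood query can be written with window functions and CTEs instead of user variables. A minimal sketch against the same terms table (the 'example term' value is taken from the question):
WITH ranked AS (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY term) AS rn
    FROM terms t
),
target AS (
    SELECT rn FROM ranked WHERE term = 'example term'
)
SELECT ranked.*
FROM ranked
CROSS JOIN target
WHERE ranked.rn BETWEEN target.rn - 10 AND target.rn + 20
ORDER BY ranked.rn;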

Related

Optimization of my Custom RAND() query

I use the following query to get a random row in MySQL. I think it is quite a bit faster than ORDER BY RAND(), as it just returns a row after skipping a random number of rows and doesn't require ordering the whole table.
SELECT COUNT(ID) FROM TABLE_NAME;
-- generate a random number between 0 and COUNT(ID)-1 in the application
SELECT x FROM TABLE_NAME LIMIT RANDOM_NUMBER, 1;
But I need to know whether I can optimize it further, and whether there is a faster method.
I would also be grateful to know whether I can combine the two queries, since LIMIT doesn't accept a subquery (as far as I know).
EDIT: The way my query works is not by randomly generating an ID. Instead, it generates a random number between 0 and the total number of rows, and then uses that number as the offset to get the row at that position.
EDIT: My answer assumes MySQL < 5.5.6, where you cannot pass a variable to LIMIT and OFFSET. Otherwise, the OP's method is the best.
The most reliable solution, imo, would be to rank your results to eliminate the gaps. My solution might not be optimal since I'm not used to MySQL, but the logic works (or worked in my SQLFiddle).
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
SET @rank = 0;
SELECT * from
(SELECT @rank := @rank + 1 as rank, id, name
FROM test
order by id) derived_table
where rank = @random;
I'm not sure how this structure will hold up if you use it on a massive table, but as long as you're within a few hundred rows it should be instant.
Basically, you generate a random row number with (this is one of the places where there's most probably room for optimization):
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
Then, you rank all of your rows to eliminate gaps :
SELECT @rank := @rank + 1 as rank, id, name
FROM test
order by id
And then you select the randomly chosen row:
SELECT * from
(ranked derived table) derived_table
where rank = @random;
I think the query you want is:
select x.*
from tablename x
where x.id >= random_number
order by x.id
limit 1;
This should use an index on x.id and should be quite fast. You can combine them as:
select x.*
from tablename x cross join
(select cast(max(id) * rand() as unsigned) as random_number from tablename
) c
where x.id >= random_number
order by x.id
limit 1;
Note that you should use max(id) rather than count(), because there can be gaps in the ids. The subquery should also make use of an index on id.
EDIT:
I won't be defensive about the above solution. It returns a random id, but the id is not uniformly distributed.
My preferred method, in any case, is:
select x.*
from tablename x cross join
(select count(*) as cnt from tablename) cnt
where rand() < 100 / cnt
order by rand()
limit 1;
It is highly, highly unlikely that you will get no rows with the where condition (it is possible, but highly unlikely). The final order by rand() is only processing 100 rows, so it should go pretty fast.
There are 5 techniques in http://mysql.rjweb.org/doc.php/random . None of them have to look at the entire table.
Do you have an AUTO_INCREMENT? With or without gaps? And other questions need answering to know which technique in that link is even applicable.
Try caching the result of the first query and then using it in the second query. Combining both into the same query will be very heavy on the system.
As for the second query, try the following:
SELECT x FROM TABLE_NAME WHERE ID = RANDOM_NUMBER
The above query is much faster than yours (assuming ID is indexed)
Of course, the above query assumes that you are using sequential IDs (no gaps). If there are gaps, then you will need to create another sequential field (maybe call it ID2) and then execute the above query on that field.
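A rough sketch of what maintaining such a sequential ID2 column could look like, following the question's placeholder names (the renumbering would have to be re-run whenever rows are deleted):
ALTER TABLE TABLE_NAME ADD COLUMN ID2 INT, ADD INDEX (ID2);

SET @n := 0;
UPDATE TABLE_NAME SET ID2 = (@n := @n + 1) ORDER BY ID;

-- then, with RANDOM_NUMBER between 1 and COUNT(*):
SELECT x FROM TABLE_NAME WHERE ID2 = RANDOM_NUMBER;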

Order a database by a field value, and then update each row with its order number

Can't think of the best way to do this.
So (example):
I have a table with 10 rows. In this table there is a column called 'Points'. Each row has a value in the points column. This so far works fine.
I now want to have a column called 'Ranking'. The aim is to order all of the rows in that table by the points field, and then update each row's 'Ranking' field with its order number / ranking created by ordering the rows by the points value.
So rows get ordered by points ascending, then I update the rows with 1-10 depending on their rank.
How do I go about doing this?
I already use a Cron job to update the points field so was going to include it in with this.
Thanks, Craig.
Example of how I would be ordering the rows:
SELECT * FROM blogs ORDER BY points ASC
For each row:
UPDATE blogs SET ranking = rankNumber WHERE blogid = blogID
Thanks. P.S. Those aren't the actual queries, just a plain-English explanation of how I imagine this working.
Perhaps this does what you want:
update blogs cross join
(select @rn := 0) vars
set ranking = (@rn := @rn + 1)
order by points;
It uses variables and order by to do the ordering inside the update.
EDIT:
You can set the variable before the update as well:
set @rn := 0;
update blogs
set ranking = (@rn := @rn + 1)
order by points;
Have you considered RANK() in SQL?
http://msdn.microsoft.com/en-us/library/ms176102.aspx
I would imagine something similar to this:
SELECT name, points
,RANK() OVER (ORDER BY points) AS Rank
FROM table
ORDER BY points
You can perhaps store this in a temp table and update the values based on the rank numbers.
However, you might have to add logic if you don't want ties to show up as the same number.
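A minimal MySQL-flavoured sketch of that temp-table idea, using the blogs(blogid, points, ranking) columns from the question (the blog_ranks table and rnk alias are illustrative; ties still get distinct numbers here):
SET @r := 0;

CREATE TEMPORARY TABLE blog_ranks AS
SELECT blogid, (@r := @r + 1) AS rnk
FROM blogs
ORDER BY points ASC;

UPDATE blogs b
JOIN blog_ranks t ON t.blogid = b.blogid
SET b.ranking = t.rnk;

DROP TEMPORARY TABLE blog_ranks;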

MySQL ranking items: is it better to count rows in a subquery or COUNT(*) WHERE foo > bar?

The problem
I'm looking at the ranking use case in MySQL but I still haven't settled on a definite "best solution" for it. I have a table like this:
CREATE TABLE mytable (
item_id int unsigned NOT NULL,
# some other data fields,
item_score int unsigned NOT NULL,
PRIMARY KEY (item_id),
KEY item_score (item_score)
) ENGINE=MyISAM;
with some millions of records in it, and the most common write operation is to update item_score with a new value. Given an item_id and/or its score, I need to get its ranking, and I currently know two ways to accomplish that.
COUNT() items with higher scores
SELECT COUNT(*) FROM mytable WHERE item_score > $foo;
assign row numbers
SET @rownum := 0;
SELECT rank FROM (
SELECT @rownum := @rownum + 1 AS rank, item_id
FROM mytable ORDER BY item_score DESC ) AS result
WHERE item_id = $foo;
which one?
Do they perform the same or behave differently? If so, why are they different and which one should I choose?
any better idea?
Is there any better / faster approach? The only thing I can come up with is a separate table/memcache/NoSQL/whatever to store pre-calculated rankings, but I still have to sort & read out mytable every time I update it. That makes me think it would be a good approach only if the number of "read rank" queries is (much?) greater than the number of updates; on the other hand, it should be less useful as the number of "read rank" queries approaches the number of update queries.
Since you have indexes on your table, the only queries that make sense to use are:
-- findByScore
SELECT COUNT(*) FROM mytable WHERE item_score > :item_score;
-- findById
SELECT COUNT(*) FROM mytable WHERE item_score > (select item_score from mytable where item_id = :item_id);
For findById, since you only need the rank of one item id, it is not much different performance-wise from its join counterpart.
If you need the rank of many items, then using a join is better.
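A hedged sketch of that join counterpart (the item ids are illustrative): for each requested item it counts how many rows have a higher score, which is its rank minus one.
SELECT a.item_id,
       COUNT(b.item_id) + 1 AS rank
FROM mytable a
LEFT JOIN mytable b ON b.item_score > a.item_score
WHERE a.item_id IN (1, 2, 3)
GROUP BY a.item_id;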
Using "assign row numbers" cannot compete here because it won't make use of indexes (in your query not at all, and even if we improved that, it would still not be as good).
There may also be hidden traps in the row-number approach: if there are multiple items with the same score, it will give you the rank of the last one.
Unrelated: please use PDO (or prepared statements) if possible, to be safe from SQL injection.

MySQL - Average most recent columns in other table

I have two tables: "servers" and "stats"
servers has a column called "id" that auto-increments.
stats has a column called "server" that corresponds to a row in the servers table, a column called "time" that represents the time it was added, and a column called "votes" that I would like to get the average of.
I would like to fetch all the servers (SELECT * FROM servers) along with the average votes of the 24 most recent rows that correspond to each server. I believe this is a "greatest-n-per-group" question.
This is what I tried to do, but it gave me 24 rows total, not 24 rows per group:
SELECT servers.*,
IFNULL(AVG(stats.votes), 0) AS avgvotes
FROM servers
LEFT OUTER JOIN
(SELECT server,
votes
FROM stats
GROUP BY server
ORDER BY time DESC LIMIT 24) AS stats ON servers.id = stats.server
GROUP BY servers.id
Like I said, I would like to get the 24 most recent rows for each server, not 24 most recent rows total.
Thanks for this great post.
ALTER TABLE stats ADD INDEX (server, time);
set @num := 0, @server := '';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from stats force index(server)
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
Edit 1
I just noticed that the above query gives the oldest 24 records for each group.
set @num := 0, @server := '';
select servers.*, IFNULL(AVG(stats.votes), 0) AS avgvotes
from servers left outer join (
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25) as stats
on servers.id = stats.server
group by servers.id
which will give the average of the 24 newest entries for each group.
Edit 2
@DrAgonmoray
You can try the inner query part first and see if it returns the newest 24 records for each group. In my MySQL 5.5, it works correctly.
select server,
time,votes,
@num := if(@server = server, @num + 1, 1) as row_number,
@server := server as dummy
from (select * from stats order by server, time desc) as t
group by server, time
having row_number < 25
This is another approach.
This query is going to suffer the same performance problems as other queries here that return correct results, because the execution plan for this query is going to require a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table will be considered. For a REALLY large stats table, this is going to blow out all available temporary space before it dies a horrible death. (More notes on performance below.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, AVG(t.votes) AS avg_votes
FROM ( SELECT CASE WHEN u.server = @last_server
THEN @i := @i + 1
ELSE @i := 1
END AS i
, @last_server := u.server AS `server`
, u.votes AS votes
FROM (SELECT @i := 0, @last_server := NULL) i
JOIN ( SELECT v.server, v.votes
FROM stats v
ORDER BY v.server DESC, v.time DESC
) u
) t
WHERE t.i <= 24
GROUP BY t.server
) s
ON s.server = r.id
What this query is doing is sorting the stats table, by server and by descending order on the time column. (Inline view aliased as u.)
With the sorted result set, we assign row numbers 1, 2, 3, etc. to each row for each server. (Inline view aliased as t.)
With that result set, we filter out any rows with a rownumber > 24, and we calculate an average of the votes column for the "latest" 24 rows for each server. (Inline view aliased as s.)
As a final step, we join that to the servers table, to return the requested resultset.
NOTE:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.
To improve performance, there are several approaches we could take.
The simplest might be to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (e.g. rows with time values over 2 days old, or over 2 weeks old). That would significantly reduce the number of rows that need to be sorted to determine the "latest" 24 rows.
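For example, the inline view aliased as u could be restricted like this (the 2-day cutoff is an assumption to be tuned to the data; servers with no rows inside the window would then get no average):
SELECT v.server, v.votes
FROM stats v
WHERE v.time >= NOW() - INTERVAL 2 DAY
ORDER BY v.server DESC, v.time DESC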
Also, with an index on stats(server,time), it's also possible that MySQL could do a relatively efficient "reverse scan" on the index, avoiding a sort operation.
We could also consider implementing an index on the stats table on (server, "reverse_time"). Since MySQL doesn't yet support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value, a "reverse time" expression that is ascending for descending values of time (for example, -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND, '1970-01-01', my_datetime)).
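A sketch of that idea, assuming the application (or a trigger) can keep the derived column in sync; the rtime column and index name are illustrative:
ALTER TABLE stats ADD COLUMN rtime BIGINT;
UPDATE stats SET rtime = -1 * UNIX_TIMESTAMP(time);
ALTER TABLE stats ADD INDEX stats_ix_rtime (server, rtime);

-- an ascending scan of (server, rtime) now visits each server's rows newest-first
SELECT server, time, votes
FROM stats
WHERE server = 42   -- illustrative server id
ORDER BY rtime
LIMIT 24;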
Another approach to improve performance would be to keep a shadow table containing the most recent 24 rows for each server. That would be simplest to implement if we can guarantee that "latest rows" won't be deleted from the stats table. We could maintain that table with a trigger: basically, whenever a row is inserted into the stats table, we check whether the time on the new row is later than the earliest time stored for that server in the shadow table; if it is, we replace the earliest row in the shadow table with the new row, being sure to keep no more than 24 rows in the shadow table for each server.
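A rough sketch of such a trigger-maintained shadow table (all names and column types here are assumptions, and the pruning logic is simplified):
CREATE TABLE stats_latest (
  server INT NOT NULL,
  time   DATETIME NOT NULL,
  votes  INT,
  INDEX (server, time)
);

DELIMITER //
CREATE TRIGGER stats_after_insert AFTER INSERT ON stats
FOR EACH ROW
BEGIN
  INSERT INTO stats_latest (server, time, votes)
  VALUES (NEW.server, NEW.time, NEW.votes);
  -- drop anything older than the 24th-newest row for this server
  DELETE FROM stats_latest
  WHERE server = NEW.server
    AND time < (SELECT t FROM (SELECT time AS t
                               FROM stats_latest
                               WHERE server = NEW.server
                               ORDER BY time DESC
                               LIMIT 23, 1) k);
END//
DELIMITER ;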
And yet another approach is to write a procedure or function that gets the result. The approach here would be to loop through each server, run a separate query against the stats table to get the average votes for its latest 24 rows, and gather all of those results together. (That approach might really be more of a workaround for avoiding a sort on a huge temporary set, just to enable the resultset to be returned, not necessarily making the return of the resultset blazingly fast.)
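A hedged sketch of that procedural approach (the procedure, cursor, and temporary table names are illustrative):
DELIMITER //
CREATE PROCEDURE server_avg_votes()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_id INT;
  DECLARE cur CURSOR FOR SELECT id FROM servers;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  DROP TEMPORARY TABLE IF EXISTS server_avgs;
  CREATE TEMPORARY TABLE server_avgs (server INT, avg_votes DECIMAL(12,4));

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_id;
    IF done THEN
      LEAVE read_loop;
    END IF;
    -- per-server query: average over the 24 newest rows only
    INSERT INTO server_avgs
    SELECT v_id, IFNULL(AVG(q.votes), 0)
    FROM (SELECT votes
          FROM stats
          WHERE server = v_id
          ORDER BY time DESC
          LIMIT 24) q;
  END LOOP;
  CLOSE cur;

  SELECT r.*, a.avg_votes
  FROM servers r
  JOIN server_avgs a ON a.server = r.id;
END//
DELIMITER ;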
The bottom line for performance of this type of query on a LARGE table is restricting the number of rows considered by the query AND avoiding a sort operation on a large set. That's how we get a query like this to perform.
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats ordered using an index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query above previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even specifying the FORCE INDEX FOR ORDER BY (stats_ix1) hint.
If the requirement is to return an 'average votes' for a server only if there are at least 24 associated rows in the stats table, then we can make a more efficient query, even if it is a bit more messy. (Most of the messiness in the nested IF() functions is to deal with NULL values, which do not get included in the average. It can be much less messy if we have a guarantee that votes is NOT NULL, or if we exclude any rows where votes is NULL.)
SELECT r.*
, IFNULL(s.avg_votes,0)
FROM servers r
LEFT
JOIN ( SELECT t.server
, t.tot/NULLIF(t.cnt,0) AS avg_votes
FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
, @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
, @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0) ,@tot := 0),@tot := IFNULL(v.votes,0) ) AS tot
, @last_server := v.server AS SERVER
-- , v.time
-- , v.votes
-- , @tot/NULLIF(@cnt,0) AS avg_sofar
FROM (SELECT @last_server := NULL, @num := 0, @cnt := 0, @tot := 0) u
JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
ORDER BY v.server DESC, v.time DESC
) t
WHERE t.num = 24
) s
ON s.server = r.id
With a covering index on stats(server,time,votes), the EXPLAIN showed MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Absent the covering index, with an index on (server,time), MySQL used the index when I included an index hint, and with the FORCE INDEX FOR ORDER BY (stats_ix1) hint, MySQL avoided a filesort as well. (But since my table had less than 100 rows, I don't think MySQL put much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions are commented out (in the inline view aliased as t); they aren't needed, but they are for debugging.
The way that query stands, it needs at least 24 rows in stats for each server, in order to return an average. (That may be acceptable.) But I was thinking that in general, we could return a running total, total so far (tot) and a running count (cnt).
(If we replace the WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)
To return the average where there aren't at least 24 rows in stats, that's really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.
Try this solution, with the top-n-per-group technique in the INNER JOIN subselect credited to Bill Karwin and his post about it here.
SELECT
a.*,
AVG(b.votes) AS avgvotes
FROM
servers a
INNER JOIN
(
SELECT
aa.server,
aa.votes
FROM
stats aa
LEFT JOIN stats bb ON
aa.server = bb.server AND
aa.time < bb.time
GROUP BY
aa.server, aa.time
HAVING
COUNT(*) < 24
) b ON a.id = b.server
GROUP BY
a.id

Selecting last row WITHOUT any kind of key

I need to get the last (newest) row in a table (using MySQL's natural order - i.e. what I get without any kind of ORDER BY clause), however there is no key I can ORDER BY on!
The only 'key' in the table is an indexed MD5 field, so I can't really ORDER BY on that. There's no timestamp, autoincrement value, or any other field that I could easily ORDER on either. This is why I'm left with only the natural sort order as my indicator of 'newest'.
And, unfortunately, changing the table structure to add a proper auto_increment is out of the question. :(
Anyone have any ideas on how this can be done w/ plain SQL, or am I SOL?
If it's MyISAM you can do it in two queries
SELECT COUNT(*) FROM yourTable;
SELECT * FROM yourTable LIMIT useTheCountHere - 1,1;
This is unreliable however because
It assumes rows are only added to this table and never deleted.
It assumes no other writes are performed to this table in the meantime (you can lock the table)
MyISAM tables can be reordered using ALTER TABLE, so that the insert order is no longer preserved.
It's not reliable at all in InnoDB, since this engine can reorder the table at will.
Can I ask why you need to do this?
In Oracle, and possibly MySQL too, the optimiser will choose the quickest access path / order in which to return your results. So even if your data were static, there is potential to run the same query twice and get a different answer.
You can assign row numbers using the ROW_NUMBER() window function (MySQL 8.0+) and then sort by this value using the ORDER BY clause.
SELECT *,
ROW_NUMBER() OVER() AS rn
FROM table
ORDER BY rn DESC
LIMIT 1;
Basically, you can't do that.
Normally I'd suggest adding a surrogate primary key with auto-incrememt and ORDER BY that:
SELECT *
FROM yourtable
ORDER BY id DESC
LIMIT 1
But in your question you write...
changing the table structure to add a proper auto_increment is out of the question.
So another less pleasant option I can think of is using a simulated ROW_NUMBER using variables:
SELECT * FROM
(
SELECT T1.*, @rownum := @rownum + 1 AS rn
FROM yourtable T1, (SELECT @rownum := 0) T2
) T3
ORDER BY rn DESC
LIMIT 1
Please note that this has serious performance implications: it requires a full scan, and the results are not guaranteed to be returned in any particular order in the subquery - you might get them in insertion order, but then again you might not; when you don't specify the order, the server is free to choose any order it likes. Now it will probably choose the order they are stored in on disk, in order to do as little work as possible, but relying on this is unwise.
Without an order by clause you have no guarantee of the order in which you will get your result. The SQL engine is free to choose any order.
But if for some reason you still want to rely on this order, then the following will indeed return the last record from the result (MySql only):
select *
from (select *,
@rn := @rn + 1 rn
from mytable,
(select @rn := 0) init
) numbered
where rn = @rn
In the sub query the records are retrieved without order by, and are given a sequential number. The outer query then selects only the one that got the last attributed number.
We can use HAVING for that kind of problem:
SELECT MAX(id) AS last_id, column1, column2 FROM table HAVING id = last_id;