Selecting the last row WITHOUT any kind of key - MySQL

I need to get the last (newest) row in a table (using MySQL's natural order - i.e. what I get without any kind of ORDER BY clause), however there is no key I can ORDER BY on!
The only 'key' in the table is an indexed MD5 field, so I can't really ORDER BY on that. There's no timestamp, autoincrement value, or any other field that I could easily ORDER on either. This is why I'm left with only the natural sort order as my indicator of 'newest'.
And, unfortunately, changing the table structure to add a proper auto_increment is out of the question. :(
Anyone have any ideas on how this can be done w/ plain SQL, or am I SOL?

If it's MyISAM you can do it in two queries:
SELECT COUNT(*) FROM yourTable;
SELECT * FROM yourTable LIMIT useTheCountHere - 1,1;
This is unreliable, however, because:
It assumes rows are only added to this table and never deleted.
It assumes no other writes are performed to this table in the meantime (you can lock the table).
MyISAM tables can be reordered using ALTER TABLE, so that the insert order is no longer preserved.
It's not reliable at all in InnoDB, since this engine can reorder the table at will.
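The two-query approach above can be sketched with SQLite via Python (the table and column names are invented; a plain SQLite table scan happens to return rows in rowid/insertion order, standing in for MyISAM's natural order):

```python
import sqlite3

# Illustrative sketch only; SQLite stands in for MyISAM here, and the
# table/column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (md5 TEXT)")
conn.executemany("INSERT INTO yourTable (md5) VALUES (?)",
                 [("aaa",), ("bbb",), ("ccc",)])

# Query 1: count the rows.
count = conn.execute("SELECT COUNT(*) FROM yourTable").fetchone()[0]

# Query 2: skip count - 1 rows and take the next one. Without an ORDER BY
# this is only meaningful if the engine scans in insertion order.
last_row = conn.execute(
    "SELECT * FROM yourTable LIMIT 1 OFFSET ?", (count - 1,)
).fetchone()
```

Note the race window between the two statements: a concurrent insert or delete changes the count out from under the second query, which is why the answer suggests locking the table.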

Can I ask why you need to do this?
In Oracle (and possibly MySQL too) the optimiser will choose whichever access path returns your results quickest, so even if your data were static you could run the same query twice and get the rows back in a different order.

You can assign row numbers using the ROW_NUMBER() window function (available from MySQL 8.0) and then sort by that value using the ORDER BY clause. Note, though, that with an empty OVER() the numbering is not guaranteed to follow insertion order, so this carries the same caveat as the other answers.
SELECT *,
ROW_NUMBER() OVER() AS rn
FROM table
ORDER BY rn DESC
LIMIT 1;
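A quick sketch of this query using SQLite (3.25+ also supports window functions; data and names are invented). The empty OVER() simply numbers rows in whatever order the engine scans them:

```python
import sqlite3

# Sketch of the ROW_NUMBER() approach; SQLite stands in for MySQL 8.0 and
# the table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (md5 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",)])

row = conn.execute("""
    SELECT *, ROW_NUMBER() OVER () AS rn
    FROM t
    ORDER BY rn DESC
    LIMIT 1
""").fetchone()
```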

Basically, you can't do that.
Normally I'd suggest adding a surrogate primary key with auto-increment and ORDER BY that:
SELECT *
FROM yourtable
ORDER BY id DESC
LIMIT 1
But in your question you write...
changing the table structure to add a proper auto_increment is out of the question.
So another less pleasant option I can think of is using a simulated ROW_NUMBER using variables:
SELECT * FROM
(
    SELECT T1.*, @rownum := @rownum + 1 AS rn
    FROM yourtable T1, (SELECT @rownum := 0) T2
) T3
ORDER BY rn DESC
LIMIT 1
Please note that this has serious performance implications: it requires a full scan, and the rows are not guaranteed to be numbered in any particular order in the subquery. When you don't specify an order, the server is free to choose any order it likes. It will probably choose the order the rows are stored in on disk, to do as little work as possible, but relying on this is unwise.

Without an order by clause you have no guarantee of the order in which you will get your result. The SQL engine is free to choose any order.
But if for some reason you still want to rely on this order, then the following will indeed return the last record from the result (MySql only):
select *
from (select *,
             @rn := @rn + 1 rn
      from mytable,
           (select @rn := 0) init
     ) numbered
where rn = @rn
In the sub query the records are retrieved without order by, and are given a sequential number. The outer query then selects only the one that got the last attributed number.

We can use the HAVING clause for that kind of problem:
SELECT MAX(id) AS last_id, column1, column2 FROM table HAVING id = last_id;
Note that this relies on MySQL's non-standard handling of HAVING and of non-aggregated columns, and will fail with ONLY_FULL_GROUP_BY enabled.

Related

Optimizing Select SQL request with millions of entries

I'm working on a table containing around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key from another table). Here is my actual request:
SELECT * FROM billing bill
WHERE bill.billing_value not like 'not_ok%'
  AND (SELECT bill2.billing_id
       FROM billing bill2
       WHERE bill2.subscription_id = bill.subscription_id
       ORDER BY bill2.billing_id ASC LIMIT 1
      ) = bill.billing_id;
This request works correctly when I put a small LIMIT on it, but I cannot seem to run it over the whole database.
Is there a way I could optimise it somehow? Or do things in another way?
(The table indexes and structure were attached as images in the original post.)
This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available as of MySQL 8.0, so if you haven't upgraded yet, you'll need to do so to use it.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a virtual column with an index to optimize that condition:
alter table billing
  add column ok tinyint(1) as (billing_value not like 'not_ok%'),
  add index (ok);
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, adding a column first_ok with an index, which is true only on the rows you need to fetch (the first row per subscription_id without 'not_ok' as the billing value). But you must maintain this new column manually, and you risk it being wrong if you don't. This is a denormalized design, but tailored to the query you want to run.
I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
                     from billing
                     group by subscription_id)
  and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer query then fetches the rest of each billing record.
If performance is an issue, I'd add the billing_id field to the third index, so you get a composite index on (subscription_id, billing_id). This will help the inner query.
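One semantic difference from the ROW_NUMBER() version is worth noting: here the minimum billing_id per subscription is picked first and the NOT LIKE filter is applied afterwards. A small SQLite sketch (with invented rows) shows the effect:

```python
import sqlite3

# Sketch of the MIN(billing_id) subquery approach; SQLite stands in for
# MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE billing
                (billing_id INTEGER, subscription_id INTEGER, billing_value TEXT)""")
conn.executemany("INSERT INTO billing VALUES (?, ?, ?)", [
    (1, 100, "ok"),
    (2, 100, "ok_later"),
    (3, 200, "not_ok_first"),
    (4, 200, "ok"),
])

rows = conn.execute("""
    select *
    from billing
    where billing_id in (select min(billing_id)
                         from billing
                         group by subscription_id)
      and billing_value not like 'not_ok%'
""").fetchall()
# Subscription 200 disappears entirely: its first row is 'not_ok...'. The
# ROW_NUMBER() version, which filters inside the subquery, would instead
# return row 4 as the first *ok* row for that subscription.
```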

Best practice for doing select by max and column to return one record

I just learned that in my environment the only_full_group_by mode has been disabled. For example, if I wanted to get a column (name) and the max (id) of another column, and return only that one record, I would have done something like
SELECT max(id), name
FROM TABLE;
However, this only works if only_full_group_by is disabled. The alternative to this and I think most likely the better way and accepted practice to write this would be,
SELECT id, name
FROM TABLE
WHERE id = (SELECT max(id) from TABLE);
What is the correct and best practice here? I like the first perhaps because that is how I have been doing it forever and it's less code. The second way does seem to read better and is more clear in what it will return, but seems maybe slower since I am doing another SELECT statement in the WHERE.
Your first query is invalid SQL, going by the ANSI standard, and should be avoided. If you only expect a single record having the maximum id value, or, if there are ties and you don't care which single record is returned, then you may use a LIMIT query:
SELECT id, name
FROM yourTable
ORDER BY id DESC
LIMIT 1;
Otherwise, if you need all ties, then use your second version:
SELECT id, name
FROM yourTable
WHERE id = (SELECT MAX(id) FROM yourTable);
Note that as of MySQL 8+, we can also use the RANK() analytic function here to get back all ties:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY id DESC) rnk
FROM yourTable
)
SELECT id, name
FROM cte
WHERE rnk = 1;
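The RANK() approach can be sketched with SQLite (3.25+ also supports window functions; the rows are invented, with two tied on the maximum id):

```python
import sqlite3

# Sketch of RANK() tie handling; SQLite stands in for MySQL 8+ and the
# data is made up, with a tie on id = 3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?)",
                 [(1, "a"), (3, "b"), (3, "c")])

ties = conn.execute("""
    WITH cte AS (
        SELECT *, RANK() OVER (ORDER BY id DESC) AS rnk
        FROM yourTable
    )
    SELECT id, name FROM cte WHERE rnk = 1
""").fetchall()
```

Unlike the LIMIT 1 version, both tied rows come back.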

Optimization of my Custom RAND() query

I use the following query to get a random row in MySQL. I think it is faster than ORDER BY RAND(), as it just returns the row after a random count of rows and doesn't require any ordering of rows.
SELECT COUNT(ID) FROM TABLE_NAME
-- GENERATE A RANDOM NUMBER BETWEEN 0 AND COUNT(ID)-1 --
SELECT x FROM TABLE_NAME LIMIT RANDOM_NUMBER,1
But, I need to know if in any way I could optimize it more and is there a faster method.
I would also be grateful to know if I can combine the 2 queries as LIMIT doesn't support such sub-queries (As I know).
EDIT: My query does not work by randomly generating an ID. Instead it generates a random number between 0 and the total number of rows, and then uses that number as the offset to fetch the row at that position.
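The two-step count-then-offset method described above can be sketched with SQLite via Python (table contents are invented):

```python
import sqlite3
import random

# Sketch of the count-then-offset random-row method; SQLite stands in for
# MySQL and the table contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(f"row{i}",) for i in range(10)])

# Step 1: count the rows.
total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Step 2: pick a random offset in [0, total) and fetch the single row there.
offset = random.randrange(total)
row = conn.execute("SELECT val FROM t LIMIT 1 OFFSET ?", (offset,)).fetchone()
```

Note the offset scan still walks past `offset` rows, so the cost grows with the offset even though no sort is needed.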
EDIT: My answer assumes MySQL < 5.5.6, where you cannot pass a variable to LIMIT and OFFSET. Otherwise, the OP's method is the best.
The most reliable solution, imo, would be to rank your results to eliminate the gaps. My solution might not be optimal since I'm not used to MySQL, but the logic works (or worked in my SQLFiddle).
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
SET @rank = 0;
SELECT * from
    (SELECT @rank := @rank + 1 as rank, id, name
     FROM test
     order by id) derived_table
where rank = @random;
I'm not sure how this structure will hold up if you use it on a massive query, but as long as you're within a few hundred rows it should be instant.
Basically, you generate a random row number with (this is one of the places where there's most probably optimization to be made):
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
Then, you rank all of your rows to eliminate gaps :
SELECT @rank := @rank + 1 as rank, id, name
FROM test
order by id
And, you select the randomly selected row :
SELECT * from
    (ranked derived table) derived_table
where rank = @random;
I think the query you want is:
select x.*
from tablename x
where x.id >= random_number
order by x.id
limit 1;
This should use an index on x.id and should be quite fast. You can combine them as:
select x.*
from tablename x cross join
(select cast(max(id) * rand() as signed) as random_number from tablename
) c
where x.id >= random_number
order by x.id
limit 1;
Note that you should use max(id) rather than count(), because there can be gaps in the ids. The subquery should also make use of an index on id.
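The gap behaviour can be sketched with SQLite (the ids are invented and deliberately sparse):

```python
import sqlite3
import random

# Sketch of the MAX(id) * rand() technique; SQLite stands in for MySQL,
# and the ids are invented, with deliberate gaps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (5,), (9,)])

max_id = conn.execute("SELECT MAX(id) FROM t").fetchone()[0]
random_number = int(max_id * random.random())  # 0 .. max_id - 1

# The >= comparison means a hit inside a gap falls through to the next
# existing id, so ids right after a large gap are picked more often
# (the non-uniformity the EDIT below acknowledges).
row = conn.execute(
    "SELECT id FROM t WHERE id >= ? ORDER BY id LIMIT 1", (random_number,)
).fetchone()
```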
EDIT:
I won't be defensive about the above solution. It returns a random row, but the ids are not uniformly sampled: rows that follow a gap in the ids are more likely to be chosen.
My preferred method, in any case, is:
select x.*
from tablename x cross join
(select count(*) as cnt from tablename) cnt
where rand() < 100 / cnt
order by rand()
limit 1;
It is highly unlikely that you will get no rows back with this where condition (possible, but very unlikely). The final order by rand() only has to process about 100 rows, so it should go pretty fast.
There are 5 techniques in http://mysql.rjweb.org/doc.php/random . None of them have to look at the entire table.
Do you have an AUTO_INCREMENT? With or without gaps? And other questions need answering to know which technique in that link is even applicable.
Try caching the result of the first query and then using it in the second query. Running both in the same query will be very heavy on the system.
As for the second query, try the following:
SELECT x FROM TABLE_NAME WHERE ID = RANDOM_NUMBER
The above query is much faster than yours (assuming ID is indexed)
Of course, the above query assumes that you are using sequential IDs (no gaps). If there are gaps, then you will need to create another sequential field (maybe call it ID2) and then execute the above query on that field.

SQL `group by` vs. `order by` Performance

tl;dr - lots of accepted stackoverflow answers suggest using a subquery to affect the row returned by a GROUP BY clause. While this works, is it the best advice?
I understand there are many questions already about how to retrieve a specific row in a GROUP BY statement. Most of them revolve around using a subquery in the FROM clause. The subquery will order the table appropriately and the group by will be run against the now-ordered temporary table. Some examples,
MySQL order by before group by
MySQL "Group By" and "Order By"
PostgreSQL removes the need for the subquery with the distinct on() clause.
Postgresql DISTINCT ON with different ORDER BY
However, what I'm not understanding in any of these cases is how badly I'm shooting myself in the foot trying to do something the system may not have originally been designed for. Take the following two examples in PostgreSQL and MySQL,
http://sqlfiddle.com/#!15/3b0f2/1
http://sqlfiddle.com/#!2/6d337/1
In both cases I have a table of posts that contains multiple versions of the same post (signified by its UUID). I want to select the most recently published version of each post, ordered by its created_at field.
My biggest concern is that given the MySQL approach a temporary table is necessary. Ratchet this up to "web scale" (lolz) and I'm wondering if I'm in for a world of hurt. Should I rethink my schema or are there ways to optimize the subquery-parentquery relationship enough that it'll be alright?
It is definitely not the best advice. SQL itself (and the MySQL documentation, as far as I can tell) has little to say about the ordering of results from a subquery with an ORDER BY. Although they may come back ordered in practice, they are not guaranteed to be.
The more important issue is the use of "hidden columns" in the aggregation. Consider this basic query:
select t.*
from (select t.* from table t order by datecol) t
group by t.col;
Everything except t.col in the select comes from an indeterminate row. The specific documentation is (emphasis is mine):
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses.
A safe way to write such a query is:
select t.*
from table t
where not exists (select 1
                  from table t2
                  where t2.col = t.col and t2.datecol < t.datecol
                 );
This is not exactly the same, because it will return multiple rows if the minimum is not unique. The logic is "get me all rows in the table where no row has the same col value and a smaller datecol value."
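The NOT EXISTS rewrite can be sketched with SQLite (table and values are invented):

```python
import sqlite3

# Sketch of the NOT EXISTS greatest-n-per-group rewrite; SQLite stands in
# for MySQL and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT, datecol TEXT, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("a", "2020-01-01", "first-a"),
    ("a", "2020-02-01", "later-a"),
    ("b", "2020-03-01", "only-b"),
])

# Keep each row for which no row with the same col has an earlier datecol.
rows = conn.execute("""
    select t.*
    from t
    where not exists (select 1
                      from t t2
                      where t2.col = t.col and t2.datecol < t.datecol)
""").fetchall()
```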
EDIT:
The question in your comment doesn't make sense, because nothing is discussing two queries. In MySQL you can use order by with variables to solve this:
select t.*
from (select t.*,
             @rn := if(@col = col, @rn + 1, 1) as rn,
             @col := col
      from table t cross join
           (select @col := '', @rn := 0) vars
      order by col, datecol) t
where rn = 1;
This method should be faster than the order by with group by.

ranking entries in mysql table

I have a MySQL table with many rows. The table has a popularity column. If I sort by popularity, I can get the rank of each item. Is it possible to retrieve the rank of a particular item without sorting the entire table? I don't think so. Is that correct?
An alternative would be to create a new column for storing rank, sort the entire table, and then loop through all the rows and update the rank. That is extremely inefficient. Is there perhaps a way to do this in a single query?
There is no way to calculate the order (what you call rank) of something without first sorting the table or storing the rank.
If your table is properly indexed however (index on popularity) it is trivial for the database to sort this so you can get your rank. I'd suggest something like the following:
Select all, including rank
SET @rank := 0;
SELECT t.*, @rank := @rank + 1
FROM table t
ORDER BY t.popularity;
To fetch an item with a specific "id" then you can simply use a subquery as follows:
Select one, including rank
SET @rank := 0;
SELECT * FROM (
    SELECT t.*, @rank := @rank + 1
    FROM table t
    ORDER BY t.popularity
) t2
WHERE t2.id = 1;
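The single-item lookup can be sketched with SQLite via Python. SQLite has no user variables, so ROW_NUMBER() (which MySQL 8.0+ also supports) stands in for the running counter; the data is invented:

```python
import sqlite3

# Sketch of fetching one item's rank; ROW_NUMBER() replaces the @rank
# user-variable counter, and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, popularity INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 50), (2, 10), (3, 90)])

rank = conn.execute("""
    SELECT rn FROM (
        SELECT id, ROW_NUMBER() OVER (ORDER BY popularity) AS rn
        FROM items
    )
    WHERE id = 1
""").fetchone()[0]
```

The whole table is still sorted inside the subquery; the outer filter only hides that cost, it doesn't avoid it.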
You are right that the second approach is inefficient if the rank column is recalculated on every table read. However, depending on how many updates there are, you could recalculate the rank on every update and store it; that is a form of caching, turning a calculated field into a fixed-value field.
This video covers caching in MySQL, and although it is Rails-specific and a slightly different form of caching, it describes a very similar caching strategy.
If you are using an InnoDB table, you might consider building a clustered index on the popularity column (only if ORDER BY popularity is a frequent query). The decision also depends on how varied the popularity column is (a range of only 0-3 is not selective enough).
You can look at this info on clustered index to see if this works for your case: http://msdn.microsoft.com/en-us/library/ms190639.aspx
This refers to SQL server but the concept is the same, also look up mysql documentation on this.
If you're doing this using PDO then you need to modify the query to all be within the single statement in order to get it to work properly. See PHP/PDO/MySQL: Convert Multiple Queries Into Single Query
So hobodave's answer becomes something like:
SELECT t.*, (@count := @count + 1) as rank
FROM table t
CROSS JOIN (SELECT @count := 0) CONST
ORDER BY t.popularity;
hobodave's solution is very good. Alternatively, you could add a separate rank column and then, whenever a row's popularity is UPDATEd, query to determine whether that popularity update changed its ranking relative to the row above and below it, then UPDATE the 3 rows affected. You'd have to profile to see which method is more efficient.