Get last distinct set of records - MySQL

I have a database table containing the following columns:
id code value datetime timestamp
In this table, the only unique column is id, the primary key.
I want to retrieve the last distinct set of records in this table based on the datetime value. For example, let's say below is my table
id code value datetime timestamp
1 1023 23.56 2011-04-05 14:54:52 1234223421
2 1024 23.56 2011-04-05 14:55:52 1234223423
3 1025 23.56 2011-04-05 14:56:52 1234223424
4 1023 23.56 2011-04-05 14:57:52 1234223425
5 1025 23.56 2011-04-05 14:58:52 1234223426
6 1025 23.56 2011-04-05 14:59:52 1234223427
7 1024 23.56 2011-04-05 15:00:12 1234223428
8 1026 23.56 2011-04-05 15:01:14 1234223429
9 1025 23.56 2011-04-05 15:02:22 1234223430
I want to retrieve the records with IDs 4, 7, 8, and 9, i.e. the last set of records with distinct codes (based on the datetime value). These IDs are simply an example of what I'm trying to achieve, as this table will eventually contain millions of records and hundreds of individual code values.
What SQL statement can I use to achieve this? I can't seem to get it done with a single SQL statement. My database is MySQL 5.

This should work for you.
SELECT *
FROM [tableName]
WHERE id IN (SELECT MAX(id) FROM [tableName] GROUP BY code)
If id is AUTO_INCREMENT, there's no need to worry about the datetime, which is far more expensive to compute, as the most recent datetime will also have the highest id.
Update: From a performance standpoint, make sure the id and code columns are indexed when dealing with a large number of records. If id is the primary key, that index is built in, but you may need to add a secondary index covering code and id.
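For example, a hedged sketch of adding such an index (the table name readings is illustrative):
ALTER TABLE readings ADD INDEX idx_code_id (code, id);
-- with this index, the SELECT MAX(id) ... GROUP BY code subquery above can typically
-- be resolved from the index alone instead of scanning the whole table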

Try this:
SELECT *
FROM <YOUR_TABLE>
WHERE (code, datetime, timestamp) IN
(
SELECT code, MAX(datetime), MAX(timestamp)
FROM <YOUR_TABLE>
GROUP BY code
)

It's an old post, but testing #smdrager's answer with large tables was very slow. My fix for this was to use an inner join instead of "where in".
SELECT *
FROM [tableName] as t1
INNER JOIN (SELECT MAX(id) as id FROM [tableName] GROUP BY code) as t2
ON t1.id = t2.id
This worked really fast.

I'd try something like this:
select *
from table t1
where datetime = (
select max(datetime)
from table t2
where t2.code = t1.code
)
(disclaimer: this is not tested)
If the row with the larger datetime also has the larger id, the solution proposed by smdrager is quicker.

Looks like all existing answers suggest doing GROUP BY code on the whole table. While this is logically correct, in reality the query will scan the whole(!) table (use EXPLAIN to make sure). In my case, I have fewer than 500k rows in the table and executing ... GROUP BY code takes 0.3 seconds, which is absolutely not acceptable.
However I can use knowledge of my data here (read as "show last comments for posts"):
I need to select just top-20 records
Amount of records with same code across last X records is relatively small (~uniform distribution of comments across posts, there are no "viral" post which got all the recent comments)
Total amount of records >> amount of available codes >> amount of "top" records you want to get
By experimenting with the numbers I found out that I can always find 20 different codes if I select just the last 50 records. And in this case the following query works (keeping in mind #smdrager's comment about the high probability of being able to use id instead of datetime):
SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50
Selecting just last 50 entries is super quick, because it doesn't need to check the whole table. And the rest is to select top-20 with distinct code out of those 50 entries.
Obviously, queries on the set of 50 (100, 500) elements are significantly faster than on the whole table with hundreds of thousands entries.
Raw SQL "Postprocessing"
SELECT MAX(id) as id, code FROM
(SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50) AS nested
GROUP BY code
ORDER BY id DESC
LIMIT 20
This will give you the list of ids really quickly, and if you want to perform additional JOINs, put this query as yet another nested query and perform all the joins on it, as shown below.
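For instance, a hedged sketch of wrapping it so the full rows (and any further joins) come from the outer query:
SELECT t.*
FROM (
SELECT MAX(id) AS id, code FROM
(SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50) AS nested
GROUP BY code
ORDER BY id DESC
LIMIT 20
) AS top_ids
JOIN tablename AS t ON t.id = top_ids.id
ORDER BY t.id DESC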
Backend-side "Postprocessing"
And after that you need to process the data in your programming language to include to the final set only the records with distinct code.
Some kind of Python pseudocode:
records = select_simple_top_records(50)
added_codes = set()
top_records = []
for record in records:
    # Skip if a record for this code was already found before
    # (a set gives O(1) membership checks and inserts)
    if record['code'] in added_codes:
        continue
    # Save record
    top_records.append(record)
    added_codes.add(record['code'])
    # If we found all top-20 required, finish
    if len(top_records) >= 20:
        break

Related

Reset the count using "keyset pagination"

I'm using "keyset pagination", and that part is no problem: I run my query and save the last id found for the next request.
My question is how to reset the count, i.e. return to 0.
Currently, I run an additional query every time to check whether the id I saved is equal to SELECT MAX(id) FROM users; if it is, I reset the saved id to 0, otherwise I keep the saved id at the correct position.
Is there a better way?
I was thinking something like (it's a "pseudo-sql" just to show my idea):
SELECT 0 OR MAX(id) FROM users_table WHERE (SELECT MAX(id) FROM users_table) =/!= :actual_count
Update
Perhaps it is better to use an example:
Suppose I have 1000 entries in my table and I browse these 100 entries per request to an endpoint.
INSERT INTO util_table (`key`, `value`) VALUES ("last_visited_id", 0)
SELECT * FROM users
WHERE id >= (SELECT `value` FROM util_table WHERE `key` = "last_visited_id")
ORDER BY id ASC
LIMIT 100
After this query, I update the value of the last_visited_id key in the util table.
So that the second time, I can continue counting from where I left off (100 to 200).
Now let's say I redo the query a tenth time so that I end up with rows from 900 to 1000.
From the eleventh time onward, if I just kept saving the id value (1000..1100..1200..etc.), the query would return an empty result.
And with this back to my question, what is the best method to reset that key to 0?
If you are, say, displaying 10 items per 'page', SELECT 11 each time.
Then observe how many rows were returned by the Select:
<= 10 -- That's the 'last' page.
11 -- There are more page(s). Show 10 on this page; fetch 10 for the next page (that will include re-fetching the 11th).
There is no need to fetch MAX(id).
More discussion: http://mysql.rjweb.org/doc.php/pagination
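A hedged sketch of that approach, reusing the users/util_table names from the question (page size 10, so 11 rows are fetched):
SELECT *
FROM users
WHERE id > (SELECT `value` FROM util_table WHERE `key` = "last_visited_id")
ORDER BY id ASC
LIMIT 11
-- if 11 rows come back, show the first 10 and save the 10th row's id as last_visited_id;
-- if 10 or fewer come back, this is the last page, so reset last_visited_id to 0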

MySQL: optimize pagination queries [duplicate]

Scenario in short: a table with more than 16 million records [2GB in size]. The higher the LIMIT offset in the SELECT, the slower the query becomes when using ORDER BY on the primary key.
So
SELECT * FROM large ORDER BY `id` LIMIT 0, 30
takes far less than
SELECT * FROM large ORDER BY `id` LIMIT 10000, 30
Both queries order and return only 30 records either way, so it's not the overhead from ORDER BY.
Now when fetching the latest 30 rows it takes around 180 seconds. How can I optimize that simple query?
I had the exact same problem myself. Given that you want to collect a large amount of this data and not a specific set of 30, you'll probably be running a loop and incrementing the offset by 30.
So what you can do instead is:
Hold the last id of a set of data (30) (e.g. lastId = 530)
Add the condition WHERE id > lastId limit 0,30
So you can always have a ZERO offset. You will be amazed by the performance improvement.
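A hedged sketch of that loop, using the table name from the question and a batch size of 30:
-- first batch: no offset at all
SELECT * FROM large ORDER BY id ASC LIMIT 30
-- remember the last id of the batch, e.g. lastId = 530, then for each following batch:
SELECT * FROM large WHERE id > 530 ORDER BY id ASC LIMIT 30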
It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.
The query cannot go right to OFFSET because, first, the records can be of different length, and, second, there can be gaps from deleted records. It needs to check and count each record on its way.
Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:
SELECT t.*
FROM (
SELECT id
FROM mytable
ORDER BY id
LIMIT 10000, 30
) q
JOIN mytable t
ON t.id = q.id
See this article:
MySQL ORDER BY / LIMIT performance: late row lookups
MySQL cannot go directly to the 10000th record (or the 80000th byte as you're suggesting) because it cannot assume that the rows are packed/ordered like that (or that ids have continuous values from 1 to 10000). Although it might be that way in actuality, MySQL cannot assume that there are no holes/gaps/deleted ids.
So, as bobs noted, MySQL will have to fetch 10000 rows (or traverse through 10000th entries of the index on id) before finding the 30 to return.
EDIT : To illustrate my point
Note that although
SELECT * FROM large ORDER BY id LIMIT 10000, 30
would be slow(er),
SELECT * FROM large WHERE id > 10000 ORDER BY id LIMIT 30
would be fast(er), and would return the same results provided that there are no missing ids (i.e. gaps).
I found an interesting way to optimize SELECT queries with ORDER BY id LIMIT X, Y.
I have 35 million rows, so it took about 2 minutes to find a range of rows.
Here is the trick :
select id, name, address, phone
FROM customers
WHERE id > 990
ORDER BY id LIMIT 1000;
Just adding a WHERE clause with the last id you got increases the performance a lot. For me it went from 2 minutes to 1 second :)
Other interesting tricks here : http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/
It works with strings too.
The time-consuming part of the two queries is retrieving the rows from the table. Logically speaking, in the LIMIT 0, 30 version, only 30 rows need to be retrieved. In the LIMIT 10000, 30 version, 10000 rows are evaluated and 30 rows are returned. Some optimization can be done in the data-reading process, but consider the following:
What if you had a WHERE clause in the queries? The engine must return all rows that qualify, and then sort the data, and finally get the 30 rows.
Also consider the case where rows are not processed in the ORDER BY sequence. All qualifying rows must be sorted to determine which rows to return.
For those who are interested in a comparison and figures :)
Experiment 1: The dataset contains about 100 million rows. Each row contains several BIGINT, TINYINT, as well as two TEXT fields (deliberately) containing about 1k chars. The chart (not reproduced here) plots query time against offset for two methods:
Blue := SELECT * FROM post ORDER BY id LIMIT {offset}, 5
Orange := #Quassnoi's method: SELECT t.* FROM (SELECT id FROM post ORDER BY id LIMIT {offset}, 5) AS q JOIN post t ON t.id = q.id
Of course, the third method, ... WHERE id > xxx LIMIT 0, 5, does not appear here since it should be roughly constant time.
Experiment 2: Similar, except that each row only has 3 BIGINTs. The same two queries are plotted, this time in green (the blue query above) and red (the orange one).

I want to extract a random id from a MYSQL database

I am trying to extract a random article that has a picture from a database.
SELECT FLOOR(MAX(id) * RAND()) FROM `table` WHERE `picture` IS NOT NULL
My table is 33 MB big and has 1,006,394 articles but just 816 with pictures.
My problem is that this query takes 0.4640 seconds.
I need this to be much, much faster.
Any idea is welcome.
P.S.
1. Of course I have an index on id.
2. There is no index on the picture field. Should I add one?
3. The product name is unique, as is the product number, but that's out of the question.
RESULT OF TESTING SESSION:
#cHao's solution is faster when I use it to select one of the random entries with a picture (less than 0.1 sec).
But it's slower if I try to do the opposite and select a random article without a picture: 2-3 sec.
#Kickstart's solution is a bit slower when trying to find an entry with a picture, but is almost the same speed when trying to find an entry without a picture: 0.149 sec on average.
#bob-kruithof's solution doesn't work for me:
when trying to find an entry with a picture, it selects an entry without a picture.
And #ganesh-bora, yes, you are right; in my case the speed difference is about 5-15 times.
I want to thank you all for your help, and I decided for #Kickstart.
You need to get a range of values with matching records and then find a matching record within that range.
Something like this:-
SELECT r1.id
FROM `table` AS r1
INNER JOIN (
SELECT RAND( ) * ( MAX( id ) - MIN( id ) ) + MIN( id ) AS id
FROM `table`
WHERE `picture` IS NOT NULL
) AS r2
ON r1.id >= r2.id
WHERE `picture` IS NOT NULL
ORDER BY r1.id ASC
LIMIT 1
However, for any hope of efficiency you need an index on the field being checked (i.e. picture in your example).
Just an explanation of how this works.
The sub select finds a random id from the table which is between the min and max ids for records for a picture. This random id may or may not be for a picture.
The resulting id from this sub select is joined back against the main table, but using >= and with a WHERE clause specifying that the record is a picture record. Hence it joins against all picture records where the id is greater than or equal to the random id. The highest random id will be the one for the picture record with the highest id, so it will always find a record (if there are any picture records). The ORDER BY / LIMIT is then used to bring back that single id.
Note that there is an obvious flaw to this, but most of the time it will be irrelevant. The record retrieved may not be entirely random. The picture with the lowest id is unlikely to be returned (will only be returned if the RAND() returns exactly 0), but if this is important this is easy enough to fix by rounding the resulting random id. The other flaw is that if the ids are not vaguely equally distributed in the full range of ids then some will be returned more often than others. For example, take the situation where the first 1000 ids were pictures, then no more until the last (33 millionth) record. The random id could be any of those 33 million, but unless it is less than or equal to 1000 then it will be the 33 millionth record that will be returned.
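As a hedged, untested sketch, the "rounding" fix for the first flaw could replace the sub select with something like:
SELECT FLOOR( RAND( ) * ( MAX( id ) - MIN( id ) + 1 ) ) + MIN( id ) AS id
FROM `table`
WHERE `picture` IS NOT NULL
-- every integer in the [MIN(id), MAX(id)] range can now come out of the sub select,
-- so the picture with the lowest id also has a chance of being returned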
You might try attaching a random number to each row, then sorting by that. The row with the lowest number will be at the top.
SELECT `table`.`id`, RAND() as `order`
FROM `table`
WHERE `picture` IS NOT NULL
ORDER BY `order`
LIMIT 1;
This is of course slower than just magicking up an ID with RAND(), but (1) it'll always give you a valid ID (as long as there's a record with a non-null picture field in the table, anyway), and (2) the WTF ratio is pretty low; most people can tell what's going on here. :) Its performance rivals Kickstart's solution with a decently indexed table, when the number of items to select from is relatively small (around 1%). Definitely don't try to select from a whole huge table like this; limit it first with a WHERE clause on some indexed field(s).
Performancewise, if you have a long-running app (ie: not PHP; i'm talking about Java, .net, etc where the app is alive even between requests), you might try to keep a list of all the IDs of items with pictures, select a random ID from that list, and load the article. You could do that in PHP too, if you wanted. It might not work as well when you have to query all the IDs each time, but it could be very useful if you can cache the list of IDs in APC or something.
For performance, you can first add an index on the picture column so the 816 records with pictures can be picked out quickly when the query executes, and then run your query.
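A hedged sketch of adding that index, assuming picture is a reasonably short column such as a VARCHAR path (a TEXT/BLOB column would need a prefix length):
ALTER TABLE `table` ADD INDEX idx_picture (`picture`);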
How has someone else solved the problem?
I would suggest looking at this article about different possible ways of selecting random rows in MySQL.
Modified example from the article
SELECT name
FROM random JOIN
( SELECT CEIL( RAND() * (
SELECT MAX( id ) FROM random WHERE picture IS NOT NULL
) ) AS id ) AS r2 USING ( id );
This might work in your case.
Efficiency
As user Kickstart mentioned: do you have an index on the column picture? This might help you get the results a bit faster.
Are your tables optimized?
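If they are not, a hedged sketch of refreshing them, using the question's table name:
ANALYZE TABLE `table`;  -- refresh index statistics
OPTIMIZE TABLE `table`;  -- defragment the table and rebuild indexes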

mySQL select rows from a single table for each user_id which are close in timestamp

Not sure how best to word this, so please bear with me. My table (simplified) is as follows:
id - Integer - auto increment
user_id - Integer
timestamp - Datetime
What I need is the ability to query this table and select all records where the timestamp columns are within a predefined time range (potentially arbitrary, but let's say 10 minutes) for each user_id. So, for example, I would like to know if there is an entry for hypothetical user_id 5 at "2011-01-29 03:00:00" and then the next at "2011-01-29 03:02:00", but not if a user searched once at "2011-01-29 03:00:00" and then next at "2011-01-29 05:00:00". This would also need to capture instances where a user searches more than 2 times, each within the time range of the previous search.
For background, this is a table of site searches, and I would like to know all instances where a user searches for something, then searches again (presumably because their previous search did not provide the results they were looking for).
I know this is probably simpler than I am making it out to be, but I can't seem to figure it out. I can clarify or provide additional info if needed. Thanks!
EDIT:
I am interested in the search returning results for all of the users in the table, not just user #5, and also in searching without input of the actual times. The timestamp should not be something that is manually input; instead, the query should find rows by each user that are within 10 minutes of one another.
SELECT DISTINCT t1.user_id, t1.timestamp, t1.another_field
FROM
table t1,
table t2
WHERE
t1.user_id = t2.user_id
AND t1.id <> t2.id
AND abs(timestampdiff(MINUTE, t1.timestamp, t2.timestamp)) < 10
if you want to further limit the results, you can add
AND t1.user_id = any_number (or IN or between, etc)
To restrict the date range, add
AND t1.timestamp BETWEEN A and B (or > or <)
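Putting those conditions together, a hedged example (the user id and date window are illustrative):
SELECT DISTINCT t1.user_id, t1.timestamp
FROM
table t1,
table t2
WHERE
t1.user_id = t2.user_id
AND t1.id <> t2.id
AND abs(timestampdiff(MINUTE, t1.timestamp, t2.timestamp)) < 10
AND t1.user_id = 5
AND t1.timestamp BETWEEN "2011-01-29 00:00:00" AND "2011-01-30 00:00:00"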
This should give you all users and their number of searches within the time limit:
SELECT user_id, COUNT(*) AS cnt FROM table
WHERE timestamp BETWEEN "2011-01-29 03:00:00" AND "2011-01-29 03:02:00"
GROUP BY user_id
ORDER BY user_id
This will show you number of searches made just by user_id #5:
SELECT COUNT(*) AS cnt FROM table
WHERE user_id=5
AND timestamp BETWEEN "2011-01-29 03:00:00" AND "2011-01-29 03:02:00"
Depending on actual DB the syntax might be somewhat different, especially the format of dates passed to BETWEEN condition.

How can I speed up a MySQL query with a large offset in the LIMIT clause?

I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet about how the selection of the rows to show should be as compact as possible (thus: just the ids), and producing the complete results should in turn fetch all the data you want for only the rows you selected.
Thus, the SQL might be something like this (untested, I'm not sure it will actually do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statement, or it doesn't improve anything, it might be worthwhile to break this single statement into multiple statements and capture the ids into a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate sequence table if your table already has a primary key. If so, then you can order by this primary key and then use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID, so that MySQL can simply read the index and doesn't then have to locate all the data (reducing IO overhead). If you need some of the other columns, then perhaps you could add them to the index so that they are read together with the primary key (which will most likely be held in memory and therefore not require a disc lookup), although this will not be appropriate for all cases, so you will have to experiment.
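A hedged sketch of that covering-index idea, assuming the page query only needs id and name (the column name is illustrative):
ALTER TABLE myBigTable ADD INDEX idx_id_name (id, name);
-- the page query can now be served from the index alone (EXPLAIN should show "Using index"):
SELECT id, name FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC LIMIT 30;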
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. There were two parts to the fix. First I had to use an inner select in my FROM clause that did the limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the from part of my query:
// (DB::query(), the $results variable, and the closing ->get() are assumed here to complete the chain)
$results = DB::query()
    ->select(
        'titles.id',
        'title_eisbns_concat.eisbns_concat',
        'titles.pub_symbol',
        'titles.title',
        'titles.subtitle',
        'titles.contributor1',
        'titles.publisher',
        'titles.epub_date',
        'titles.ebook_price',
        'publisher_licenses.id as pub_license_id',
        'license_types.shortname',
        $coversQuery
    )
    ->from($subQuery)
    ->leftJoin('titles', 't.id', '=', 'titles.id')
    ->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
    ->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
    ->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
    ->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id')
    ->get();
The first time I created this query I used OFFSET and LIMIT in MySQL. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySQL hasn't sped up OFFSET, but BETWEEN seems to reel it back in.