Reset the count using "keyset pagination" - MySQL

I'm trying to use "keyset pagination"; that part is no problem. I run my query and save the last id found, to use in the next one.
My question is how to reset the count, to return to 0.
Currently, I run an additional query every time to check whether the id I saved equals SELECT MAX(id) FROM users; if it does, I reset the saved id to 0, otherwise I update it to keep the correct position.
Is there a better way?
I was thinking of something like this (it's "pseudo-SQL", just to show the idea):
SELECT 0 OR MAX(id) FROM users_table WHERE (SELECT MAX(id) FROM users_table) =/!= :actual_count
Update
Perhaps it is better to use an example:
Suppose I have 1000 entries in my table and I page through them 100 at a time, one request per page to an endpoint.
INSERT INTO util_table (`key`, `value`) VALUES ("last_visited_id", 0)
SELECT * FROM users
WHERE id >= (SELECT `value` FROM util_table WHERE `key` = "last_visited_id")
ORDER BY id ASC
LIMIT 100
After this query, I update the value of the last_visited_id key in the util table.
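For example, something like this (a sketch; :last_id stands for the id of the last row the previous query returned):
UPDATE util_table SET `value` = :last_id WHERE `key` = "last_visited_id";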
That way, on the second request, I can continue from where I left off (rows 100 to 200).
Now let's say I run the query a tenth time, so that I end up with rows 900 to 1000.
From the eleventh request onward, if I just kept saving the id value (1000..1100..1200..etc.), the query would return an empty result.
And this brings me back to my question: what is the best way to reset that key to 0?

If you are, say, displaying 10 items per 'page', SELECT 11 rows each time.
Then observe how many rows were returned by the Select:
<= 10 -- That's the 'last' page.
11 -- There are more page(s). Show 10 on this page; fetch 10 for the next page (that will include re-fetching the 11th).
There is no need to fetch MAX(id).
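A minimal sketch of the idea, assuming the users table from the question and a :last_seen_id placeholder that your application tracks (starting at 0):
SELECT *
FROM users
WHERE id > :last_seen_id
ORDER BY id ASC
LIMIT 11;
If 11 rows come back, display the first 10 and save the 10th row's id as :last_seen_id; if 10 or fewer come back, that was the last page, and :last_seen_id can be reset to 0.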
More discussion: http://mysql.rjweb.org/doc.php/pagination

Related

SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This is working well, but I started to encounter performance issues with my SELECT query.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects a single row whose id is just above the randomly generated number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my query, like this:
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that query there are no more skipped rows, but performance decreases drastically (an average of 1 s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process, and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL query that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is added to be sure that the "next" id is returned; without it, MySQL could in theory return any matching row, even the maximum id in the table every time.
The problem is gaps in the ids. It is easy to create distributions where some values are almost never chosen . . . say the four ids are 1, 2, 3, and 1000000. Your method will almost never get 1000000; the version above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query.
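For instance (a hypothetical index; adjust the table name to your schema):
CREATE INDEX idx_fk_id ON `Table` (FOREIGNKEY_ID, ID);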
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k-row table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps, and these would make the chosen values not quite uniform. However, by taking a larger sample around the chosen point, the probability of any one value being selected should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
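For completeness, the setup this approach assumes might look like the following (a sketch; the column, index, and trigger names are illustrative):
alter table t add column random double;
update t set random = rand();
create index idx_t_random on t (random);
-- the trigger mentioned above, so newly inserted rows also get a random value
create trigger t_random_bi before insert on t
for each row set new.random = rand();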
Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer IDs will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If a prepared statement can be used, this should work:
SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.

SQL Query to return most recent row per user

Here is my table structure
So I have these columns in my table:
UserId Location Lastactivity
Let's say there are 4 rows with a UserId of 1, each with a different Location:
index.php, chat.php, test.php, test1.php
Then there are also timestamps.
Let's also add one more row, with a UserId of 4, a Location of chat.php, and a time of whatever.
Time is in the timestamp format.
I want my SQL query to return one row per userid, and only the latest one: for each userid, the row that was added to the table most recently. I also don't want it to show any rows whose lastactivity is 15 or more minutes old.
In this example, just two rows would be returned.
Does anyone know what I should do?
I have tried:
SELECT * FROM session WHERE location='chat.php' GROUP BY userid
That returns two rows, but I believe that when there are multiple rows for a userid it returns an arbitrary one, and it also returns rows whose lastactivity is more than 15 minutes old.
I am using MySQL.
------MORE INFO-------
I want to query the database for all rows where location='chat.php'. I only want one row per userid, determined by the most recent submission. I also don't want any rows older than 15 minutes. Finally, I want to count the number of rows returned and put that count into a variable called testVar.
Help would be appreciated.
Essentially, what you are looking for boils down to wanting the userid and location with the most recent timestamp. So, you want to ignore all records whose last activity is not the greatest for that user.
Select Count(*) as testVar
From session s
Where s.location = 'chat.php'
  and s.lastactivity > Now() - Interval 15 Minute
  and not exists (Select 1
                  From session s2
                  Where s.userid = s2.userid
                    and s2.lastactivity > s.lastactivity);
For each record, this query checks whether there is another record for the same user with a more recent timestamp. If there is, we ignore that record; we only want the rows for which no more recent record exists. It is a little strange to think about it this way, but the logic is equivalent.
By default this query will only grab one row per user, so a group by is not necessary. (This does get a little hairy if the timestamps are exactly the same for two records; in that case, both rows will be returned.)

MySQL - RANDOMLY choose a row in a 14Millions rows table - testing does not make sense

I have been looking on the web for how to select a random row from a big table. I found various results, but after analyzing my data I figured that the best way for me to go is to count the rows and select a random one of those with LIMIT.
While testing, I started to wonder why this works:
SET @t = CEIL(RAND()*(SELECT MAX(id) FROM logo));
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=@t
ORDER BY id
LIMIT 1;
and gives random results, while this one always returns the same 4 or 5 rows?
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=CEIL(RAND()*(SELECT MAX(id) FROM logo))
ORDER BY id
LIMIT 1;
The table has MANY fields (almost 100) and quite a few indexes, with over 14 million records and counting. I almost never select a random row from the whole table; I always select based on various field values (all indexed).
Could it be a bug of my MySQL server version (5.6.13-log Source distribution)?
One possibility is that this statement in the documentation:
RAND() in a WHERE clause is re-evaluated every time the WHERE is executed.
is simply not always true. It is true when you do:
where rand() < 0.01
to get an approximate 1% sample of the rows. Perhaps the MySQL optimizer says something like "Oh, I'll evaluate the subquery to get one value back. And, just to be more efficient, I'll multiply that value by rand() before defining the constant."
If I had to guess, that would be the case.
Another possibility is that the data is arranged so that the values you are looking for have one row with a large id. Or, there could be lots of rows with small ids at the very beginning, and then a very large gap.
Your method of getting a random row, by the way, is not guaranteed to return a result when you are filtering. I don't know if that is important to you.
EDIT:
Check to see if this version works as you expect:
SELECT id
FROM logo cross join
(SELECT MAX(id) as maxid FROM logo) c
WHERE current_status_id = 29 AND
logo_type_id = 4 AND
active = 'y' AND
id >= RAND() * maxid
ORDER BY id
LIMIT 1;
If so, the problem is that the max id is being calculated and then there is an extra step of multiplying it by rand() as execution of the query begins.

SELECT last 250 rows from a table with no auto id

I know there are a few posts out there already, but some are conflicting.
I have taken on a project in which I have inherited a table with a few thousand entries.
The problem is, there is no auto-increment ID field on the table, and I have been asked to extract the last 300 rows that were entered into it.
Is it possible to extract the last 300 entries from a table? Is there a "system row id"?
The strict answer is "no" unless you have a date or something else that indicates order. Tables are inherently unordered.
In practice, you generally fetch the data back in the order you put it in. The more true the statement, "I loaded the data once, with no subsequent inserts, into a system with only one processor and one disk", the more likely that the data is actually in order.
Having a system row id would not help you, because you might have deletes and subsequent inserts. A later record would be put in an earlier page, in this case.
You have a small table. Do a select *, copy the data into a spreadsheet and do the work from there.
Alternatively, you can select the table with an increasing row number, insert into another table, and then do the select from there. Something like this pseudocode:
insert into NewTable (seqnum, cols)
select :rownum=:rownum+1, cols
from YourTable
There is a chance you'll get what you want.
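In MySQL specifically, that pseudocode might look like this (a sketch; the table and column names are illustrative):
SET @rownum := 0;
INSERT INTO NewTable (seqnum, col1, col2)
SELECT (@rownum := @rownum + 1), col1, col2
FROM YourTable;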
One last point. If you did inserts and have the log files since the inserts, you might be able to get the information from there. With a little work.
Try this:
SELECT col1, col2, ...
FROM (SELECT col1, col2, ..., (@auto := @auto + 1) AS indx
      FROM tablename, (SELECT @auto := 0) AS a
     ) AS b
ORDER BY indx DESC
LIMIT 300
If you at least have a record insertion time in the table, then you can use this. Otherwise, no.
SELECT *
FROM yourtable
ORDER BY inserted_time DESC
LIMIT 300;

Get last distinct set of records

I have a database table containing the following columns:
id code value datetime timestamp
In this table the only unique values reside in id i.e. primary key.
I want to retrieve the last distinct set of records in this table based on the datetime value. For example, let's say below is my table
id code value datetime timestamp
1 1023 23.56 2011-04-05 14:54:52 1234223421
2 1024 23.56 2011-04-05 14:55:52 1234223423
3 1025 23.56 2011-04-05 14:56:52 1234223424
4 1023 23.56 2011-04-05 14:57:52 1234223425
5 1025 23.56 2011-04-05 14:58:52 1234223426
6 1025 23.56 2011-04-05 14:59:52 1234223427
7 1024 23.56 2011-04-05 15:00:12 1234223428
8 1026 23.56 2011-04-05 15:01:14 1234223429
9 1025 23.56 2011-04-05 15:02:22 1234223430
I want to retrieve the records with IDs 4, 7, 8, and 9, i.e. the last set of records with distinct codes (based on the datetime value). These rows are simply an example of what I'm trying to achieve, as this table will eventually contain millions of records and hundreds of individual code values.
What SQL statement can I use to achieve this? I can't seem to get it done with a single SQL statement. My database is MySQL 5.
This should work for you.
SELECT *
FROM [tableName]
WHERE id IN (SELECT MAX(id) FROM [tableName] GROUP BY code)
If id is AUTO_INCREMENT, there's no need to worry about the datetime which is far more expensive to compute, as the most recent datetime will also have the highest id.
Update: From a performance standpoint, make sure the id and code columns are indexed when dealing with a large number of records. If id is the primary key, this is built in, but you may need to add a non-clustered index covering code and id.
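For instance (a hypothetical index; replace tableName with your actual table name):
CREATE INDEX idx_code_id ON tableName (code, id);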
Try this:
SELECT *
FROM <YOUR_TABLE>
WHERE (code, datetime, timestamp) IN
(
SELECT code, MAX(datetime), MAX(timestamp)
FROM <YOUR_TABLE>
GROUP BY code
)
It's an old post, but testing @smdrager's answer with large tables was very slow. My fix was to use "inner join" instead of "where in".
SELECT *
FROM [tableName] as t1
INNER JOIN (SELECT MAX(id) as id FROM [tableName] GROUP BY code) as t2
ON t1.id = t2.id
This worked really fast.
I'd try something like this:
select * from table
where id in (
select id
from table
group by code
having datetime = max(datetime)
)
(disclaimer: this is not tested)
If the row with the bigger datetime also has the bigger id, the solution proposed by smdrager is quicker.
Looks like all the existing answers suggest doing GROUP BY code on the whole table. While that's logically correct, in reality this query will scan the whole(!) table (use EXPLAIN to make sure). In my case, I have fewer than 500k rows in the table, and executing ... GROUP BY code takes 0.3 seconds, which is absolutely not acceptable.
However I can use knowledge of my data here (read as "show last comments for posts"):
I need to select just the top-20 records
Amount of records with the same code across the last X records is relatively small (~uniform distribution of comments across posts; there are no "viral" posts that get all the recent comments)
Total amount of records >> amount of available codes >> amount of "top" records you want to get
By experimenting with the numbers, I found out that I can always find 20 different codes if I select just the last 50 records. In this case the following query works (keeping in mind @smdrager's comment about the high probability of being able to use id instead of datetime):
SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50
Selecting just the last 50 entries is super quick, because it doesn't need to scan the whole table. The rest is to select the top 20 entries with distinct code out of those 50.
Obviously, queries on a set of 50 (or 100, or 500) elements are significantly faster than on a whole table with hundreds of thousands of entries.
Raw SQL "Postprocessing"
SELECT MAX(id) as id, code FROM
(SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50) AS nested
GROUP BY code
ORDER BY id DESC
LIMIT 20
This will give you the list of ids really quickly, and if you want to perform additional JOINs, put this query in as yet another nested query and perform all the joins on it.
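For example (a sketch that reuses the query above as a derived table):
SELECT t.*
FROM tablename t
INNER JOIN (SELECT MAX(id) AS id
            FROM (SELECT id, code
                  FROM tablename
                  ORDER BY id DESC
                  LIMIT 50) AS nested
            GROUP BY code
            ORDER BY id DESC
            LIMIT 20) AS top20 ON top20.id = t.id
ORDER BY t.id DESC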
Backend-side "Postprocessing"
And after that you need to process the data in your programming language to include in the final set only the records with distinct code.
Some kind of Python pseudocode:
records = select_simple_top_records(50)
added_codes = set()
top_records = []
for record in records:
    # Skip this record if its code was already found before
    if record['code'] in added_codes:
        continue
    # Save record
    top_records.append(record)
    added_codes.add(record['code'])
    # If we found all top-20 required, finish
    if len(top_records) >= 20:
        break