We're currently dealing with a slow query in an odd situation. The issue appears when we LIMIT the results to 1, 2, 3, 4, 5, or 6, but the query works with any other limit. It is also limited to this one specific user: we can't reproduce the slowness/timeouts with any other user.
We can change the ORDER BY to use a different column and the query works. We can remove the LIMIT 1 and the query works. Once we change the LIMIT to anything between 1 and 6, it times out.
We could get away with setting the ORDER BY on a different column, but that may cause reporting issues in the future and doesn't address why this is happening.
The query:
SELECT
*
FROM
table_name tn
WHERE
tn.user = '123'
ORDER BY
timestamp_col DESC
LIMIT 1
And our data:
user    timestamp_col
123 2005-02-23 02:02:34
123 2005-03-21 00:12:30
123 2006-01-09 14:23:48
123 2006-01-10 15:01:05
123 2006-01-20 13:11:13
123 2006-10-20 20:08:00
123 2006-11-01 18:31:03
123 2006-12-01 09:10:12
Are there special needs when ordering by a timestamp?
Add the composite
INDEX(user, timestamp_col)
That way the WHERE, the ORDER BY, and the LIMIT are all handled by the index, and the scan stops as soon as it has fetched the desired LIMIT.
Any single-column index needs to read lots of rows and/or sort those rows.
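For example (a minimal sketch; the index name idx_user_ts is arbitrary):
ALTER TABLE table_name ADD INDEX idx_user_ts (user, timestamp_col);
MySQL can then seek to user = '123', walk the index entries in reverse timestamp order, and stop after the first row.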
The data looks as follows:
13 users with ids 2-14. User 2 got 2 likes, user 10 got 2 likes, and user 3 got 1 like. The rest didn't get any likes.
Prisma query looks like this:
return this.prisma.user.findMany({
skip: Number(page) * Number(size),
take: Number(size),
orderBy: { likesReceived: { _count: "desc" } },
});
When I send a query to the database, ordering by likesReceived I get these responses:
page | size | item ids
0    | 5    | 2, 10, 3, 4, 14
1    | 5    | 6, 7, 8, 9, 11
2    | 5    | 12, 13, 14
User 14 appears twice, and user 5 is missing. Why?
Additional sorting by id fixes the problem:
return this.prisma.user.findMany({
skip: Number(page) * Number(size),
take: Number(size),
orderBy: [{ likesReceived: { _count: "desc" } }, { id: "asc" }],
});
Results:
page | size | item ids
0    | 5    | 2, 10, 3, 4, 5
1    | 5    | 6, 7, 8, 9, 11
2    | 5    | 12, 13, 14
When is a second field in orderBy necessary for pagination to work correctly?
I agree with the provided answer, just posting here what I replied on the Prisma repo issue directly:
What I suspect is happening here is that in the first case, where you order only by the count, the count values are not unique (if I read your description correctly, many of them have a count of 0). In that case the order of rows with the same count is not stable between requests at the database level. This is normal in the SQL world. The solution is to add a tiebreaker: a second field to order by that is ideally unique. That is what you did in the second request, and you then got a stable ordering.
So I'd say this is not a bug but expected behaviour from the database, and therefore from Prisma. The fix is the one you already found: add a second, unique field to break ties, or use that field directly as a cursor.
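In raw SQL terms the tiebreaker amounts to something like this (a sketch; the likes table and receiver_id column are assumed names, not Prisma's actual generated schema):
SELECT u.id, COUNT(l.id) AS likes_received
FROM user u
LEFT JOIN likes l ON l.receiver_id = u.id
GROUP BY u.id
ORDER BY likes_received DESC, u.id ASC  -- u.id breaks ties deterministically
LIMIT 5 OFFSET 0;
Without the second key, rows with equal counts may come back in a different order on every request, so LIMIT/OFFSET pages can overlap or skip rows.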
I had a similar issue with Laravel using SQL Server.
Laravel was doing a different query for the first page than for subsequent pages. For page 1 it used...
SELECT TOP 100 * FROM users
while for subsequent pages it used ROW_NUMBER(), something like...
SELECT * FROM (
    SELECT
        -- an OVER clause is required; (SELECT NULL) imposes no real order
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num,
        *
    FROM
        users
) u
WHERE
    row_num > 100 AND row_num <= 200;
SQL Server doesn't apply a default row order (see "Default row order in SELECT query - SQL Server 2008 vs SQL 2012"); each time it picks whatever execution plan it considers most optimized. So for the page 1 query using TOP it chose one ordering, and for page 2 with ROW_NUMBER() it chose a different one, thereby returning rows on page 2 that had already appeared on page 1. This was true even though I had many other ORDER BYs.
MySQL also appears to have no default order (see "SQL: What is the default Order By of queries?").
I don't know whether Prisma does the same thing with MySQL. Printing out the queries may shed light on whether different queries are used for different pages.
Either way, if you're using pagination it may make sense to do as you mentioned and always use id as a final ORDER BY. That way, even if your other intended ORDER BYs would allow the same record to appear on multiple pages, the final ORDER BY id ensures that doesn't occur, since you are now forcing a total order instead of letting the database choose a more optimal plan that doesn't order by id.
In your case, since user 14 has 0 likes it can be on any page after 2, 10, and 3 and still satisfy your likesReceived orderBy. But with the id order by it will always be on the last page, since the first page will now end with 4 and 5 instead of 14, thanks to the second orderBy on id.
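For illustration, a deterministic version of both page queries in SQL Server shares one total ordering (a sketch; OFFSET/FETCH needs SQL Server 2012+, and likes_count stands in for however the like count is computed):
SELECT * FROM users
ORDER BY likes_count DESC, id ASC
OFFSET 0 ROWS FETCH NEXT 5 ROWS ONLY;   -- page 0

SELECT * FROM users
ORDER BY likes_count DESC, id ASC
OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;   -- page 1
Because (likes_count, id) is unique per row, every page is carved out of the same global order.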
I am trying to query a dataset from a single table, which contains quiz answers/entries from multiple users. I want to pull out the highest scoring entry from each individual user.
My data looks like the following:
ID | TP_ID             | quiz_id | name      | num_questions | correct | incorrect | percent | created_at
 1 | 10154312970149546 | 1       | Joe       | 3             | 2       | 1         | 67      | 2015-09-20 22:47:10
 2 | 10154312970149546 | 1       | Joe       | 3             | 3       | 0         | 100     | 2015-09-21 20:15:20
 3 | 125564674465289   | 1       | Test User | 3             | 1       | 2         | 33      | 2015-09-23 08:07:18
 4 | 10153627558393996 | 1       | Bob       | 3             | 3       | 0         | 100     | 2015-09-23 11:27:02
My query looks like the following:
SELECT * FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`
ORDER BY `correct` DESC
In my mind, what that should do is get the two users from the IN clause, order their rows by the number of correct answers, and then group them, leaving me with the highest score from each of the two users.
In reality it gives me two results, but the row for Joe shows the lower of his two values (2), with Bob first with a score of 3. Switching to ASC ordering keeps the scores the same but places Joe first.
So, how could I achieve what I need?
You're after the groupwise maximum, which can be obtained by joining the grouped results back to the table:
SELECT * FROM entries NATURAL JOIN (
SELECT TP_ID, MAX(correct) correct
FROM entries
WHERE TP_ID IN ('10153627558393996', '10154312970149546')
GROUP BY TP_ID
) t
Of course, if a user has multiple records with the maximal score, it will return all of them; should you only want some subset, you'll need to express the logic for determining which.
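For instance, if the rule were "keep only the earliest entry (lowest ID) among a user's maximal scores", a sketch would be:
SELECT e.*
FROM entries e
JOIN (
    SELECT TP_ID, MAX(correct) AS max_correct
    FROM entries
    WHERE TP_ID IN ('10153627558393996', '10154312970149546')
    GROUP BY TP_ID
) t ON e.TP_ID = t.TP_ID AND e.correct = t.max_correct
WHERE e.ID = (SELECT MIN(e2.ID)
              FROM entries e2
              WHERE e2.TP_ID = e.TP_ID AND e2.correct = e.correct);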
MySQL is quite lax when it comes to GROUP BY clauses, but as a rule of thumb you should try to follow the rule that other DBMSs enforce:
In a GROUP BY query, each selected column should either be part of the GROUP BY clause or be wrapped in an aggregate function.
For your query I would suggest:
SELECT `TP_ID`,`name`,max(`correct`) FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`,`name`
Since your table seems quite denormalized, the GROUP BY name part could be omitted here, but it might be necessary in other cases.
ORDER BY only specifies the order in which results are returned; it does nothing about which results are returned. So you need to apply the MAX() function to get the highest number of right answers.
I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.
Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.
However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.
Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:
Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?
When will performance become an issue?
SELECT user.*
FROM (
SELECT id
FROM user
WHERE is_active = 1
AND deleted = 0
AND expiretime > '.time().'
AND id NOT IN (10, 13, 15)
AND id NOT IN (20, 30, 50)
AND id NOT IN (103, 140, 250)
ORDER BY RAND()
LIMIT 100
)
AS random_users
STRAIGHT_JOIN user
ON user.id = random_users.id
I strongly urge you to read this article. The last segment covers the selection of multiple random rows, and you should notice the SELECT statement in the PROCEDURE described there. That is the spot where you add your specific WHERE conditions.
The problem with ORDER BY RAND() is that the operation has a complexity of n*log2(n), while the method described in the linked article has almost constant complexity.
Let's assume that selecting a random row from a table containing 10 entries using ORDER BY RAND() takes 1 time unit:
entries | time units
-------------------------
10 | 1 /* if this takes 0.001s */
100 | 20
1'000 | 300
10'000 | 4'000
100'000 | 50'000
1'000'000 | 600'000 /* then this will need 10 minutes */
And you wrote that you are dealing with a table on the scale of millions of rows.
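For flavour, the core trick of that family of solutions looks roughly like this (a sketch, not the article's exact procedure; it assumes an AUTO_INCREMENT id without large gaps and returns one row, so you would call it repeatedly or adapt it for 100 rows):
SELECT u.*
FROM user u
JOIN (SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM user)) AS rid) r
  ON u.id >= r.rid
ORDER BY u.id
LIMIT 1;
The random id is computed once, so the server does an index seek instead of sorting the whole table.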
I'm afraid no one's going to be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally not the live one, but an exact copy). Benchmark this solution against a different solution (getting the random rows using PHP, for example) and compare the numbers to what you or your client consider "good performance". Then ramp up your data, keeping the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, I'd go for it. When (if!) it becomes a bottleneck you can look at it again, or just throw extra iron at your database...
Preprocess as much as possible
try something like this (a VB-like sketch; assumes Imports System.Text):
Dim sRND As New StringBuilder() : Dim iRandom As New Random()
Dim iMaxID As Integer = 0   ' **put your maxId here**
Dim iExpire As Integer = 0  ' **put your expire timestamp here** (the original used PHP's time())
Dim excluded As Integer() = {10, 13, 15, 20, 30, 50, 103, 140, 250}
Dim Cnt As Integer = 0
While Cnt < 100
    Dim RndVal As Integer = iRandom.Next(1, iMaxID + 1) ' upper bound is exclusive
    ' Array.IndexOf avoids the substring pitfall of "10,13,15,...".Contains(RndVal),
    ' where e.g. 5 would wrongly match "15"; note duplicate ids are still possible.
    If Array.IndexOf(excluded, RndVal) < 0 Then
        Cnt += 1
        sRND.Append("," & RndVal)
    End If
End While
Dim sql As String = String.Format("SELECT * FROM user WHERE is_active = 1 AND deleted = 0 AND expiretime > {0} AND id IN ({1}) ... LIMIT 100", iExpire, Mid(sRND.ToString(), 2))
I didn't check the syntax, but I hope you get my drift.
This will make MySQL read only the records that match the IN list and stop when it reaches 100, without the need to preprocess all records first.
Please let me know the difference in elapsed time if you try it. (I'm curious.)
I'm stumped with how to do the following purely in MySQL, and I've resorted to taking my result set and manipulating it in ruby afterwards, which doesn't seem ideal.
Here's the question. With a dataset of 'items' like:
id | state_id | price | issue_date | listed
 1 | 5        | 450   | 2011       | 1
 1 | 5        | 455   | 2011       | 1
 1 | 5        | 490   | 2011       | 1
 1 | 5        | 510   | 2012       | 0
 1 | 5        | 525   | 2012       | 1
...
I'm trying to get something like:
SELECT * FROM items
WHERE ([some conditions], e.g. issue_date >= 2011 and listed=1)
AND state_id = 5
GROUP BY id
HAVING AVG(price) <= 500
ORDER BY price DESC
LIMIT 25
Essentially I want to grab a "group" of items whose average price falls under a certain threshold. I know that my GROUP BY and HAVING above are not correct, since they just give the AVG(price) of that one item, which doesn't really make sense; I'm just trying to illustrate my desired result.
The important thing here is I want all of the individual items in my result set, I don't just want to see one row with the average price, total, etc.
Currently I'm just doing the above query without the HAVING AVG(price) clause and adding up the individual items one by one (in Ruby) until I reach the desired average. It would be really great if I could figure out how to do this in SQL. Using subqueries or something clever like joining the table onto itself are certainly acceptable solutions if they work well! Thanks!
UPDATE: In response to Tudor's answer below, here are some clarifications. There is always going to be a target quantity in addition to the target average. And we would always sort the results by price low to high, and by date.
So if we had 10 items that were all priced at $5 and we wanted to find 5 items with an average < $6, we'd simply return the first 5 items. We wouldn't return only the first one, and we wouldn't return the first 3 grouped with the last 2. That's essentially how my Ruby code works right now.
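For what it's worth, on MySQL 8.0+ that greedy prefix can be expressed directly with a window function (a sketch using the question's columns; sorting cheapest-first makes the running average non-decreasing, so the qualifying rows form a prefix):
SELECT *
FROM (
    SELECT i.*,
           AVG(price) OVER (ORDER BY price, issue_date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
    FROM items i
    WHERE issue_date >= 2011 AND listed = 1 AND state_id = 5
) t
WHERE running_avg <= 500
ORDER BY price, issue_date
LIMIT 25;
On MySQL versions without window functions, the Ruby loop remains the practical route.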
I would do almost the inverse of what Jasper provided... Start your query with your criteria to explicitly limit the few items that MAY qualify, instead of getting all items and running a sub-select on each entry, which could be the larger performance hit... I could be wrong, but here's my offering...
select
i2.*
from
( SELECT i.id
FROM items i
WHERE
i.issue_date > 2011
AND i.listed = 1
AND i.state_id = 5
GROUP BY
i.id
HAVING
AVG( i.price) <= 500 ) PreQualify
JOIN items i2
on PreQualify.id = i2.id
AND i2.issue_date > 2011
AND i2.listed = 1
AND i2.state_id = 5
order by
i2.price desc
limit
25
Not sure about the ORDER BY, especially if you wanted grouping by item... In addition, I would ensure an index on (state_id, listed, id, issue_date).
CLARIFICATION per comments
I think I AM correct on it. Don't confuse the HAVING clause with WHERE. WHERE says do or don't include a row based on certain conditions. HAVING is applied after all the WHERE clauses and the grouping are done: each group "potentially" belongs to the result set, then the HAVING condition is checked, and only if the group still qualifies is it included; otherwise it is thrown out. Try the inner query below on its own... once WITHOUT the HAVING clause, then again WITH it...
SELECT i.id, avg( i.price )
FROM items i
WHERE i.issue_date > 2011
AND i.listed = 1
AND i.state_id = 5
GROUP BY
i.id
HAVING
AVG( i.price) <= 500
As you get more into writing queries, try the parts individually to see what you are getting vs. what you are thinking... you'll find how and why certain things work. In addition, your updated question now talks about getting multiple IDs and prices at apparently both low and high ranges, yet you are also applying a limit. If you had 20 items and each had 10 qualifying records, your LIMIT 25 would show all of the first two items and 5 rows into the third... which is NOT what I think you want... you may want 25 of each qualified "id". That would wrap this query into yet another level...
What MySQL does makes perfect sense. What you want to do does not make sense:
if you have, let's say, 4 items, each with a price of 5, and you write HAVING AVG(price) <= 7, what you are saying is that the query should return ALL the combinations, like:
{1} - since item with id 1, can be a group by itself
{1,2}
{1,3}
{1,4}
{1,2,3}
{1,2,4}
...
and so on?
Your algorithm of computing the average in Ruby is also not valid: if you have items with values 5, 1, 7, 10 and seek an average value of less than 7, the element with value 10 can only be returned in a group that also contains the element with value 1. But by your algorithm (if I understood correctly), the element with value 1 is returned in the first group.
Update
What you want is something like the knapsack problem, and your approach is a kind of greedy algorithm for solving it. I don't think there is a straightforward, easy, and correct way to implement that in SQL.
After a google search, I found this article which tries to solve the knapsack problem with AI written in SQL.
By treating each item's price as a weight, and given the number of items and the desired average, you can compute the maximum total value that fits in the 'knapsack' by multiplying the desired average by the number of items.
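For example, using the numbers from the question's update: 5 items with a desired average below $6 give a knapsack capacity of 5 × $6 = $30.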
I'm not entirely sure from your question, but I think this is a solution to your problem:
SELECT * FROM items
WHERE (some "conditions", e.g. issue_date > 2011 and listed=1)
AND state_id = 5
AND id IN (SELECT id
FROM items
GROUP BY id
HAVING AVG(price) <= 500)
ORDER BY price DESC
LIMIT 25
note: This is off the top of my head and I haven't done complex SQL in a while, so it might be wrong. I think this or something like it should work, though.
For some reason, after adding 2 more ORDER BY columns, my query became very slow. Can someone help me fix it? I really need to use the following ORDER BY clause:
ORDER BY candidates.user_id DESC, candidates.usr_type ASC, all_users.user_id DESC
The main problem is that you are mixing sort directions; that is very slow in MySQL.
Either ORDER BY everything ASC or everything DESC, but don't mix them.
One solution is to define an extra mirror column for usr_type that runs in the opposite order, like this:
Example
-------------------------------
id | usr_type | alt_usr_type
 1 | 1        | 99
 2 | 2        | 98
 3 | 1        | 99
 4 | 5        | 95
Now you can define the select as
ORDER BY candidates.user_id DESC
, candidates.alt_usr_type DESC
, all_users.user_id DESC
And your query will run much faster.
See: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
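A minimal sketch of creating the mirror column (assuming usr_type stays below 100; the reversing formula 100 - usr_type is just one option, and you would keep the column in sync with a trigger or in application code):
ALTER TABLE candidates ADD COLUMN alt_usr_type INT;
UPDATE candidates SET alt_usr_type = 100 - usr_type;  -- 1 -> 99, 2 -> 98, 5 -> 95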
Make sure you have indexes on all fields you're ordering by.