When I try to get random row from table by id using RAND() function I get unexpected unstable results. The following query (where id column is primary key) returns 1, 2 or even more rows:
I tried next variant as well which produces same result:
SELECT id, word FROM words WHERE id = FLOOR(RAND() * 1000)
I found another solution for my task:
SELECT id, word FROM words ORDER BY RAND() LIMIT 1
But I want to know why MySQL behavior is so unexpected with using so elementary functionality. It scares me.
I experimented in different IDE with the same results.
The behavior is not unexpected. The RAND() function is evaluated per-row:
SELECT RAND() FROM sometable LIMIT 10
+----------------------+
| RAND() |
+----------------------+
| 0.7383128467372738 |
| 0.6141578719151746 |
| 0.8558508500976961 |
| 0.4367806654766022 |
| 0.6163508078235674 |
| 0.7714120734216757 |
| 0.0080079743713214 |
| 0.7258036823252251 |
| 0.6049945192458057 |
| 0.8475615799869984 |
+----------------------+
Keeping this in mind, this query:
SELECT * FROM words WHERE id = FLOOR(RAND() * 1000)
means that every row with id between 0 and 999 has 1/1000 probability of being SELECTed!
I need to sort a query's results by a varchar that may contain numbers, but also sort by other fields beforehand.
Say my table looks like:
+------------------------+-----------------+
| AnotherField | VarCharWithNums |
+------------------------+-----------------|
| Same Data in Every Row | Numbers 5-10 |
| Same Data in Every Row | Numbers 10-13 |
| Same Data in Every Row | Numbers 13-15 |
+------------------------+-----------------|
This query:
SELECT VarCharWithNums, AnotherField
FROM MyTable
ORDER BY CAST(VarCharWithNums AS UNSIGNED) ASC
Gives me this:
+------------------------+-----------------+
| AnotherField | VarCharWithNums |
+------------------------+-----------------|
| Same Data in Every Row | Numbers 5-10 |
| Same Data in Every Row | Numbers 10-13 |
| Same Data in Every Row | Numbers 13-15 |
+------------------------+-----------------|
This query:
SELECT VarCharWithNums, AnotherField
FROM MyTable
ORDER BY AnotherField ASC, CAST(VarCharWithNums AS UNSIGNED) ASC
Gives me this:
+------------------------+-----------------+
| AnotherField | VarCharWithNums |
+------------------------+-----------------|
| Same Data in Every Row | Numbers 10-13 |
| Same Data in Every Row | Numbers 5-10 |
| Same Data in Every Row | Numbers 13-15 |
+------------------------+-----------------|
It doesn't matter what priority I give the fields in the ORDER BY clause, it never sorts VarCharWithNums correctly when I order it alongside other fields.
You mentioned it in your last paragraph, but truly the only error in what you've shown is that, to get the sort you described, CAST(VarCharWithNums AS UNSIGNED) ASC needs to be the first thing in the ORDER BY clause. It should work, and it works when I create a contrived example on my machine.
Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`uid` = dat.`uid`
WHERE rel.`user_coach_id` = 1
ORDER BY val.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1 | 9 | 4 | 15 | foo | bar | ...
| 1 | 7 | 3 | 12 | oof | rab | ...
| 1 | 6 | 4 | 11 | ofo | abr | ...
| 1 | 4 | 4 | 5 | foo | bra | ...
(And so on. Of course time are not integers in reality but to keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id but still have those ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to only return 1 row per user, but then the ordering doesn't work and it seems kind of unpredictable which row will be in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
To limit the number of rows in each GROUP is complicated. It is probably best done with an #variable to count, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints of how to do such.
Suppose I have the following database setup (a simplified version from what I actually have):
Table: news_posting (500,000+ entries)
| --------------------------------------------------------------|
| posting_id | name | is_active | released_date | token |
| 1 | posting_1 | 1 | 2013-01-10 | 123 |
| 2 | posting_2 | 1 | 2013-01-11 | 124 |
| 3 | posting_3 | 0 | 2013-01-12 | 125 |
| --------------------------------------------------------------|
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)
Table: news_category (500 entries)
| ------------------------------|
| category_id | name |
| 1 | category_1 |
| 2 | category_2 |
| 3 | category_3 |
| ------------------------------|
PRIMARY category_id
Table: news_cat_match (1,000,000+ entries)
| ------------------------------|
| category_id | posting_id |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 2 | 2 |
| 3 | 2 |
| 1 | 3 |
| 2 | 3 |
| ------------------------------|
UNIQUE idx (category_id, posting_id)
My task is as follows. I must get a list of 50 latest news postings (at some offset) that are active, that are before today's date, and that are in one of the 20 or so categories that are specified in the request. Before I choose the 50 news postings to return, I must sort the appropriate news postings by token in descending order. My query is currently similar to the following:
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
With just one specified category_id the query does not involve a filesort and is reasonably fast because it does not have to process removal of duplicate results. However, calling EXPLAIN on the above query that has multiple category_id's returns a table that says that there is filesort to be done. And, the query is extremely slow on my data set.
Is there any way to optimize the table setup and/or the query?
I was able to get the above query to run even faster than with a single-value category list version by rewriting it as follows:
SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
SELECT ncm.posting_id
FROM news_cat_match ncm
WHERE ncm.posting_id = np.posting_id
AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
LIMIT 1
)
ORDER BY np.token DESC LIMIT 50
This now takes under a second on my data set.
The sad part is that this is even faster than if there is just one category_id specified. That's because the subset of news items is bigger than with just one category_id, so it finds the results more quickly.
Now my next question is whether this can be optimized for cases when a category has only few news that are spread in time?
The following is still pretty slow on my development machine. Although it's fast enough on the production server, I would like to optimize this if possible.
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
Does anyone have any further suggestions?
I need to sum a result that I'm getting from an existing query. And the it has to extend the current query and remain a single query
(by this I mean NOT - DO 1; DO 2; DO3;)
My current query is:
SELECT SUM((count)/(SELECT COUNT(*) FROM mobile_site_statistics WHERE campaign_id='1201' AND start_time BETWEEN CURDATE()-1 AND CURDATE())*100) AS percentage FROM mobile_site_statistics WHERE device NOT LIKE '%Pingdom%' AND campaign_id='1201' AND start_time BETWEEN (CURDATE()-1) AND CURDATE() GROUP BY device ORDER BY 1 DESC LIMIT 10;
This returns:
+------------+
| percentage |
+------------+
| 47.3813 |
| 19.7940 |
| 5.6672 |
| 5.0801 |
| 3.9603 |
| 3.8500 |
| 3.1294 |
| 2.9924 |
| 2.9398 |
| 2.7136 |
+------------+
What I need is the total of that table (total percent used by the top 10 devices)(that's all) but it has to be a single query (Has to include the initial query)(Has to be a single query due to another program that's using the query)
Is this possible? every way I have tried so far has failed. We tried temporary tables, but that turned into multiple queries.
Just do a
SELECT SUM(percentage) AS total FROM (<YOUR_QUERY>) a
and replace the sub-query <YOUR_QUERY> with your initial query