The scenario: (I'm using MySQL)
Here is my schema:
CREATE TABLE so_time_diff(
OwnerUserId int(11),
time_diff int(10)
);
There are many OwnerUserId's with each OwnerUserId having many time_diff values.
I would like to pick 1000 random distinct OwnerUserIds and for each OwnerUserId, pick just one random time_difference value.
I already got 1000 distinct OwnerUserIds from else where and stored in a different table:
mysql> create table so_OwnerUserId select distinct(Id) as OwnerUserId
from so_users order by RAND() limit 1000;
I have written the following query:
select #td := time_diff from so_time_diff sotd, so_OwnerUserId soui
where sotd.OwnerUserId = soui.OwnerUserId group by sotd.OwnerUserId
order by rand() limit 1;
This doesn't seem to accomplish what I want. It obviously returns just one row. But I want one random row from each OwnerUserId's time_diff collection. Could someone guide me on how to accomplish this?
FYI - the size of dataset is huge - ~56m records. So I'm looking for an optimal query.
Any help appreciated.
Thanks!
One approach is to use a correlated subquery. It's not a very efficient approach, since that subquery will be executed for each row in the outer table, which would be a 1000 times if there are 1000 rows in so_OwnerUserId.
SELECT r.OwnerUserId
, ( SELECT d.time_diff
FROM so_time_diff d
WHERE d.OwnerUserId = r.OwnerUserId
ORDER BY RAND()
LIMIT 1
) AS random_time_diff
FROM so_OwnerUserId r
For any kind of performance, you're going to need an index with a leading column of OwnerUserId on the so_time_diff table. Better yet, a covering index
... ON so_time_diff (OwnerUserId, time_diff)
(For InnoDB, if those are the only two columns in the table, you'd want that to be the cluster key.)
Related
I have a table USERS with only one column USER_ID. These IDs are more than 200M, they are not consecutive and are not ordered. It has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google Big Query, but I haven't been able to get what I need in any of them.
I need to know how to query these 2 things:
1) Which is the row number for a particular USER_ID (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SET #row := 0;
SELECT #row := #row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It goes fast but it returns row=1 because the row counting is from the data-set.
SELECT USER_ID, #row:=#row+1 as row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes in giving a result. Why? Isn't it supposed to be fast if it has an index?
In Big Query, I didn't find the way because LIMIT init, num_rows, doesn't even exist.
I could order the table in a new one, and add a column called RANK that orders the USER_ID, with an INDEX on it. But it will be a mess if I want to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2). Your question: "Why? Isn't it supposed to be fast if it has an index?". Yes, it will use the index. Then it has to count up to row 150,000,000,000 using an index scan. Hmmm, that is being the end of the table (if it is not a typo). In any case, an index scan is quite different from doing an index lookup, which is fast. And, it will take time. And more time if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
I'm trying to optimize a query.
My question seems to be similar to MySQL, Union ALL and LIMIT and the answer might be the same (I'm afraid). However in my case there's a stricter limit (1) as well as an index on a datetime column.
So here we go:
For simplicity, let's have just one table with three: columns:
md5 (varchar)
value (varchar).
lastupdated (datetime)
There's an index on (md5, updated) so selecting on a md5 key, ordering by updated and limiting to 1 will be optimized.
The search shall return a maximum of one record matching one of 10 md5 keys. The keys have a priority. So if there's a record with prio 1 it will be preferred over any record with prio 2, 3 etc.
Currently UNION ALL is used:
select * from
(
(
select 0 prio, value
from mytable
where md5 = '7b76e7c87e1e697d08300fd9058ed1db'
order by lastupdated desc
limit 1
)
union all
(
select 1 prio, value
from mytable
where md5 = 'eb36cd1c563ffedc6adaf8b74c259723'
order by lastupdated desc
limit 1
)
) x
order by prio
limit 1;
It works, but the UNION seems to execute all 10 queries if 10 keys are provided.
However, from a business perspective, it would be ok to run the selects sequentially and stop after the first match.
Is that possible though plain SQL?
Or would the only option be a stored procedure?
There's a much better way to do this that doesn't need UNION. You really want the groupwise max for each key, with a custom ordering.
Groupwise Max
Order by FIELD()
There's no way the optimizer for UNION ALL can figure out what you're up to.
I don't know if you can do this, but suppose you had a md5prio table with the list of hash codes you know you're looking for. For example.
prio md5
0 '7b76e7c87e1e697d08300fd9058ed1db'
1 'eb36cd1c563ffedc6adaf8b74c259723'
etc
in it.
Then your query could be:
select mytable.*
from mytable
join md5prio on mytable.md5 = md5prio.md5
order by md5prio.prio, mytable.lastupdated desc
limit 1
This might save the repeated queries. You'll definitely need your index on mytable.md5. I am not sure whether your compound index on lastupdated will help; you'll need to try it.
In your case, the most efficient solution may be to build an index on (md5, lastupdated). This index should be used to resolve each subquery very efficiently (looking up the values in the index and then looking up one data page).
Unfortunately, the groupwise max referenced by Gavin will produce multiple rows when there are duplicate lastupdated values (admittedly, perhaps not a concern in your case).
There is, actually, a MySQL way to get this answer, using group_concat and substring_index:
select p.prio,
substring_index(group_concat(mt.value order by mt.lastupdated desc), ',', 1)
from mytable mt join
(select 0 as prio, '7b76e7c87e1e697d08300fd9058ed1db' as md5 union all
select 1 as prio, 'eb36cd1c563ffedc6adaf8b74c259723' as md5 union all
. . .
) p
on mt.md5 = p.md5
Currently I am using:
SELECT *
FROM
table AS t1
JOIN (
SELECT (RAND() * (SELECT MAX(id) FROM table where column_x is null)) AS id
) AS t2
WHERE
t1.id >= t2.id
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
This is normally extremely fast however when I include the highlighted column_x being Y (null) condition, it gets slow.
What would be the fastest random querying solution where the records' column X is null?
ID is PK, column X is int(4). Table contains about a million records and over 1 GB in total size doubling itself every 24 hours currently.
column_x is indexed.
Column ID may not be consecutive.
The DB engine used in this case is InnoDB.
Thank you.
Getting a genuinely random record can be slow. There's not really much getting around this fact; if you want it to be truly random, then the query has to load all the relevant data in order to know which records it has to choose from.
Fortunately however, there are quicker ways of doing it. They're not properly random, but if you're happy to trade a bit of pure randomness for speed, then they should be good enough for most purposes.
With that in mind, the fastest way to get a "random" record is to add an extra column to your DB, which is populated with a random value. Perhaps a salted MD5 hash of the primary key? Whatever. Add appropriate indexes on this column, and then simply add the column to your ORDER BY clause in the query, and you'll get your records back in a random order.
To get a single random record, simply specify LIMIT 1 and add a WHERE random_field > $random_value where random value would be a value in the range of your new field (say an MD5 hash of a random number, for example).
Of course the down side here is that although your records will be in a random order, they'll be stuck in the same random order. I did say it was trading perfection for query speed. You can get around this by updating them periodically with fresh values, but I guess that could be a problem for you if you need to keep it fresh.
The other down-side is that adding an extra column might be too much to ask if you have storage constraints and your DB is already massive in size, or if you have a strict DBA to get past before you can add columns. But again, you have to trade off something; if you want the query speed, you need this extra column.
Anyway, I hope that helped.
I don't think you need a join, nor an order by, nor a limit 1 (providing the ids are unique).
SELECT *
FROM myTable
WHERE column_x IS NULL
AND id = ROUND(RAND() * (SELECT MAX(Id) FROM myTable), 0)
Have you ran explain on the query? What was the output?
Why not store or cache the value of : SELECT MAX(id) FROM table where column_x is null and use that as a variable. your query would then become:
$rand = rand(0, $storedOrCachedMaxId);
SELECT *
FROM
table AS t1
WHERE
t1.id >= $rand
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
A simpler query will likely be easier on the db.
Know that if your data contains sizable holes - you aren't going to get consistently random results with these kind of queries.
I'm new to MySQL syntax, but digging a little further I think a dynamic query might work. We select the Nth row, where the Nth is random:
SELECT #r := CAST(COUNT(1)*RAND() AS UNSIGNED) FROM table WHERE column_x is null;
PREPARE stmt FROM
'SELECT *
FROM table
WHERE column_x is null
LIMIT 1 OFFSET ?';
EXECUTE stmt USING #r;
Currently I have a table with close to 1 million rows, which I need to query from. What I need to be able to do is stack rank packages on the number of products they include from a given list of product id's.
SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
ORDER BY commonProducts
DESC LIMIT 10
The query works fine, but I would like to improve upon it. I tried a multi-column index on productID and packageID, but it seemed to seek more rows than just having a separate index for each of the columns.
MySQL Explain
select_type: SIMPLE
table: supply
type: range
possible_keys: supplyID
key: supplyID
key_len: 3
ref: null
rows: 996
extra: Using where; Using temporary; Using filesort
My main concern is that the query is using a temporary table and filesort. How could I go about optimizing this query? I presume that the biggest issues is count() and the ORDER BY on the results of count().
You can remove the temp table using a Dependent Subquery:
select * from
(
SELECT count(productID) AS commonProducts, s.productId, s.packageID
FROM supply as s
WHERE EXISTS
(
select 1 from supply as innerS
where innerS.productID in (2,3,4,5,6,7,8,9,10)
and s.productId = innerS.productId
)
GROUP BY s.packageID
) AS t
ORDER BY t.commonProducts
DESC LIMIT 10
The inner query links to the outer query and preserves the index. You'll find that any query that sorts on commonProducts, including the above query, will use a filesort, as count(*) is definitely not indexed. But fear not, filesort is just a fancy word for sort -- mysql can choose to use an effective in-memory sort -- and whether you did it now or as a mergesort on the way to an indexed temporary table, you'll have to pay for that sorting somewhere. However, this case is pretty good because filesort will stop sorting once it hits the LIMIT you've put in place. It will not sort the entire list of commonProducts.
Update
If this query is going to be run all the time, I would recommend (without getting too fancy) setting triggers on the supply table to update a legitimate table that tracks counters like this one.
Creatng a temporary resulte set:
SELECT TMP.*
FROM ( SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
) AS TMP
ORDER BY commonProducts
DESC LIMIT 10
Perhaps it's not the most elegant way and I cannot guarantee it will be faster because everything depends on your particular data. But in some cases this gives much better results:
SELECT count(*) AS commonProducts, packageID
FROM (
SELECT packageID FROM supply WHERE productID = 2
UNION ALL
SELECT packageID FROM supply WHERE productID = 3
UNION ALL
.
.
.
SELECT packageID FROM supply WHERE productID = 10
) AS t
GROUP BY packageID
ORDER BY commonProducts DESC
LIMIT 10
I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet on how you should best make the selection of the rows to show should be as compact as possible, thus: just the ids; and producing the complete results should in turn fetch all the data you want for only the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statements, or it doesn't improve anything, against hope, it might be worthwhile to break this single statement into multiple statements and capture the ids into a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate index if your table already has one. If so, then you can order by this primary key and then use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID so that it can simply read the index and doesn't have to then locate all the data (reduce IO overhead). If you need some of the other columns then perhaps you could add these to the index so that they are read with the primary key (which will most likely be held in memory and therefore not require a disc lookup) - although this will not be appropriate for all cases so you will have to have a play.
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there is no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I have run into this problem recently. The problem was two parts to fix. First I had to use an inner select in my FROM clause that did my limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the from part of my query:
'titles.id',
'title_eisbns_concat.eisbns_concat',
'titles.pub_symbol',
'titles.title',
'titles.subtitle',
'titles.contributor1',
'titles.publisher',
'titles.epub_date',
'titles.ebook_price',
'publisher_licenses.id as pub_license_id',
'license_types.shortname',
$coversQuery
)
->from($subQuery)
->leftJoin('titles', 't.id', '=', 'titles.id')
->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id')
The first time I created this query I had used the OFFSET and LIMIT in MySql. This worked fine until I got past page 100 then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySql hasn't sped up OFFSET but between seems to reel it back in.