I'm working on a table with around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key from another table). Here is my actual query:
SELECT * FROM billing bill WHERE bill.billing_value not like 'not_ok%'
AND
(SELECT bill2.billing_id
FROM billing bill2
WHERE bill2.subscription_id = bill.subscription_id
ORDER BY bill2.billing_id ASC LIMIT 1
)= bill.billing_id;
This query works correctly when I put a small LIMIT on it, but I can't seem to run it over the whole table.
Is there a way I could optimise it somehow, or do things in another way?
Table structure and indexes: (screenshots not included)
This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available as of MySQL 8.0, so if you haven't upgraded yet, you'll need to do so to use this function.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a virtual column with an index to optimize that condition:
alter table billing
  add column ok tinyint(1) as (billing_value not like 'not_ok%'),
  add index (ok);
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, add a column first_ok with an index, which is true only on the rows you need to fetch (the first row per subscription_id whose billing value doesn't start with 'not_ok'). You must maintain this new column manually, and you risk it being wrong if you don't. This is a denormalized design, but tailored to the query you want to run; a sketch follows.
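A minimal sketch of that design, assuming billing_id defines "first" (the column and index names here are illustrative, not from the question):
-- Hypothetical denormalized flag; it must be maintained manually on every insert/update.
ALTER TABLE billing
  ADD COLUMN first_ok TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX (first_ok);

-- One-time backfill: flag the first "ok" row of each subscription.
UPDATE billing b
JOIN (
    SELECT subscription_id, MIN(billing_id) AS billing_id
    FROM billing
    WHERE billing_value NOT LIKE 'not_ok%'
    GROUP BY subscription_id
) f ON b.billing_id = f.billing_id
SET b.first_ok = 1;

-- The original query then collapses to an index lookup.
SELECT * FROM billing WHERE first_ok = 1;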
I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
from billing
group by subscription_id)
and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer select gets the rest of the billing record.
If performance is an issue, I'd add the billing_id field to the third index, so you get an index on (subscription_id, billing_id). This will help the inner query.
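For instance, assuming the existing index covers only subscription_id (the index name is illustrative):
-- Lets MIN(billing_id) ... GROUP BY subscription_id be resolved from the index alone.
ALTER TABLE billing ADD INDEX idx_subscription_billing (subscription_id, billing_id);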
I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This was working well, but I started to encounter some performance issues with my SELECT query.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried the following solution:
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row below the randomly generated id number. This works pretty well (an average of less than 100ms per request on 150k rows). But after my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it gets updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my query, like this:
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that query there are no more skipped rows, but performance drops drastically (an average of 1s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL query that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > R.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in the ids, and it is easy to create distributions where your method misbehaves: say the four ids are 1, 2, 3, 1000. Your method will hardly ever get 1000. The above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it returns a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query; a sketch of the index follows.
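A minimal sketch of that index, using the table and column names from the question (the index name is illustrative):
-- Lets the COUNT(...) WHERE FOREIGNKEY_ID IS NULL subquery be answered from the index.
ALTER TABLE `Table` ADD INDEX idx_fk_id (FOREIGNKEY_ID, ID);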
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and 10 after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number), randomized, and one is returned.
If you choose random numbers directly, arbitrary gaps would make the chosen values not quite uniform. By taking a larger sample around the value, the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
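The trigger itself isn't spelled out above; here is a minimal sketch, assuming the table is named t and the random column already exists (the trigger name is illustrative):
-- Populate the indexed random column automatically for new rows.
CREATE TRIGGER t_set_random
BEFORE INSERT ON t
FOR EACH ROW
SET NEW.random = RAND();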
Your ids are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the ids.
A table with records with ids 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest id, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the ids of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (< 1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the ids. To get the rest of the data, you can select records one at a time, or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If prepared statements can be used, then this should work:
SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT with an offset can be used to skip rows in a SELECT statement; for example, LIMIT 5, 1 skips five rows and returns the sixth.
I have a table with call records. Each call has a 'state' (CALLSTART and CALLEND), and each call has a unique 'callid'. Also, each record has a unique auto-increment 'id' and a MySQL TIMESTAMP field.
In a previous question I asked for a way to calculate the total of seconds of phone calls. This came to this SQL:
SELECT SUM(TIME_TO_SEC(differences))
FROM
(
SELECT SEC_TO_TIME(TIMESTAMPDIFF(SECOND,MIN(timestamp),MAX(timestamp)))as differences
FROM table
GROUP BY callid
)x
Now I would like to know how to do this, only for callid's that also have a row with the state CONNECTED.
Screenshot of table: http://imgur.com/gmdeSaY
Use a having clause:
SELECT SUM(difference)
FROM (SELECT callid, TIMESTAMPDIFF(SECOND, MIN(timestamp), MAX(timestamp)) as difference
FROM table
GROUP BY callid
HAVING SUM(state = 'Connected') > 0
) c;
If you only want the difference in seconds, I simplified the calculation a bit.
EDIT: (for Mihai)
If you put in:
HAVING state in ('Connected')
Then the value of state comes from an arbitrary row for each callid. Not all the rows, just an arbitrary one. You might or might not get lucky. As a general rule, avoid using the MySQL extension that allows "bare" columns in the select and having clauses, unless you really use the feature intentionally and carefully.
The problem
I'm looking at the ranking use case in MySQL but I still haven't settled on a definite "best solution" for it. I have a table like this:
CREATE TABLE mytable (
item_id int unsigned NOT NULL,
# some other data fields,
item_score int unsigned NOT NULL,
PRIMARY KEY (item_id),
KEY item_score (item_score)
) ENGINE=MyISAM;
with some millions records in it, and the most common write operation is to update item_score with a new value. Given an item_id and/or its score, I need to get its ranking, and I currently know two ways to accomplish that.
COUNT() items with higher scores
SELECT COUNT(*) FROM mytable WHERE item_score > $foo;
assign row numbers
SET @rownum := 0;
SELECT rank FROM (
SELECT @rownum := @rownum + 1 AS rank, item_id
FROM mytable ORDER BY item_score DESC ) AS result
WHERE item_id = $foo;
which one?
Do they perform the same or behave differently? If so, why are they different and which one should I choose?
any better idea?
Is there any better / faster approach? The only thing I can come up with is a separate table/memcache/NoSQL/whatever to store pre-calculated rankings, but I still have to sort & read out mytable every time I update it. That makes me think it would be a good approach only if the number of "read rank" queries is (much?) greater than the number of updates; on the other hand, it would be less useful as the number of "read rank" queries approaches the number of update queries.
Since you have indexes on your table, the only queries that make sense to use are:
-- findByScore
SELECT COUNT(*) FROM mytable WHERE item_score > :item_score;
-- findById
SELECT COUNT(*) FROM mytable WHERE item_score > (select item_score from mytable where item_id = :item_id);
For findById, since you only need the rank of one item_id, it is not much different from its join counterpart performance-wise.
If you need the rank of many items, then using a join is better; see the sketch below.
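A hedged sketch of that join form, using the mytable definition from the question (the alias ranking is illustrative; rank itself is a reserved word in MySQL 8.0):
-- Rank of every item: 1 + the number of items with a strictly higher score.
SELECT a.item_id,
       1 + COUNT(b.item_id) AS ranking
FROM mytable a
LEFT JOIN mytable b ON b.item_score > a.item_score
GROUP BY a.item_id;
Note this joins the table to itself, so it is only worthwhile when you need many ranks in one pass.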
Usign "assign row numbers" can not compete here because it wont make use of indexes (in your query not at all and if we would even improve that it is still not as good)
Also there may be some hidden traps using the assign indexes: if there are multiple items with same score then it will give you rank of last one.
Unrelated: And please use PDO if possible to be safe from sql injections.
If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
GROUP BY in MySQL sorts results, at least in older versions (implicit GROUP BY sorting, and the ASC/DESC syntax below, were removed in MySQL 8.0). You could even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sorts the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column.
Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...
BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.
So if you have something like:
-- This will NOT return rows unique by [id], but unique by (id, name)
SELECT DISTINCT id, name FROM some_query_with_joins

-- This will select rows unique by [id].
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that the DISTINCT keyword distinguishes rows by the first column you specify; DISTINCT applies to the whole select list.
So be careful not to take the answers above as correct for all cases... You might get confused and get the wrong results while all you wanted was to optimize!
Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!
Well, DISTINCT can be slower than GROUP BY on some occasions in Postgres (I don't know about other DBs).
Tested example:
postgres=# select count(*) from (select distinct i from g) a;
 count
-------
 10001
(1 row)

Time: 1563,109 ms

postgres=# select count(*) from (select i from g group by i) a;
 count
-------
 10001
(1 row)

Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
GROUP BY is more expensive than DISTINCT, since GROUP BY does a sort on the result while DISTINCT avoids it. But if you want to make GROUP BY yield the same result as DISTINCT, add ORDER BY NULL:
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession order by null
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query additionally shows "Using filesort" in Extra.
In MySQL, GROUP BY uses an extra step: filesort. I realize DISTINCT is faster than GROUP BY, and that was a surprise.
After heavy testing, we came to the conclusion that GROUP BY is faster:
SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 total: 0.0944 seconds
Showing records 0 - 29 (635 total, query took 0.0484 sec)
SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 total: 0.2117 seconds (almost 100% slower)
Showing records 0 - 29 (635 total, query took 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
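One possibility, as a hedged sketch (it is not from the original answer and assumes MySQL 8.0 window functions): compute the count per employer as a window, then let DISTINCT collapse the duplicate rows:
-- Every row carries its employer's total; DISTINCT keeps one row per employer.
SELECT DISTINCT u.employer,
       COUNT(u.id) OVER (PARTITION BY u.employer) AS `total employees`
FROM users u;
This mostly demonstrates that DISTINCT alone cannot aggregate; GROUP BY remains the natural tool here.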
Here is a simple approach (SQL Server syntax) that will print the elapsed time for each of the two queries.
DECLARE @t1 DATETIME;
DECLARE @t2 DATETIME;
SET @t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
SET @t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same speed as, or faster than, the equivalent GROUP BY. On some systems (e.g. Oracle), GROUP BY might be optimized to perform the same as DISTINCT for most queries. On others (such as SQL Server), DISTINCT can be considerably faster.
This is not a rule.
For each query, try DISTINCT and then GROUP BY separately, compare the time each takes to complete, and use the faster one.
In my projects I sometimes use GROUP BY and sometimes DISTINCT.
If you don't have to do any aggregate functions (SUM, AVG, etc., in case you want to add numeric data to the result), use SELECT DISTINCT. I suspect it's faster, but I have nothing to show for it.
In any case, if you're worried about speed, create an index on the column; a sketch follows.
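A minimal sketch of that index on the users table from the question (the index name is illustrative):
-- Both DISTINCT and GROUP BY on profession can then scan this index
-- instead of the whole table.
ALTER TABLE users ADD INDEX idx_profession (profession);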
If the problem allows it, try EXISTS, since it's optimized to stop as soon as a result is found (and doesn't buffer any response). So, if you are just trying to normalize data for a WHERE clause like this:
SELECT * FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep the same cardinality
A faster response would be:
SELECT * FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.
In MySQL, I have found that GROUP BY treats NULL as distinct, while DISTINCT does not.
I took the exact same DISTINCT query, removed the DISTINCT, added the selected fields as the GROUP BY, and got many more rows due to one of the fields being NULL.
So I tend to believe there is more to DISTINCT in MySQL.