Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe that would be O(1)*NUM_DISTINCT_VALUES vs O(NUM_OF_ROWS).
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, then visually you could simply count the last item in each group, so the time complexity would be tied to the number of distinct groups and not to the number of rows. The calculation would then take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, basically calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be calculated in O(1) time for each distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page on my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression that this will take longer as the number of messages grows, so instead I have a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the stats page I simply select a single row
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
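The trade-off described above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for MySQL; the messages table, its unread column and the add_message helper are assumptions made for the example, and only stats, total_messages and total_unread_messages come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, category_id INTEGER, unread INTEGER);
    CREATE TABLE stats (category_id INTEGER PRIMARY KEY,
                        total_messages INTEGER NOT NULL DEFAULT 0,
                        total_unread_messages INTEGER NOT NULL DEFAULT 0);
""")

def add_message(category_id, unread=1):
    """Hypothetical helper: insert a message and bump the counters in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO messages (category_id, unread) VALUES (?, ?)",
                     (category_id, unread))
        conn.execute("INSERT OR IGNORE INTO stats (category_id) VALUES (?)",
                     (category_id,))
        conn.execute("""UPDATE stats
                        SET total_messages = total_messages + 1,
                            total_unread_messages = total_unread_messages + ?
                        WHERE category_id = ?""", (unread, category_id))

for cat, unread in [(1, 1), (1, 0), (2, 1), (1, 1)]:
    add_message(cat, unread)

# Cached read: a single primary-key lookup in the small stats table.
cached = conn.execute(
    "SELECT total_messages, total_unread_messages FROM stats WHERE category_id = ?",
    (1,)).fetchone()

# Live read: aggregates over every message in the category.
live = conn.execute(
    "SELECT COUNT(*), SUM(unread) FROM messages WHERE category_id = ?",
    (1,)).fetchone()
```

Both reads return the same numbers here; the cached read simply moves the work to write time, so the counters must be updated everywhere a message is created, read or deleted.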
If you are doing:
select distinct column
from table
And there is an index on column, then MySQL can process this query using a "loose index scan" (described here).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
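To make the O(m * log(n)) guess concrete, here is a small model of the idea in Python. It is not MySQL's implementation: a sorted list stands in for the index, and a binary search stands in for the B-tree seek to the next key. The function name distinct_by_skipping is made up for the example.

```python
from bisect import bisect_right

def distinct_by_skipping(sorted_keys):
    """Return the distinct values of a sorted list by jumping over each run of duplicates."""
    distinct = []
    pos = 0
    n = len(sorted_keys)
    while pos < n:
        value = sorted_keys[pos]
        distinct.append(value)
        # Binary search for the first position after the run of `value`:
        # O(log n) per distinct value instead of stepping through every duplicate.
        pos = bisect_right(sorted_keys, value, pos, n)
    return distinct

# 101,005 index entries, but only 3 distinct keys, so only 3 "jumps" are needed.
index = [1] * 1000 + [2] * 5 + [7] * 100000
values = distinct_by_skipping(index)
```

The loop body runs once per distinct value (m times) and each jump costs a logarithmic search, which is where the m * log(n) estimate comes from.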
Related
I have a table that holds a domain and an id.
The query is:
select distinct domain
from user
where id = '1'
With this query, using the index idx_domain_id is faster than using idx_id_domain.
If the order of execution is
(FROM clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT
clause, ORDER BY clause)
then the query should be faster when the index puts the WHERE column first rather than the SELECT column.
The video linked below shows the same kind of query I am working on, from 15:00 to 17:00:
https://serversforhackers.com/laravel-perf/mysql-indexing-three
The table has 4.6 million rows.
time using idx_domain_id
time after change the order
This is your query:
select distinct first_name
from user
where id = '1';
You are observing that user(first_name, id) is faster than user(id, first_name).
Why might this be the case? First, this could simply be an artifact of how you are doing the timing. If your table is really small (i.e. the data fits on a single data page), then indexes are generally not very useful for improving performance.
Second, if you are only running the queries once, then the first time you run the query, you might have a "cold cache". The second time, the data is already stored in memory, so it runs faster.
Other issues can come up as well. You don't specify what the timings are. Small differences can be due to noise and might be meaningless.
You don't provide enough information to give a more definitive explanation. That would include:
Repeated timings run on cold caches.
Size information on the table and the number of matching rows.
Layout information, particularly the type of id.
Explain plans for the two queries.
select distinct domain
from user
where id = '1'
Since id is the PRIMARY KEY, there is at most one row involved. Hence, the keyword DISTINCT is useless.
And the most useful index is what you already have, PRIMARY KEY(id). It will drill down the BTree to find id='1' and deliver the value of domain that is sitting right there.
On the other hand, consider
select distinct domain
from user
where something_else = '1'
Now, the obvious index is INDEX(something_else, domain). This is optimal for the WHERE clause, and it is "covering" (meaning that all the columns needed by the query exist in the index). Swapping the columns in the index will be slower. Meanwhile, since there could be multiple rows, DISTINCT means something. However, it is not the logical thing to use.
Concerning your title question (order of columns): The = columns in the WHERE clause should come first. (More details in the link below.)
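The effect of the column order can be sketched with a toy model in Python. Sorted lists of tuples stand in for the two composite indexes; this is only an illustration of why the equality column should lead, not a description of how MySQL executes the query, and the sample rows and function names are invented.

```python
from bisect import bisect_left

# Hypothetical (something_else, domain) rows.
rows = [("1", "a.com"), ("2", "b.com"), ("1", "a.com"),
        ("3", "c.com"), ("1", "d.com"), ("2", "a.com")]

def scan_equality_first(rows, target):
    """Model of INDEX(something_else, domain): seek to the target, read one contiguous run."""
    index = sorted(rows)
    pos = bisect_left(index, (target,))  # first entry whose leading column >= target
    domains, touched = set(), 0
    while pos < len(index) and index[pos][0] == target:
        domains.add(index[pos][1])
        touched += 1
        pos += 1
    return sorted(domains), touched

def scan_domain_first(rows, target):
    """Model of INDEX(domain, something_else): matches are scattered, so every entry is read."""
    index = sorted((domain, something_else) for something_else, domain in rows)
    domains, touched = set(), 0
    for domain, something_else in index:
        touched += 1
        if something_else == target:
            domains.add(domain)
    return sorted(domains), touched

fast = scan_equality_first(rows, "1")
slow = scan_domain_first(rows, "1")
```

Both functions return the same distinct domains, but the first touches only the three entries for '1' while the second touches all six.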
DISTINCT means to gather all the matching rows, then de-duplicate them. If every row with that something_else value has the same domain, why go to that much effort when this gives the same answer:
select domain
from user
where something_else = '1'
LIMIT 1
This hits only one row, not all the 1s.
Read my Indexing Cookbook.
(And, yes, Gordon has a lot of good points.)
I'm trying to use EXPLAIN to take a closer look at my queries and see how they're running. So far, the largest id produced in an EXPLAIN has been 7, but that was a lengthy query with a lot going on. I just made another query with a structure similar to the one below, and EXPLAIN gave me a maximum id of 13. From what I know about EXPLAIN, a higher id generally means the query is less efficient or runs longer, but is this a relative rule or are there some sort of boundaries? For example, is a query with a max id of 2 seen as very efficient and a query with a max id of 13 seen as very inefficient, or is it just that 2 is more efficient than 13? Of course there's the third option: the id number has no correlation with efficiency.
ID 13 Query:
select if(cond1, subquery, if(cond2, subquery(subsubquery),
subquery(subsubquery))) as colA, if(cond1, subquery(subsubquery), if(cond2,
subquery(subsubquery), subquery(subsubquery))) as colB from TableA join
TableB on X group by y order by z desc
I've never really heard of the id number correlating to efficiency. Unless I am mistaken, it is little more than the number of tables (and derived tables) that end up being involved in processing the query.
Joining to a huge table once might make for a lower id. Joining numerous times to temp tables that are duplicates (since you can't use a temp table twice in one query), each holding a minuscule, relevant, and more appropriately indexed fraction of that huge table, is sure to increase the id count. It may still run much more quickly and efficiently, even factoring in the cost of the preceding queries needed to generate those temp tables.
This post shows some hacks to page data from DB2:
How to query range of data in DB2 with highest performance?
However, it does not provide a way to show the total number of rows (like MySQL's SQL_CALC_FOUND_ROWS).
SELECT SQL_CALC_FOUND_ROWS thread_id AS id, name, email
FROM threads WHERE email IS NOT NULL
LIMIT 20 OFFSET 200
And in MySQL I can follow that up with
SELECT FOUND_ROWS()
to get the total number of rows. The first part is fairly easy to duplicate with recent versions of DB2. I can't find any results on Google for a reasonable equivalent to the second query (I don't want temp tables, subqueries, or other absurdly inefficient solutions).
I don't think this exists in DB2.
Note that the total number of rows is a value that needs extra calculation to obtain. It isn't just lying around somewhere; it would have to be specifically built into the LIMIT logic, and it doesn't look like they did that.
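A plain fallback, when no FOUND_ROWS() equivalent is available, is to pay for that extra calculation explicitly with a second query. The sketch below uses Python's sqlite3 module purely as a stand-in database (it is not DB2, and DB2's paging syntax differs); the sample data and the fetch_page helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (thread_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# 300 made-up rows; every 4th one has no email address.
conn.executemany(
    "INSERT INTO threads (name, email) VALUES (?, ?)",
    [(f"user{i}", None if i % 4 == 0 else f"user{i}@example.com")
     for i in range(1, 301)])

def fetch_page(limit, offset):
    """Hypothetical helper: return (total matching rows, one page of rows)."""
    # Extra work: the total has to be computed over all matching rows.
    total = conn.execute(
        "SELECT COUNT(*) FROM threads WHERE email IS NOT NULL").fetchone()[0]
    page = conn.execute(
        """SELECT thread_id AS id, name, email
           FROM threads
           WHERE email IS NOT NULL
           ORDER BY thread_id
           LIMIT ? OFFSET ?""", (limit, offset)).fetchall()
    return total, page

total, page = fetch_page(20, 200)
```

The ORDER BY is added so that pages are stable from one call to the next; without it the database is free to return rows in any order.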
Using SQL Server 2012, I have a table with 7 million rows. PK column is a GUID (COMB GUID). I am trying to test the performance of a query and first need to update a random sampling of data, I want to change a column value (not the PK) of 50,000 rows.
Selecting TOP 50,000 with ORDER BY NEWID() takes way too long; I think SQL Server is scanning the whole table. I cannot seem to get the syntax right for TABLESAMPLE; it returns an empty set.
What is the best way to get this to work?
And to treat it as an update:
;WITH x AS
(
SELECT TOP (50000) col
FROM dbo.[table] TABLESAMPLE (50000 ROWS)
)
UPDATE x SET col = 'something else';
But a couple of notes:
You probably won't see a huge performance improvement over ORDER BY NEWID(). On a table with 1MM rows this took over a minute on my machine.
The TOP is there because TABLESAMPLE doesn't guarantee the exact number of rows. It's based on a rough calculation of how many pages might contain 50,000 rows. You may end up with fewer or more depending on your fill factor, how many variable-length columns there are, how many NULL values, etc. The TOP above will help limit it to 50,000 when the estimate leads to a larger number of pages being read, but it won't help if the estimate is under.
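The page-based behaviour can be illustrated with a simplified model in Python. It assumes that the requested row count is converted to a fraction of the table and that each page is then kept or skipped as a whole; the real engine's details differ, the numbers are scaled down, and the function name is invented.

```python
import random

def tablesample_rows(pages, target_rows, rng):
    """Simplified model: turn the row target into a fraction, then keep or skip whole pages."""
    total_rows = sum(len(page) for page in pages)
    fraction = min(1.0, target_rows / total_rows)
    sampled = []
    for page in pages:
        if rng.random() < fraction:  # the decision is made per page, not per row
            sampled.extend(page)
    return sampled

# 700 hypothetical pages of 100 rows each (70,000 rows in total).
pages = [list(range(p * 100, (p + 1) * 100)) for p in range(700)]

rng = random.Random(42)
sampled = tablesample_rows(pages, 5000, rng)
capped = sampled[:5000]  # plays the role of TOP (5000)
```

Because whole pages are taken, the sample size is always a multiple of the page size here and only approximately equal to the target; the slice caps an overshoot but cannot repair an undershoot.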
There is some discussion of this going on in another question right now.
I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which ids have that specific status, so that's not going to work either.
I am using MySQL 5.1.46 with the MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey   Status   StatusPkey
1      A        1
2      A        2
3      B        1
4      B        2
5      C        1
...    C        ...
n      C        m  (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit, though, since calculating the StatusPkey could get expensive.
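As a rough sketch of the idea, the secondary table can be modelled in Python as one list of pkeys per status, where a pkey's 1-based position in its list is its StatusPkey. The function names are made up, and in practice this would live in a real table rather than in memory.

```python
import random
from collections import defaultdict

def build_status_index(rows):
    """rows: iterable of (pkey, status). Position + 1 in each list is the StatusPkey."""
    by_status = defaultdict(list)
    for pkey, status in rows:
        by_status[status].append(pkey)
    return by_status

def random_pkeys(by_status, status, k, rng=random):
    """Pick up to k distinct random pkeys that have the given status."""
    pkeys = by_status.get(status, [])
    k = min(k, len(pkeys))
    # Draw random StatusPkeys in 1..m, then map them back to the real pkeys.
    status_pkeys = rng.sample(range(1, len(pkeys) + 1), k)
    return [pkeys[sp - 1] for sp in status_pkeys]

rows = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "C")]
by_status = build_status_index(rows)
picked = random_pkeys(by_status, "B", 2)
```

Because the StatusPkeys of one status have no gaps, every random draw hits an existing row, which is what makes the approach return exactly the number of rows requested (as long as enough rows exist).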
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.
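The two-query approach can be sketched end to end like this. Python's sqlite3 module stands in for MySQL, the status column and sample data are assumptions, and random_row is a hypothetical helper. The offset is drawn from 0 to count - 1, because an offset equal to the count would skip every matching row.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, status TEXT)")
# 200 made-up rows; 5% of them (every 20th) have the status 'rare'.
conn.executemany(
    "INSERT INTO MyTable (id, status) VALUES (?, ?)",
    [(i, "rare" if i % 20 == 0 else "common") for i in range(1, 201)])

def random_row(status, rng=random):
    """Hypothetical helper: return one random row with the given status, or None."""
    # Query 1: how many rows match the condition?
    count = conn.execute(
        "SELECT COUNT(*) FROM MyTable WHERE status = ?", (status,)).fetchone()[0]
    if count == 0:
        return None
    # Random offset in the half-open range [0, count).
    offset = rng.randrange(count)
    # Query 2: fetch the row at that offset among the matching rows.
    return conn.execute(
        "SELECT id, status FROM MyTable WHERE status = ? ORDER BY id LIMIT 1 OFFSET ?",
        (status, offset)).fetchone()

row = random_row("rare")
```

If rows are appended between the two queries, the count can be slightly stale; since it can only be too small in that case, the offset still points at an existing row.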