I need to get the number of distinct values in every column of a table. So I wonder whether a query like
select count(col1), count(col2),.., count(colN) from table;
will scan the whole table N times to get all these counts? If so, would it be better to use whatever procedural features the particular DBMS offers to create an array 1..N holding the count for every column, then loop over the table's records and increment the array elements?
I understand that this depends entirely on the DBMS implementation, so I'd like to know the answer specifically for MySQL (but info about other popular systems is interesting too).
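Note that count(col) counts the non-NULL values in a column, not the distinct ones. For distinct values you will need to do: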
select count(distinct col1),
count(distinct col2),
...
from table;
and the database should just do a single full-table scan to calculate this.
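To make the difference between count(col) and count(distinct col) concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in because it runs without a server; the table t and its data are made up for illustration, and nothing here shows how MySQL itself plans the scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "b"), (None, "b")],
)

# COUNT(col) counts non-NULL values; both aggregates come from one statement.
plain = conn.execute("SELECT COUNT(col1), COUNT(col2) FROM t").fetchone()

# COUNT(DISTINCT col) counts distinct non-NULL values, again in one statement.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2) FROM t"
).fetchone()

print(plain, distinct)
```

All N counts are requested in a single statement, so there is no need for a hand-written loop over the records.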
In my script, I have a lot of SQL WHERE clauses, e.g.:
SELECT * FROM cars WHERE active=1 AND model='A3';
SELECT * FROM cars WHERE active=1 AND year=2017;
SELECT * FROM cars WHERE active=1 AND brand='BMW';
I am using different WHERE clauses on the same table because I need different data.
I would like to add indexes to the cars table, but I am not sure how to do it. Should I create a separate index for each column (active, model, year, brand), or should I create composite indexes for the pairs (active,model and active,year and active,brand)?
WHERE a=1 AND y='m'
is best handled by INDEX(a,y) in either order. The optimal set of indexes is several pairs like that. However, I do not recommend having more than a few indexes. Try to limit it to queries that users actually make.
INDEX(a,b,c,d):
WHERE a=1 AND b=22 -- Index is useful
WHERE a=1 AND d=44 -- Index is less useful
Only the "left column(s)" of an index are used. Hence the second case uses a, but stops there because b is not in the WHERE.
You might be tempted to also have (active, year, model). That combination works well for active AND year, and for active AND year AND model, but not for active AND model without year.
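The "left column(s)" rule can be sketched as a small function. This is a deliberately simplified model, not MySQL's optimizer: it only considers equality predicates and ignores ranges, index condition pushdown and covering indexes. The function name usable_prefix is made up for this illustration.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality predicates can use."""
    used = []
    for col in index_columns:
        if col not in equality_columns:
            break  # the first missing column ends the usable prefix
        used.append(col)
    return used


index = ("a", "b", "c", "d")
print(usable_prefix(index, {"a", "b"}))  # WHERE a=1 AND b=22
print(usable_prefix(index, {"a", "d"}))  # WHERE a=1 AND d=44
print(usable_prefix(("active", "year", "model"), {"active", "model"}))
```

In the second and third calls only the first column is usable, which is why the index is "less useful" for those queries.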
More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Since model implies a make, there is little use to put both of those in the same composite index.
year is not very selective, and users might want a range of years. These make it difficult to get an effective index on year.
How many rows will you have? If it is millions, we need to work harder to avoid performance problems. I'm leaning toward this, but only because the lookup would be more compact.
We use single-column indexes when we query on just one column, as in your case, and composite (multi-column) indexes when we have multiple conditions in the same WHERE clause.
Go for single indexing.
For a more detailed explanation, refer to this article: https://www.sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/
I have a MySQL table which I'm trying to search a pair of columns for a single value. It's quite a large table, so I want the search time as fast as possible.
I have simplified the tables below for ease of understanding
SELECT * FROM clients WHERE name=? OR sirname=?
VS
SELECT * FROM clients WHERE ? IN (name, sirname)
with indexes on name and sirname
EXPLAIN on the former uses the indexes, but not on the latter
Is this accurate, or is there some optimisation going on under the hood which EXPLAIN doesn't catch?
Strongly related to Checking multiple columns for one value, but cannot discuss there due to age of thread.
Because MySQL generally uses one index per table reference in a query, you will probably have to do it this way:
SELECT * FROM clients WHERE name=?
UNION
SELECT * FROM clients WHERE sirname=?
This will count as two table references for purposes of index selection. The appropriate index will be used in each case.
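As a quick sanity check that the UNION rewrite returns the same rows as the OR version, here is a minimal sketch. It runs on Python's built-in sqlite3 as a stand-in, so it says nothing about which indexes MySQL picks (use EXPLAIN on your own server for that); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, sirname TEXT)"
)
conn.execute("CREATE INDEX idx_name ON clients (name)")
conn.execute("CREATE INDEX idx_sirname ON clients (sirname)")
conn.executemany(
    "INSERT INTO clients (name, sirname) VALUES (?, ?)",
    [("Smith", "Jones"), ("Jones", "Jones"), ("Ann", "Lee")],  # made-up data
)

term = "Jones"
with_or = conn.execute(
    "SELECT * FROM clients WHERE name = ? OR sirname = ?", (term, term)
).fetchall()
with_union = conn.execute(
    "SELECT * FROM clients WHERE name = ? "
    "UNION "
    "SELECT * FROM clients WHERE sirname = ?",
    (term, term),
).fetchall()

# UNION (without ALL) removes the row that matches both conditions twice.
print(sorted(with_or) == sorted(with_union))
```

Note that UNION has to deduplicate; UNION ALL would skip that step but would return the ("Jones", "Jones") row twice.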
I have a table with some denormalized data for a specific purpose (don't ask), so it has several hundred columns. There is a primary key.
This table is updated weekly, but most ids will have the same data as the week before.
Now, I need to store all record versions in a history table, i.e. if record with id X is added week N, no changes week N+1 but some data changed week N+2 and N+3, then the history table should contain three records: Those from weeks N, N+2 and N+3.
It's technically easy to write the appropriate insert query, but it would involve comparison of each column, so it will be a very long SQL query. I'm sure it would work, but...
Is there any way in MySQL to compare ALL columns without explicitly writing ...or t1.col1 <> t2.col1... for each column? I.e. something like ...t1.allcolumns <> t2.allcolumns..., like comparing the entire row in one go?
I'm pretty sure the answer is no, but... :-)
You can write a program (in your favourite programming language) to build the query. The program would look in the schema for the database, find all the columns of the table, and construct the query from that. I don't think it is possible to do that in plain SQL, but even if possible, plain SQL is probably the wrong tool.
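A minimal sketch of such a generator follows. The column names and the t1/t2 aliases are placeholders; in a real script the column list would presumably be read from information_schema.columns. It uses MySQL's NULL-safe equality operator <=> so that two NULLs count as "unchanged", which is my choice rather than something the question asked for. Identifiers are only wrapped in backticks, so the sketch assumes trusted column names.

```python
def build_row_diff_predicate(columns, left="t1", right="t2"):
    """Build a MySQL predicate that is true when any listed column differs."""
    if not columns:
        raise ValueError("at least one column is required")
    # NOT (a <=> b): NULL-safe comparison, so NULL vs NULL is treated as equal.
    parts = [f"NOT ({left}.`{col}` <=> {right}.`{col}`)" for col in columns]
    return "\n   OR ".join(parts)


# Hypothetical column list; replace with the real one from the schema.
predicate = build_row_diff_predicate(["col1", "col2", "col3"])
print(predicate)
```

The resulting text can be pasted into the WHERE clause of the insert-into-history query.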
You can use the row-values syntax, but you still have to name all columns:
(t1.col1, t1.col2, ...) <> (t2.col1, t2.col2, ...)
Update 1
Check this out: https://www.techonthenet.com/mysql/intersect.php
Intersect t1 and t2. Result = rows on both tables.
Select all fields from t1 where the row is not in your intersect result.
Sorry for the lack of code, I don't have time to elaborate, but that's the idea.
Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe that would be O(1)*NUM_DISTINCT_COLS vs O(NUM_OF_ROWS).
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, visually you could simply count the last item in each group, so the time complexity would be tied to the number of distinct groups and not the number of rows. The calculation would then take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, basically calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be calculated in O(1) time for each distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page on my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression that this would take longer as the number of messages grows, so instead I have a table of stats for each category. When a new message is sent or created, I increment the total_messages field. When I want to view the stats page, I simply select a single row
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
If you are doing:
select distinct column
from table
and there is an index on column, then MySQL can process this query using a "loose index scan" (described in the GROUP BY Optimization section of the MySQL Reference Manual).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
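To illustrate where an estimate like O(m * log(n)) could come from, here is a toy model. It treats the index as a plain sorted Python list and uses binary search to jump past each run of equal keys. This is only an analogy for the idea of skipping; it is not how InnoDB's B-tree is actually traversed, and the function name is made up.

```python
from bisect import bisect_right


def distinct_by_skipping(sorted_keys):
    """Collect distinct values from a sorted list by jumping over duplicates."""
    distinct = []
    pos = 0
    while pos < len(sorted_keys):
        value = sorted_keys[pos]
        distinct.append(value)
        # Binary search (O(log n)) for the first position with a larger key.
        pos = bisect_right(sorted_keys, value, pos)
    return distinct


keys = [1] * 1000 + [2] * 5000 + [7] * 10  # 6010 "rows", 3 distinct values
print(distinct_by_skipping(keys))
```

The loop body runs once per distinct value (m times), and each jump costs a binary search over at most n entries.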
This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce uniqueness, a plain index is not enough; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
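A small sketch of what the unique index buys you, again on Python's built-in sqlite3 as a stand-in (MySQL would report a duplicate-key error rather than sqlite3.IntegrityError). The table name items is a placeholder, since table itself is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, x INTEGER)")
conn.execute("CREATE UNIQUE INDEX my_unique_index ON items (x)")
conn.execute("INSERT INTO items (x) VALUES (1)")

try:
    # A second row with the same x violates the unique index.
    conn.execute("INSERT INTO items (x) VALUES (1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute(
    "SELECT COUNT(*) FROM items WHERE x = ?", (1,)
).fetchone()[0]
print(rejected, count)
```

With the constraint in place the count can only ever be 0 or 1, so it doubles as the TRUE/FALSE answer.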
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column both queries will be more or less equal performance. The second query will need 1 less line of code in your program but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although, on second thought, it would have to do that in the first option as well, so that the code knows how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster, since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id: if the engine is using an index on x, it then doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
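Here is a minimal sketch of calling the EXISTS form from application code. It uses Python's built-in sqlite3 as a stand-in, so the "from dual" part is dropped; the helper name row_exists and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, x INTEGER)")
conn.execute("CREATE INDEX idx_x ON mytable (x)")
conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(1,), (2,), (2,)])


def row_exists(connection, x):
    """Return True if at least one row has the given x."""
    query = "SELECT EXISTS(SELECT 1 FROM mytable WHERE x = ?)"
    (flag,) = connection.execute(query, (x,)).fetchone()  # flag is 1 or 0
    return bool(flag)


print(row_exists(conn, 1), row_exists(conn, 99))
```

The query always returns exactly one row, so the caller never has to check whether a result set came back empty.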
Typically, you use a GROUP BY ... HAVING clause to determine whether there are duplicate rows in a table. Say you have a table with an id and a name (assuming id is the primary key, and you want to know whether name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return each name that is repeated, together with the number of times it occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.
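A runnable sketch of the duplicate check, using Python's built-in sqlite3 as a stand-in with invented rows. I wrote HAVING COUNT(*) > 1 instead of referring to the alias, which is the more portable spelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable (name) VALUES (?)",
    [("ann",), ("bob",), ("ann",), ("ann",)],  # made-up data
)

# Each repeated name with the number of times it occurs.
duplicates = conn.execute(
    "SELECT name, COUNT(*) AS total FROM mytable "
    "GROUP BY name HAVING COUNT(*) > 1"
).fetchall()

# Collapse the result to a single true/false answer.
has_duplicates = len(duplicates) > 0
print(duplicates, has_duplicates)
```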