When is GROUP BY required for aggregate functions?


I have a table called myEntity as follows:
- id (PK INT NOT NULL)
- account_id (FK INT NOT NULL)
- key (INT NOT NULL. UNIQUE for given account_id)
- name (VARCHAR NOT NULL. UNIQUE FOR given account_id)
I don't wish to expose the primary key id to the user, so I added key for this purpose. key acts as a kind of auto-increment column for a given account_id, which the application will need to maintain manually. I first planned on making the primary key composite id-account_id; however, the table is joined to other tables, and before I knew it, I had four columns in a table which could have been one. While account_id-name does the same as account_id-key, key is smaller and will minimize network traffic when a client requests multiple records. Yes, I know it isn't properly normalized, and while it's not my direct question, I would appreciate any constructive criticism.
Sorry for the rambling... When is GROUP BY required for an aggregate function? For instance, what about the following? https://stackoverflow.com/a/1547128/1032531 doesn't show one. Is it needed?
SELECT COALESCE(MAX(`key`),0)+1 FROM myEntity WHERE accounts_id=123;

You gave a query as an example not requiring GROUP BY. For the sake of explanation, I'll simplify it as follows.
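SELECT MAX(`key`)
FROM myEntity
WHERE accounts_id = 123
Why doesn't that query require GROUP BY? Because an aggregate function without GROUP BY treats every row that passes the WHERE clause as a single group, so the result set has exactly one row, describing that one account. (key is quoted with backticks here because KEY is a reserved word in MySQL.)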
What if you wanted a result set describing all your accounts with one row per account? Then you would use this:
SELECT accounts_id, MAX(`key`)
FROM myEntity
GROUP BY accounts_id
See how that goes? You get one row in this result set for each distinct value of accounts_id. By the way, MySQL's query planner knows that
SELECT accounts_id, MAX(`key`)
FROM myEntity
WHERE accounts_id = 123
GROUP BY accounts_id
is equivalent to the same query omitting the GROUP BY clause.
One more thing to know: If you have a compound index on (accounts_id, key) in your table, all these queries will be almost miraculously fast because the query planner will satisfy them with a very efficient loose index scan. That's specific to the MAX() and MIN() aggregate functions. Loose index scans can't be used for SUM() or AVG() or similar functions; those require tight index scans.
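To see the two shapes side by side, here is a minimal sketch. It uses Python's built-in sqlite3 module instead of MySQL purely so it can run anywhere, and the sample rows and the helper names make_db, next_key and max_key_per_account are invented for illustration. The key column is written as "key" with double quotes, which is how SQLite quotes identifiers; in MySQL you would use backticks.

```python
import sqlite3


def make_db():
    """Build an in-memory table shaped like myEntity, with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE myEntity ('
        ' id INTEGER PRIMARY KEY,'
        ' accounts_id INTEGER NOT NULL,'
        ' "key" INTEGER NOT NULL,'
        ' name TEXT NOT NULL,'
        ' UNIQUE (accounts_id, "key"))'
    )
    conn.executemany(
        'INSERT INTO myEntity (accounts_id, "key", name) VALUES (?, ?, ?)',
        [(123, 1, "a"), (123, 2, "b"), (123, 3, "c"), (456, 1, "d")],
    )
    return conn


def next_key(conn, accounts_id):
    """No GROUP BY: all rows passing WHERE form one group, so one row comes back."""
    row = conn.execute(
        'SELECT COALESCE(MAX("key"), 0) + 1 FROM myEntity WHERE accounts_id = ?',
        (accounts_id,),
    ).fetchone()
    return row[0]


def max_key_per_account(conn):
    """GROUP BY: one result row per distinct accounts_id."""
    rows = conn.execute(
        'SELECT accounts_id, MAX("key") FROM myEntity GROUP BY accounts_id'
    )
    return dict(rows.fetchall())
```

Note that next_key still returns a row for an account with no entities yet: the single group is empty, MAX yields NULL, and COALESCE turns that into 0.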

It's only needed when you want one result row per group rather than one row overall. For example, if you wanted the next key for every account, you could use
SELECT accounts_id, COALESCE(MAX(`key`),0)+1 FROM myEntity GROUP BY accounts_id
rather than your select. But your select is fine (though it seems like you may have made things a little hard for yourself with your structure, but I don't know what issues you're trying to address).

Related

Choosing the optimal index for SQL Query

I'm doing some problem sets in my database management course and I can't figure this specific problem out.
We have the following relation:
Emp (id, name, age, sal, ...)
And the following query:
SELECT id
FROM Emp
WHERE age > (select max(sal) from Emp);
We are then supposed to choose the index that a good query optimizer would use. My answer would be to just use Emp(age), but the solution to the question is
Emp(age) & Emp(sal)
How come there are two indices? I can't seem to wrap my head around why you would need more than the age attribute.

Of course, you realize that the query is nonsensical, comparing age to sal (which is presumably a salary). That said, two indexes are appropriate for:
SELECT e.id
FROM Emp e
WHERE e.age > (select max(e2.sal) from Emp e2);
I added table aliases to emphasize that the query is referring to the Emp table twice.
To get the maximum sal from the table, you want an index on emp(sal). The maximum is a simple index lookup operation.
Then you want to compare this to age. Well, for a comparison to age, you want an index on emp(age). This is an entirely separate reference to emp that has no reference to sal, so you cannot put the two columns in a single index.
The index on age may not be necessary. The query may return lots of rows, and queries that return lots of rows don't generally benefit from a secondary index. The one case where it can benefit from the index is if age is the clustered index (that is, typically the first column in the primary key). However, I wouldn't recommend such an indexing structure.
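Here is a rough sketch of the two-index setup, again using Python's sqlite3 module as a stand-in for whatever database the course uses. The rows and the index names idx_emp_sal and idx_emp_age are made up, and SQLite's planner is not the same as MySQL's, so treat the printed plan as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, sal INTEGER)"
)
# Invented rows; sal is kept small so the (nonsensical) comparison matches something.
conn.executemany(
    "INSERT INTO Emp (name, age, sal) VALUES (?, ?, ?)",
    [("ann", 30, 20), ("bob", 45, 40), ("cy", 61, 25)],
)

# One index per reference to Emp: sal for the subquery, age for the outer filter.
conn.execute("CREATE INDEX idx_emp_sal ON Emp (sal)")
conn.execute("CREATE INDEX idx_emp_age ON Emp (age)")

QUERY = """
SELECT e.id
FROM Emp e
WHERE e.age > (SELECT MAX(e2.sal) FROM Emp e2)
"""

ids = sorted(row[0] for row in conn.execute(QUERY))

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
for line in plan:
    print(line)
```

The exact plan text depends on the SQLite version, so no expected output is shown; the point is only that each reference to Emp gets its own chance to pick an index.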

You need both indexes to get optimal performance:
1) The subquery (select max(sal) from Emp) will benefit from an index on Emp(sal) because, on a tree index, retrieving the maximum is much quicker.
2) The outer query filters on age, so it also benefits from a tree index on Emp(age).

MySQL index key on table with more columns

In my script, I have a lot of SQL WHERE clauses, e.g.:
SELECT * FROM cars WHERE active=1 AND model='A3';
SELECT * FROM cars WHERE active=1 AND year=2017;
SELECT * FROM cars WHERE active=1 AND brand='BMW';
I am using different SQL clauses on same table because I need different data.
I would like to set index key on table cars, but I am not sure how to do it. Should I set separate keys for each column (active, model, year, brand) or should I set keys for groups (active,model and active,year and active,brand)?

WHERE a=1 AND y='m'
is best handled by INDEX(a,y) in either order. The optimal set of indexes is several pairs like that. However, I do not recommend having more than a few indexes. Try to limit it to queries that users actually make.
INDEX(a,b,c,d):
WHERE a=1 AND b=22 -- Index is useful
WHERE a=1 AND d=44 -- Index is less useful
Only the leftmost column(s) of an index are used. Hence the second case uses a but stops there, because b is not in the WHERE.
You might be tempted to also have (active, year, model). That combination works well for active AND year, and for active AND year AND model, but not for active AND model (without year).
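The leftmost-prefix rule can be written down as a tiny model. This is only a sketch of the rule as described above, not of what MySQL's optimizer really does (it ignores ranges, ORDER BY, and cost estimates), and usable_prefix is a made-up helper name.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality tests in WHERE can use.

    Simplified model: walk the index left to right and stop at the first
    column that the WHERE clause does not test.
    """
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break
        prefix.append(column)
    return prefix
```

For INDEX(a,b,c,d), a WHERE on a and b uses both columns, while a WHERE on a and d only gets as far as a. Likewise, (active, year, model) only helps with active when the query filters on active and model.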
More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Since model implies a make, there is little use to put both of those in the same composite index.
year is not very selective, and users might want a range of years. These make it difficult to get an effective index on year.
How many rows will you have? If it is millions, we need to work harder to avoid performance problems. I'm leaning toward this, but only because the lookup would be more compact.

We use single-column indexes when we want to query on just one column, the same as in your case, and multi-column (composite) indexes when we have multiple conditions in the same WHERE clause.
Go for single indexing.
For a more detailed explanation, refer to this article: https://www.sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

What will be the behavior of the index in these two scenarios in relational databases like MySQL?

Let's say I have a table students with the following fields
id,student_id,test_type,score
Consider these two queries
select * from students where student_id = x and score > y
select * from students where student_id = x order by score
Let's say I have indexes on both student_id and score but not a composite index. Which indexes will the database use? Will the query be able to use both indexes, or can at most one index be used?
Let's say the student_id index lets me restrict the results of the query. Will I then be able to use the score index for sorting or filtering?
Or, if the database chooses the score index to sort or filter first, will I be able to use the student_id index for the student_id = x filtering?

MySQL's optimizer would like the composite INDEX(student_id, score) for both queries.
Without the composite index... The optimizer almost never uses two indexes. The optimizer would pick between INDEX(student_id) and INDEX(score).
But there is another wrinkle -- If this table is InnoDB, and if it has PRIMARY KEY(student_id), then INDEX(score) implicitly has student_id tacked on the end. Hence INDEX(score) would be perfect for the first query.
Given two indexes, the optimizer looks at cardinality and various other things to pick between them.
More on creating the best index.
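As a rough illustration of the composite index suggested above, here is a sketch using Python's sqlite3 module rather than MySQL. The rows, the index name idx_student_score, and the two helper functions are invented, and an ORDER BY is added to the first query only so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    " id INTEGER PRIMARY KEY, student_id INTEGER, test_type TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO students (student_id, test_type, score) VALUES (?, ?, ?)",
    [(7, "quiz", 55), (7, "exam", 90), (7, "lab", 72), (8, "quiz", 99)],
)

# Equality column first, then the column used for the range test or the sort.
conn.execute("CREATE INDEX idx_student_score ON students (student_id, score)")


def scores_above(conn, student_id, min_score):
    """First query: equality on student_id plus a range on score."""
    rows = conn.execute(
        "SELECT score FROM students"
        " WHERE student_id = ? AND score > ? ORDER BY score",
        (student_id, min_score),
    )
    return [r[0] for r in rows]


def scores_sorted(conn, student_id):
    """Second query: equality on student_id, results ordered by score."""
    rows = conn.execute(
        "SELECT score FROM students WHERE student_id = ? ORDER BY score",
        (student_id,),
    )
    return [r[0] for r in rows]
```

Within one student_id, the index entries are already ordered by score, which is why a single (student_id, score) index can serve both the range filter and the sort.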

Well, it definitely depends on your data set and database. Imagine the students table has 100 different ids but all with the same student_id. The student_id index would then be considered bad, and the Teradata query optimizer would be smart enough to choose a better one, like score or id (if you are using Teradata, but most databases have smart features like this built in). A composite index certainly wouldn't be selected, because I don't think it would help the fetch at all for this table. The best way to pick a good index is to ask which column can provide a solid, near-unique value that is inexpensive to compare (an integer) and can eliminate a good partition or chunk of the data when selected. But yes, student_id would be the best index in this case. Plus, the query that ends with "and score > y" would be quicker: the WHERE clause is always applied first, so the data set will be much smaller.

Correct database operation to get the data desired

Ok, so I've got a MySQL database with several tables. One of the tables (table A) has the items of most interest to me.
It has a column called type and a column called entity_id. The primary key is something called registration_id, which is more or less irrelevant to me currently.
Ultimately, I want to gather all items of a particular type, but which have a unique entity_id. The only problem with this is that entity_id in table A is NOT a unique key. It is possible to have multiple registration_ids per entity_id.
Now, there's another table (table B) which has only a list of unique entity_ids (that is, it is the primary key on that table), however there's no information on the type in that table.
So with these two tables, what is the best way to get the data I want?
I was thinking some sort of way (DISTINCT) that I could use on the first table, alone, or possibly a join of some sort (I'm still relatively new to the concept of joins) between table A and table B, combining the entity_id from table B with the type from table A.
What's the most efficient database operation for this for now? And should I (eventually, not right now as I simply do not have the time, sadly) change the database structure for greater efficiency?
If anyone needs any additional information or graphics, let me know.

If I understand correctly you can use either GROUP BY
SELECT entity_id
FROM table1
WHERE type = ?
GROUP BY entity_id
or DISTINCT
SELECT DISTINCT entity_id
FROM table1
WHERE type = ?
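A quick way to convince yourself that the two forms return the same entity_ids is to run both against a small table. This sketch uses Python's sqlite3 module instead of MySQL, with invented rows and made-up helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    " registration_id INTEGER PRIMARY KEY, entity_id INTEGER, type TEXT)"
)
# entity_id repeats on purpose: several registrations per entity.
conn.executemany(
    "INSERT INTO table1 (entity_id, type) VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "x"), (3, "y"), (2, "x")],
)


def entity_ids_group_by(conn, wanted_type):
    rows = conn.execute(
        "SELECT entity_id FROM table1 WHERE type = ? GROUP BY entity_id",
        (wanted_type,),
    )
    return sorted(r[0] for r in rows)


def entity_ids_distinct(conn, wanted_type):
    rows = conn.execute(
        "SELECT DISTINCT entity_id FROM table1 WHERE type = ?",
        (wanted_type,),
    )
    return sorted(r[0] for r in rows)
```

Neither form needs a join against table B, since table A already holds both entity_id and type.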

Table Joins are a costly operation. If you are dealing with large datasets then the time it takes to execute a join operation is non-negligible.
The following SQL statement will grab all entity_id's and group them by type. So for each entity_id only 1 of each type will be in the result set:
SELECT type, entity_id FROM TableA GROUP BY type, entity_id;

I think this is what you are looking for. Try this to give you the types that have only one (unique) entity_id.
SELECT type , count(entity_id)
FROM table1
GROUP BY type
HAVING COUNT(entity_id)=1

(Why) Can't MySQL use index in such cases?

1 - PRIMARY used in a secondary index, e.g. secondary index on (PRIMARY,column1)
2 - I'm aware mysql cannot continue using the rest of an index as soon as one part was used for a range scan, however: IN (...,...,...) is not considered a range, is it? Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently than BETWEEN according to the use of index.
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So, the outer SELECT uses timestamp_lastonline for sorting, the inner either PK to connect with the outer or birthdate for filtering.
What options other than this query are there if MySQL cannot use an index both for a range scan and for sorting?

The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone.
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.

2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN only needs to find the starting point and then read forward from there. With IN, the database looks up each value in the index separately, so it performs as many lookups as there are values, compared to BETWEEN's single lookup.

Yes, @Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than-or-equal and less-than-or-equal, and therefore absolutely loves indexes.
I honestly have no idea what you are asking in your first point, though it does seem like a good question :-). Are you saying that a primary key column cannot appear in a secondary index? Because it absolutely can. The primary key never needs a separate index because it is ALWAYS indexed automatically, so if you are getting an error or warning (I assume you are?) about supplementary indexes, then it's not the second or third index causing it; it's the PRIMARY KEY not needing one, and your mentioning it is probably the error. Having said that, this is my answer to my best guess at your actual question.
