MySQL: Indexes on GROUP BY

I have a reasonably big table (>10,000 rows) which is going to grow much bigger, fast. On this table I run the following query:
SELECT *, MAX(a) FROM table GROUP BY b, c, d
Currently EXPLAIN tells me that there are no keys, no possible keys and it's "Using temporary; Using filesort". What would the best key be for such a table?

What about composite key b+c+d+a?
Btw, SELECT * makes no sense when you have a GROUP BY.

A primary key on fields b, c, d would be nice, if applicable.
In that case you just do a
SELECT * FROM table1
group by <insert PRIMARY KEY here>
If not, put an index on b, c, d.
And maybe on a, depending on performance.
If b,c,d are always used in unison, use a composite index on all three.
Very important! Always declare a primary key. Without it performance on InnoDB will suck.
To elaborate on @zerkms's answer: you only need to put those columns in the GROUP BY clause that completely define the rows you are selecting.
If you SELECT *, that may be OK, but then the MAX(a) is not needed and neither is the GROUP BY.
Also note that the max(a) may come from a different row than the rest of the fields.
The only use case that does make sense is:
select t1.*, count(*) as occurrence from t1
inner join t2 on (t1.id = t2.manytoone_id)
group by t1.id
Where t1.id is the PK.
I think you need to rethink that query.
Ask a new question explaining what you want with the real code.
And make sure to ask how to make the outcome deterministic, so that all values shown are functionally dependent on the GROUP BY clause.

In the end what worked was a modification to the query as follows:
SELECT b, c, d, e, f, MAX(a) FROM table GROUP BY b, c, d
And creating an index on (b, c, d, e, f).
Thanks a lot for your help: the tips here were very useful.
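For anyone landing here later, the final setup looks roughly like this (a sketch; the table name t and the index name are placeholders, the columns are the ones from the question):

CREATE INDEX idx_b_c_d_e_f ON t (b, c, d, e, f);

-- With the index in place, the GROUP BY can read its groups from the
-- leading columns (b, c, d) of the index instead of building a
-- temporary table and doing a filesort:
SELECT b, c, d, e, f, MAX(a)
FROM t
GROUP BY b, c, d;
-- Note: e and f are not aggregated and, unless the data guarantees it,
-- not functionally dependent on (b, c, d), so under ONLY_FULL_GROUP_BY
-- this query would be rejected (see the note about determinism above).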

Related

MariaDB subquery use whole row

Usually subqueries compare single or multiple fields, and DELETE statements usually delete rows by an ID. Unfortunately I don't have an ID field, and I have to use a generic approach for different kinds of tables.
That's why I am working with a subquery that uses LIMIT and OFFSET to pick the row.
I know that approach is risky; however, is there any way to delete rows by subquerying and comparing the whole row?
DELETE FROM table WHERE * = ( SELECT * FROM table LIMIT 1 OFFSET 6 )
I am using the latest version of MariaDB
This sounds like a really strange need, but who am I to judge? :)
I would simply rely on the primary key:
DELETE FROM table WHERE id_table = (SELECT id_table FROM table LIMIT 1 OFFSET 6)
update: oh, so you don't have a primary key? You can join on the whole row this way (assuming it has five columns named a, b, c, d, e):
DELETE t
FROM table t
INNER JOIN (
    SELECT a, b, c, d, e
    FROM table
    ORDER BY a, b, c, d, e
    LIMIT 1 OFFSET 6
) ROW6 USING (a, b, c, d, e);
Any subset of columns (e.g. a, c, d) that uniquely identify a row will do the trick (and is probably what you need as a primary key anyway).
Edit: Added an ORDER BY clause as per The Impaler's excellent advice. That's what you get for knocking an example up quickly.
A simpler option, since a single-table DELETE supports ORDER BY and LIMIT (though not OFFSET):
DELETE FROM t
ORDER BY ... -- fill in as needed
LIMIT 6
(Works on any version)
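For completeness, MariaDB/MySQL do let you compare a whole row with a row constructor. A sketch of that approach, reusing the assumed columns a..e from the answer above (t is a stand-in table name, since "table" is a reserved word; the extra derived table forces materialization to work around the "can't specify target table in FROM clause" restriction):

DELETE FROM t
WHERE (a, b, c, d, e) = (
    SELECT a, b, c, d, e FROM (
        SELECT a, b, c, d, e
        FROM t
        ORDER BY a, b, c, d, e
        LIMIT 1 OFFSET 6
    ) pick
);
-- Caveats: if several rows share exactly these values they are all
-- deleted, and NULLs never compare equal, so such a row would not match.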

I have 17,949,366 rows of data in MySQL Workbench, and I am trying to write a query to find duplicate data.

SELECT id, survey_id
FROM Table1
WHERE survey_id IN (
    SELECT survey_id
    FROM Table1
    GROUP BY survey_id
    HAVING COUNT(id) > 1
)
This is my query, but I have big data and I guess it is still fetching in MySQL Workbench. Any idea how I can make this process faster?
Sometimes EXISTS performs better because it returns as soon as it finds the row:
SELECT t.id, t.survey_id
FROM Table1 AS t
WHERE EXISTS (
    SELECT 1
    FROM Table1
    WHERE id <> t.id AND survey_id = t.survey_id
)
I assume id is the primary key in the table.
You can group your data without subqueries:
SELECT survey_id, GROUP_CONCAT(id) AS ids
FROM Table1
GROUP BY survey_id
HAVING COUNT(id) > 1;
SELECT COUNT(*), column FROM table GROUP BY column HAVING COUNT(column) > 1
You can simply group directly; there is no need for a subquery.
Try adding an index on the column.
Use EXPLAIN to see the query execution plan.
On large sets, we will get better performance when an index can be used to satisfy a GROUP BY, rather than a "Using filesort" operation.
Personally, I'd avoid the IN (subquery) and instead use a join to a derived table. I don't know that this has any impact on performance, or in which versions of MySQL there might be a difference. Just my personal preference to write the query this way:
SELECT t.id
     , t.survey_id
  FROM ( -- inline view
         SELECT s.survey_id
           FROM Table1 s
          GROUP BY s.survey_id
         HAVING COUNT(s.id) > 1
       ) r
  JOIN Table1 t
    ON t.survey_id = r.survey_id
We do want an index that has survey_id as the leading column. That allows the GROUP BY to be satisfied from the index, avoiding a potentially expensive "Using filesort" operation. That same index will also be used for the join to the original table.
CREATE INDEX Table1_IX2 ON Table1 (survey_id, id, ...)
NOTE: If this is InnoDB and if id is the cluster key, then including the id column doesn't use any extra space (it does enforce some additional ordering), but more importantly it makes the index a covering index for the outer query (query can be satisfied entirely from the index, without lookups of pages in the underlying table.)
With that index defined, we'd expect the EXPLAIN output Extra column to show "Using index" for the outer query, and to omit "Using filesort" for the derived table (inline view).
Again, use EXPLAIN to see the query execution plan.
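Checking that is just a matter of prefixing the query with EXPLAIN (a sketch; the exact output columns vary by MySQL version):

EXPLAIN
SELECT t.id, t.survey_id
FROM (
    SELECT s.survey_id
    FROM Table1 s
    GROUP BY s.survey_id
    HAVING COUNT(s.id) > 1
) r
JOIN Table1 t ON t.survey_id = r.survey_id;
-- In the Extra column, look for "Using index" on the outer query and
-- the absence of "Using filesort" for the derived table.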

GROUP BY a, b VS GROUP BY b, a

I was wondering if there is any difference between order of grouping in GROUP BY a, b and GROUP BY b, a (I know the final result is the same). If so, would it affect the query's speed?
A GROUP BY clause just defines the unique combination of field(s) that constitutes a group. There is no meaning to the order in which these fields are stated.
It does matter if you have multiple-column indexes. You should define the GROUP BY columns in the order of the index.
So, if you have an index for (a,b) then you should use GROUP BY a, b and MySQL is able to take full advantage of the index.
See the example below.
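A minimal sketch of that point (the table, columns, and index name here are illustrative):

CREATE TABLE t (
    a INT,
    b INT,
    val INT,
    KEY idx_a_b (a, b)
);

-- Listing the GROUP BY columns in the same order as the index lets
-- MySQL read the groups straight from idx_a_b instead of building a
-- temporary table and doing a filesort:
EXPLAIN SELECT a, b, COUNT(*) FROM t GROUP BY a, b;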

Query using LEFT() function has bad performance

I am using an INNER JOIN and a WHERE with the LEFT() function to match records by their first 8 characters.
INSERT INTO result
SELECT id
FROM tableA a
INNER JOIN tableB b ON a.zip = b.zip
WHERE LEFT(a.street, 8) = LEFT(b.street, 8)
Both a.street and b.street are indexed (partial index 8).
The query didn't finish in 24+ hours. I am wondering whether there is a problem with the indexes, or whether there is a more efficient way to perform this task.
MySQL won't use indexes on columns that have a function applied to them.
Other databases do allow function-based indexes.
You could create a column containing just the first 8 characters of a.street and b.street, index those, and things will be quicker.
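On a reasonably recent MySQL (5.7+) or MariaDB (10.2+) you can even let the database maintain that prefix column for you with a generated column. A sketch, where the column and index names are assumptions:

ALTER TABLE tableA
    ADD COLUMN street8 VARCHAR(8) AS (LEFT(street, 8)),
    ADD INDEX idx_zip_street8 (zip, street8);

ALTER TABLE tableB
    ADD COLUMN street8 VARCHAR(8) AS (LEFT(street, 8)),
    ADD INDEX idx_zip_street8 (zip, street8);

-- The join then compares plain indexed columns instead of function results:
INSERT INTO result
SELECT a.id
FROM tableA a
INNER JOIN tableB b
    ON a.zip = b.zip
   AND a.street8 = b.street8;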
This is your query:
INSERT INTO result
SELECT id
FROM tableA a INNER JOIN
tableB b ON a.zip=b.zip
WHERE LEFT(a.street,8)=LEFT(b.street,8);
MySQL is not smart enough to use a prefix index with this comparison. It will use a prefix index for LIKE and direct string comparisons. If I assume that id comes from tableA, then the following may perform better:
INSERT INTO result(id)
SELECT id
FROM tableA a
WHERE EXISTS (
    SELECT 1
    FROM tableB b
    WHERE a.zip = b.zip
      AND b.street LIKE CONCAT(LEFT(a.street, 8), '%')
);
The index that you want is tableB(zip, street(8)) or tableB(zip, street). This may use both components of the index. In any case, it might get better performance even if it cannot use both sides of the index.
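Spelled out, that index suggestion looks like this (the index name is illustrative; street(8) is a prefix index over the first 8 characters):

CREATE INDEX idx_tableB_zip_street8 ON tableB (zip, street(8));
-- The b.street LIKE CONCAT(LEFT(a.street, 8), '%') comparison above can
-- use the zip column for the equality and the street prefix for the range.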

Is using an IN over a huge data set a good idea?

Let's say I have a query of the form:
SELECT a, b, c, d
FROM table1
WHERE a IN (
    SELECT x
    FROM table2
    WHERE some_condition
);
Now the query for the IN can return a huge number of records.
Assuming that a is the primary key (so an index is used), is this the best way to write such a query?
Or is it more optimal to loop over each of the records returned by the subquery?
For me it is clear that when I do WHERE a = X, I just do an index (tree) traversal.
But I am not sure how an IN (especially over a huge data set) would traverse/utilize an index.
The MySQL optimizer isn't really ready (yet) to handle this correctly. You should rewrite this kind of query as an INNER JOIN and index it correctly; this will be the fastest method, assuming t1.a and t2.x are unique.
Something like this:
SELECT
    a
  , b
  , c
  , d
FROM
    table1 AS t1
INNER JOIN
    table2 AS t2
    ON t1.a = t2.x
WHERE
    t2.some_condition ....
And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes
Having one query instead of a loop will definitely be more efficient (and consistent by nature; to get consistent results with a loop you would generally have to use a serializable transaction). One can argue in favour of EXISTS vs IN; as far as I remember MySQL generates (or at least it was true up to 5.1)...
The efficiency of utilizing the index on a depends on the number and order of the subquery results (assuming the optimizer chooses to grab the results from the subquery first and then compare them with a). In my understanding, the fastest option is to perform a merge join, which requires both result sets to be sorted by the same key; however, that may not be possible due to different sort orders. Then I guess it's the optimizer's decision whether to sort or to use a loop join. You can rely on its choice, or try using hints and see if it makes a difference.
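For reference, the EXISTS variant mentioned above would look something like this (a sketch reusing the names from the question; some_condition stands in for the real predicate on table2):

SELECT t1.a, t1.b, t1.c, t1.d
FROM table1 AS t1
WHERE EXISTS (
    SELECT 1
    FROM table2 AS t2
    WHERE t2.x = t1.a
      AND some_condition
);
-- With a as the primary key of table1 and an index on table2 whose
-- leading column is x, each probe into table2 is an index lookup.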