MySQL group-by very slow

I have the following SQL query:
SELECT CustomerID FROM sales WHERE `Date` <= '2012-01-01' GROUP BY CustomerID
The query is executed over 11,400,000 rows and runs very slowly: it takes over 3 minutes. If I remove the GROUP BY part, it runs in under 1 second. Why is that?
MySQL Server version is '5.0.21-community-nt'
Here is the table schema:
CREATE TABLE `sales` (
`ID` int(11) NOT NULL auto_increment,
`DocNo` int(11) default '0',
`CustomerID` int(11) default '0',
`OperatorID` int(11) default '0',
`Date` datetime default NULL, -- column restored: it is referenced by the WHERE clause and the `Date` key below; its exact type is not shown in the post
PRIMARY KEY (`ID`),
KEY `ID` (`ID`),
KEY `DocNo` (`DocNo`),
KEY `CustomerID` (`CustomerID`),
KEY `Date` (`Date`)
) ENGINE=MyISAM AUTO_INCREMENT=14946509 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Try putting an index on (Date,CustomerID).
Have a look at the MySQL manual's section on optimizing GROUP BY queries: GROUP BY Optimization.
You can find out how MySQL generates the result by using EXPLAIN, as follows:
EXPLAIN SELECT CustomerID FROM sales WHERE `Date` <= '2012-01-01' GROUP BY CustomerID
This will tell you which indexes (if any) MySQL is using to optimize the query. This is very handy when learning which indexes work for which queries, as you can try creating an index and see whether MySQL uses it. So even if you don't fully understand how MySQL calculates aggregate queries, you can arrive at a useful index by trial and error.
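As a sketch of that trial-and-error loop (the index and its name are chosen here for illustration, not taken from the original post):

```sql
-- Add a candidate composite index covering the WHERE range and the GROUP BY column.
ALTER TABLE sales ADD INDEX date_customer (`Date`, CustomerID);

-- Re-check the plan; if the index helps, the Extra column typically shows
-- "Using index" instead of "Using temporary; Using filesort".
EXPLAIN SELECT CustomerID
FROM sales
WHERE `Date` <= '2012-01-01'
GROUP BY CustomerID;
```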

Without knowing what your table schema looks like, it's difficult to be certain, but it would probably help if you added a multiple-column index on Date and CustomerID. That'd save MySQL the hassle of doing a full table scan for the GROUP BY statement. So try ALTER TABLE sales ADD INDEX (Date,CustomerID).

Try this one:
SELECT distinct CustomerID FROM sales WHERE `Date` <= '2012-01-01'

I had the same problem. I changed the key fields to the same collation and that fixed it: the fields used to join the tables had different COLLATE values.

Wouldn't this one be a lot faster and achieve the same?
SELECT DISTINCT CustomerID FROM sales WHERE `Date` <= '2012-01-01'
Make sure to place an index on Date, of course. I'm not entirely sure, but indexing CustomerID might also help.

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point in time will MySQL use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge, a good index for the above query is an index on table A (id, status, name, created_on, accessed_on) and another on B.id.
I also understand that SQL execution follows the order below, but I am not sure how index selection and ordering work together.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Will it be better to start the index with the id column, or does it not matter in this case since the WHERE clause is executed before the JOIN?
Second question: should the accessed_on column be at the beginning, middle, or end of the index combination? Or should the id column come after all the columns in the WHERE clause?
I would appreciate a detailed answer so I can understand how MySQL executes this.
UPDATED
I added a few million records to both tables A and B, then created multiple indexes to see which would be best. MySQL seems to prefer the index id_2 (i.e. (status, name, created_on, id, accessed_on)).
It seems to apply the WHERE first, figuring out that it needs an index on status, name, created_on; then it applies the INNER JOIN, using the id column after the first three. Finally, it looks for accessed_on as the last column. So the index (status, name, created_on, id, accessed_on) matches that execution order.
Here are the table structures:
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query are A(status, name, created_on) and B(id). These indexes satisfy the WHERE clause and allow the index to be used for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.
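A sketch of the recommended setup (the index name is illustrative, and it mirrors the existing `status_2` key; B needs no new index because the join uses its PRIMARY KEY):

```sql
-- Filter index on A: equality columns first, then the range column last.
ALTER TABLE a ADD INDEX idx_a_filter (status, name, created_on);

-- Verify the plan: the join to B should use B's PRIMARY KEY (id).
EXPLAIN SELECT *
FROM a
INNER JOIN b ON b.id = a.id
WHERE a.status = 1
  AND a.name = 'Mike'
  AND a.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY a.accessed_on DESC;
```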

Index for slow MySQL Query

I have a large table with user votes. I tried nearly every tutorial and essay about INDEX usage, and after that failed, every possible combination of fields as keys, but the query stays slow.
Is there any Index I can use to speed this up?
(I will spare you my hideous attempts at indexes so far...)
CREATE TABLE IF NOT EXISTS `votes` (
`uid` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`objectId` bigint(15) NOT NULL,
`vote` tinyint(1) NOT NULL,
`created` datetime NOT NULL,
UNIQUE KEY `unique input` (`uid`,`objectId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The table has around 1.3 million rows and will continue growing. This is the query I am trying:
EXPLAIN SELECT objectId, COUNT(uid) AS voteCount, AVG(vote) AS rating
FROM votes
GROUP BY objectId
Any pointers?
The only approach I can suggest is the following, although I don't know whether it would increase performance. It assumes that you have an objects table, with one row per ObjectId:
SELECT ObjectId,
(select count(*) from votes v where v.objectid = o.objectid) as votecount,
(select avg(vote) from votes v where v.objectid = o.objectid) as rating
FROM objects o;
Then you want the following index: votes(objectid, vote).
This would replace the outer group by with index scans, which may speed up the query.
I don't see how an index will help to speed this up, because the average function will require that you interact with each and every row. There's no WHERE clause.
Maybe you could create a VIEW that would amortize the cost for you.
I think duffymo is correct. However you could try swapping the two columns in your unique key around (or add an index for objectid only) as it may help the GROUP BY.
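A sketch of that suggestion (the index name is illustrative): an index led by objectId lets the GROUP BY read rows already in group order, and including the other referenced columns makes it a covering index, so the table itself is never touched.

```sql
-- objectId first so groups arrive in order; vote and uid included so the
-- whole query (COUNT(uid), AVG(vote)) can be answered from the index alone.
ALTER TABLE votes ADD INDEX idx_obj_vote_uid (objectId, vote, uid);

EXPLAIN SELECT objectId, COUNT(uid) AS voteCount, AVG(vote) AS rating
FROM votes
GROUP BY objectId;
```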

Alternative to GROUP BY to optimize SQL call

I have the following SQL statement that is VERY slow. It varies from 600-800ms!
I'm looking for possible ways to optimize it, but not sure exactly the best route. My database is fairly big, with the entries table having 400,000 rows and the devices table having 90,000 rows.
SQL Statement
SELECT devices.manufacturer, COUNT(devices.manufacturer) AS device_count
FROM entries
JOIN devices ON entries.device_id=devices.id
WHERE waypoint_id IN (1,2,3,5)
AND entries.updated_at >= '2013-06-20 21:01:40 -0400'
AND entries.updated_at <= '2013-06-27 21:01:40 -0400'
GROUP BY devices.manufacturer;
Is this SQL statement slow because I'm running it on poor hardware, or because the statement is bad, or have I not structured the table correctly? Any thoughts would be appreciated!
Goal of Statement
Get a list of all the device manufacturers, and the associated count of how many times that manufacturer showed up in the entries table.
Table Structure
Devices
id int(11) NOT NULL AUTO_INCREMENT,
mac_address varchar(255) DEFAULT NULL,
user_id int(11) DEFAULT NULL,
created_at datetime NOT NULL,
updated_at datetime NOT NULL,
manufacturer varchar(255) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY mac_address (mac_address),
KEY manufacturer (manufacturer)
) ENGINE=InnoDB AUTO_INCREMENT=839310 DEFAULT CHARSET=utf8;
Entries
id int(11) NOT NULL AUTO_INCREMENT,
device_id int(11) DEFAULT NULL,
created_at datetime NOT NULL,
updated_at datetime NOT NULL,
waypoint_id int(11) DEFAULT NULL,
unsure tinyint(1) DEFAULT '0',
PRIMARY KEY (id),
KEY device_index (device_id)
) ENGINE=InnoDB AUTO_INCREMENT=3389538 DEFAULT CHARSET=utf8;
Also: I have been looking into alternative databases. Considering this database is going to need very fast reads and writes in the future, would something like Redis be of use?
The query would run faster if you added a multiple-column index on entries(waypoint_id, updated_at).
Also, your query would read better like this:
SELECT
devices.manufacturer,
COUNT(devices.manufacturer) AS device_count
FROM
entries
JOIN
devices ON devices.id = entries.device_id
WHERE
entries.waypoint_id IN (1,2,3,5)
AND
entries.updated_at BETWEEN '2013-06-20 21:01:40 -0400' AND '2013-06-27 21:01:40 -0400'
GROUP BY
devices.manufacturer
P.S.: wouldn't it be a good thing to explicitly declare device_id as a foreign key?
You'll need an index on Entries {waypoint_id, updated_at}. This should satisfy:
WHERE waypoint_id IN (1,2,3,5)
AND entries.updated_at >= '2013-06-20 21:01:40 -0400'
AND entries.updated_at <= '2013-06-27 21:01:40 -0400';
Depending on actual cardinalities, you may or may not want to reverse the order of fields in this composite index.
Alternatively, create a covering index on Entries {waypoint_id, updated_at, device_id}, to avoid accessing the Entries table altogether.
On top of that, consider creating an index on Devices {id, manufacturer}. Hopefully, MySQL will be smart enough to use it to satisfy both JOIN and aggregation without even accessing the Devices table.
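As a sketch (index names are illustrative), the two covering indexes suggested above would be:

```sql
-- Covers the WHERE filter and also supplies device_id for the join,
-- so the Entries table itself need not be read.
ALTER TABLE entries ADD INDEX idx_wp_upd_dev (waypoint_id, updated_at, device_id);

-- Join column plus the aggregated column, so both the JOIN and the
-- GROUP BY can be satisfied from the index on Devices.
ALTER TABLE devices ADD INDEX idx_id_manu (id, manufacturer);
```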

mysql where + group by very slow

Here's a question that I should be able to answer myself, but I can't, and I haven't found an answer on Google either:
I have a table that contains 5 million rows with this structure:
CREATE TABLE IF NOT EXISTS `files_history2` (
`FILES_ID` int(10) unsigned DEFAULT NULL,
`DATE_FROM` date DEFAULT NULL,
`DATE_TO` date DEFAULT NULL,
`CAMPAIGN_ID` int(10) unsigned DEFAULT NULL,
`CAMPAIGN_STATUS_ID` int(10) unsigned DEFAULT NULL,
`ON_HOLD` decimal(1,0) DEFAULT NULL,
`DIVISION_ID` int(11) DEFAULT NULL,
KEY `DATE_FROM` (`DATE_FROM`),
KEY `FILES_ID` (`FILES_ID`),
KEY `CAMPAIGN_ID` (`CAMPAIGN_ID`),
KEY `CAMP_DATE` (`CAMPAIGN_ID`,`DATE_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When I execute
SELECT files_id, min( date_from )
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
the query sits in the "Sending data" state for more than eight hours (at which point I killed the process).
Here the explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE files_history2 ALL CAMPAIGN_ID,CAMP_DATE NULL NULL NULL 5073254 Using where; Using temporary; Using filesort
I assumed that I had created the necessary keys, but surely the query shouldn't take that long, should it?
I would suggest a different index: an index on (Files_ID, Date_From, Campaign_ID).
Since your GROUP BY is on Files_ID, you want those grouped first. Then the MIN(Date_From), so that goes in second position. Then finally the Campaign_ID to qualify for NOT NULL, and here's why...
If you put all your campaign IDs first, great, you get all the NULLs out of the way... but now you have 1,000 campaigns, and the Files_ID values span MANY campaigns and they also span many dates; you are going to choke.
With the index I'm proposing, by Files_ID first, each files_id is already ordered to match your GROUP BY. Then, within that, all the earliest dates are at the top of the indexed list... great, almost there. Then, by campaign ID: skip over whatever NULLs may be there and you are done, on to the next Files_ID.
Hope this makes sense -- unless you have TONS of entries with NULL-value campaigns.
Also, since all 3 parts of the index match the criteria and output columns of your query, it never has to go back to the raw data file; it gets everything from the index directly.
I'd create a covering index (CAMPAIGN_ID, files_id, date_from) and check the performance. I suspect your issue is due to the grouping and date_from not being able to use the same index.
CREATE INDEX your_index_name ON files_history2 (CAMPAIGN_ID, files_id, date_from);
If this works, you could drop the single-column index CAMPAIGN_ID, as it's covered by the composite index.
Well, the query is slow due to the aggregation (the MIN function) combined with the grouping.
One solution is to compute the aggregate in a derived table in the FROM clause and join back to it, which should be a lot faster than the approach you are using.
Try the following:
SELECT f.files_id, f1.datefrom
FROM files_history2 AS f
JOIN (
SELECT files_id, MIN(date_from) AS datefrom
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
) AS f1 ON f.files_id = f1.files_id AND f.date_from = f1.datefrom;
This should perform a lot better; if it doesn't, a temporary table would be the only way to go.

How could I optimise this MySQL query?

I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds. Perhaps acceptable now, but I'm worried about scalability as the table swells. The production system will likely have over 10,000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
Here are a few things you could try:
1 Use an INNER JOIN instead of the WHERE ... IN
SELECT *
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) X
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2 Have a variable and store the value of NOW() - I am not sure whether the DB engine optimizes this down to a single call to NOW(), but if it doesn't, this might help a bit
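A sketch of suggestion 2 (the variable name is illustrative): capture NOW() once in a user variable and reuse it, so the cutoff is evaluated a single time rather than potentially once per row.

```sql
-- Evaluate the cutoff exactly once, up front.
SET @cutoff := NOW();

SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < @cutoff
GROUP BY status_pupil_id;
```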
These are some suggestions however you will need to compare the query plans and see if there is any appreciable improvement or not.
Based on your usage of indexes as per the Query plan, robob's suggestion above could also come in handy
Find out how long the query takes when you load the system with 10,000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.