Alternative to GROUP BY to optimize SQL call - mysql

I have the following SQL statement that is VERY slow. It varies from 600-800ms!
I'm looking for possible ways to optimize it, but not sure exactly the best route. My database is fairly big, with the entries table having 400,000 rows and the devices table having 90,000 rows.
SQL Statement
SELECT devices.manufacturer, COUNT(devices.manufacturer) AS device_count
FROM entries
JOIN devices ON entries.device_id=devices.id
WHERE waypoint_id IN (1,2,3,5)
AND entries.updated_at >= '2013-06-20 21:01:40 -0400'
AND entries.updated_at <= '2013-06-27 21:01:40 -0400'
GROUP BY devices.manufacturer;
Is this SQL statement slow because I'm running it on poor hardware, or because the statement is bad, or have I not structured the table correctly? Any thoughts would be appreciated!
Goal of Statement
Get a list of all the device manufacturers, and the associated count of how many times that manufacturer showed up in the entries table.
Table Structure
Devices
CREATE TABLE devices (
id int(11) NOT NULL AUTO_INCREMENT,
mac_address varchar(255) DEFAULT NULL,
user_id int(11) DEFAULT NULL,
created_at datetime NOT NULL,
updated_at datetime NOT NULL,
manufacturer varchar(255) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY mac_address (mac_address),
KEY manufacturer (manufacturer)
) ENGINE=InnoDB AUTO_INCREMENT=839310 DEFAULT CHARSET=utf8;
Entries
CREATE TABLE entries (
id int(11) NOT NULL AUTO_INCREMENT,
device_id int(11) DEFAULT NULL,
created_at datetime NOT NULL,
updated_at datetime NOT NULL,
waypoint_id int(11) DEFAULT NULL,
unsure tinyint(1) DEFAULT '0',
PRIMARY KEY (id),
KEY device_index (device_id)
) ENGINE=InnoDB AUTO_INCREMENT=3389538 DEFAULT CHARSET=utf8;
Also, I have been looking into alternate databases. Considering this database is going to need very fast reads/writes in the future, would something like Redis be of use?

The query would run faster if you added a multiple-column index on entries(waypoint_id, updated_at).
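For example (index name is mine):
ALTER TABLE entries ADD INDEX waypoint_updated_idx (waypoint_id, updated_at);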
Also, your query would look better like this:
SELECT
devices.manufacturer,
COUNT(devices.manufacturer) AS device_count
FROM
entries
JOIN
devices ON devices.id = entries.device_id
WHERE
entries.waypoint_id IN (1,2,3,5)
AND
entries.updated_at BETWEEN '2013-06-20 21:01:40 -0400' AND '2013-06-27 21:01:40 -0400'
GROUP BY
devices.manufacturer
P.S.: wouldn't it be a good thing to explicitly declare device_id as a foreign key?
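Something like this would do it, assuming every entries.device_id already points at an existing devices row (constraint name is mine):
ALTER TABLE entries
ADD CONSTRAINT entries_device_fk
FOREIGN KEY (device_id) REFERENCES devices (id);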

You'll need an index on Entries {waypoint_id, updated_at}. This should satisfy:
WHERE waypoint_id IN (1,2,3,5)
AND entries.updated_at >= '2013-06-20 21:01:40 -0400'
AND entries.updated_at <= '2013-06-27 21:01:40 -0400';
Depending on actual cardinalities, you may or may not want to reverse the order of fields in this composite index.
Alternatively, create a covering index on Entries {waypoint_id, updated_at, device_id}, to avoid accessing the Entries table altogether.
On top of that, consider creating an index on Devices {id, manufacturer}. Hopefully, MySQL will be smart enough to use it to satisfy both JOIN and aggregation without even accessing the Devices table.
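In DDL terms, those two suggestions might look like this (index names are mine):
ALTER TABLE entries ADD INDEX waypoint_updated_device_idx (waypoint_id, updated_at, device_id);
ALTER TABLE devices ADD INDEX id_manufacturer_idx (id, manufacturer);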

MySQL Query/Table in need of optimization

I have a query that is taking an embarrassingly long time. ~7 minutes embarrassing. I would really appreciate some help. Missing indexes? Rewrite the query? All of the above?
Many thanks
mysql Ver 14.14 Distrib 5.7.25, for Linux (x86_64)
The query looks like:
SELECT COUNT(*) AS count_all, name
FROM api_events ae
INNER JOIN products p on p.token=ae.product_token
WHERE (ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY name
Here are the two table definitions
api_events has ~31 million records
CREATE TABLE `api_events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`api_name` varchar(200) NOT NULL,
`hostname` varchar(200) NOT NULL,
`controller_action` varchar(2000) NOT NULL,
`duration` decimal(12,5) NOT NULL DEFAULT '0.00000',
`view` decimal(12,5) NOT NULL DEFAULT '0.00000',
`db` decimal(12,5) NOT NULL DEFAULT '0.00000',
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`product_token` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `product_token` (`product_token`)
) ENGINE=InnoDB AUTO_INCREMENT=64851218 DEFAULT CHARSET=latin1;
and
products has only 12 records
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`code` varchar(30) NOT NULL,
`name` varchar(100) NOT NULL,
`description` varchar(2000) NOT NULL,
`token` varchar(50) NOT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=latin1;
You could improve the join performance by adding these indexes:
create index idx1 on api_events(product_token, created_at);
create index idx2 on products(token);
You could also try inverting the columns for api_events:
create index idx1 on api_events(created_at, product_token);
and try adding redundancy to the products index:
create index idx2 on products(token, name);
For the query as stated, you need:
api_events: INDEX(created_at, product_token)
products: INDEX(token, name)
Because the WHERE mentions api_events, the Optimizer is likely to start with that table. created_at is in the WHERE, so the index starts with that, even though starting with a 'range' is usually wrong. In this case, the pair is "covering".
Then, INDEX(token, name) is also "covering".
"Covering" indexes give a small, but widely varying, amount of performance improvement.
What happens if you group by the token instead of the name?
SELECT ae.product_token, COUNT(*) AS count_all
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP BY ae.product_token;
For this query, an index on api_events(created_at, product_token) will probably help.
If this is faster, then you can bring in the name using a subquery.
It seems like the criterion on created_at is very selective (looking at only the past 7 days?). That's crying out for an index with created_at as the leading column.
The query is also referencing the product_token column from the same table, so we can include that column in the index, to make it a covering index.
CREATE INDEX api_events_IX ON api_events (created_at, product_token);
Using that index, we can probably avoid looking at the vast majority of the 31 million rows, and quickly narrow in on the subset of rows we actually need to look at.
Using the index, the query will still need a "Using filesort" operation to satisfy the GROUP BY.
(My guess here is that the join to the 12 rows in product doesn't exclude a lot of rows... that on the vast majority of rows in api_event the product_token refers to a row that exists in product.)
Use MySQL EXPLAIN to see the query execution plan.
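For example, prefixing the original query:
EXPLAIN SELECT COUNT(*) AS count_all, name
FROM api_events ae
INNER JOIN products p ON p.token = ae.product_token
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP BY name;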
A further possible refinement (to test the performance of) would be to do some of the aggregation in an inline view:
SELECT SUM(s.count_all) AS count_all
, p.name
FROM ( SELECT COUNT(*) AS count_all
, ae.product_token
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP
BY ae.product_token
) s
JOIN products p
ON p.token = s.product_token
GROUP
BY p.name
If the assumption about product_token is misinformed, if there are lots of rows in api_event that have product_token values that don't reference a row in product ... we might take a different tack ...

Flat MySQL table with enum-based filters is unexpectedly slow

I have a site where there is an activity feed, similar to how social sites like Facebook have one. It is a "newest first" list that describes actions taken by users. In production, there's about 200k entries in that table.
Since this is going to be asked anyway, I'll first share the full table structure:
CREATE TABLE `karmalog` (
`id` int(11) NOT NULL auto_increment,
`guid` char(36) default NULL,
`user_id` int(11) default NULL,
`user_name` varchar(45) default NULL,
`user_avat_url` varchar(255) default NULL,
`user_sec_id` int(11) default NULL,
`user_sec_name` varchar(45) default NULL,
`user_sec_avat_url` varchar(255) default NULL,
`event` enum('EDIT_PROFILE','EDIT_AVATAR','EDIT_EMAIL','EDIT_PASSWORD','FAV_IMG_ADD','FAV_IMG_ADDED','FAV_IMG_REMOVE','FAV_IMG_REMOVED','FOLLOW','FOLLOWED','UNFOLLOW','UNFOLLOWED','COM_POSTED','COM_POST','COM_VOTE','COM_VOTED','IMG_VOTED','IMG_UPLOAD','LIST_CREATE','LIST_DELETE','LIST_ADMINDELETE','LIST_VOTE','LIST_VOTED','IMG_UPD','IMG_RESTORE','IMG_UPD_LIC','IMG_UPD_MOD','IMG_GEO','IMG_UPD_MODERATED','IMG_VOTE','IMG_VOTED','TAG_FAV_ADD','CLASS_DOWN','CLASS_UP','IMG_DELETE','IMG_ADMINDELETE','IMG_ADMINDELETEFAV','SET_PASSWORD','IMG_RESTORED','IMG_VIEW','FORUM_CREATE','FORUM_DELETE','FORUM_ADMINDELETE','FORUM_REPLY','FORUM_DELETEREPLY','FORUM_ADMINDELETEREPLY','FORUM_SUBSCRIBE','FORUM_UNSUBSCRIBE','TAG_INFO_EDITED','IMG_ADDSPECIE','IMG_REMOVESPECIE','SPECIE_ADDVIDEO','SPECIE_REMOVEVIDEO','EARN_MEDAL','JOIN') NOT NULL,
`event_type` enum('follow','tag','image','class','list','forum','specie','medal','user') NOT NULL,
`active` bit(1) NOT NULL,
`delete` bit(1) NOT NULL default '\0',
`object_id` int(11) default NULL,
`object_cache` text,
`object_sec_id` int(11) default NULL,
`object_sec_cache` text,
`karma_delta` int(11) NOT NULL,
`gold_delta` int(11) NOT NULL,
`newkarma` int(11) NOT NULL,
`newgold` int(11) NOT NULL,
`migrated` int(11) NOT NULL default '0',
`date_created` timestamp NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `user_sec_id` (`user_sec_id`),
KEY `image_id` (`object_id`),
KEY `date_event` (`date_created`,`event`),
KEY `event` (`event`),
KEY `date_created` (`date_created`),
CONSTRAINT `karmalog_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL,
CONSTRAINT `karmalog_ibfk_2` FOREIGN KEY (`user_sec_id`) REFERENCES `user` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Before optimizing this table, my query had 5 joins and I ran into slow query times. I have denormalized all of that data, so that not a single join is left. So the table and query are flat.
As you can see in the table design, there's an "event" field which is an enum, holding a few dozen possible values. Throughout the site, I show activity feeds based on specific event types. Typically that query looks like this:
SELECT * FROM karmalog as k
WHERE k.event IN ($events) AND k.delete=0
ORDER BY k.date_created DESC, k.id DESC
LIMIT 0,30
What this query does is to find the latest 30 entries in the total set that match any of the events passed in $events, which can be multiple.
Due to removing the joins and having indices on most fields, I was expecting this to perform very well, but it doesn't. On 200k entries, it still takes over 3 seconds and I don't understand why.
Regarding solutions, I know I could archive older entries or partition the table per event type, but that would have quite a code impact, and I'd first like to understand why the above is so slow.
As a temporary work-around, I'm now doing this:
SELECT * FROM
(SELECT * FROM karmalog ORDER BY date_created DESC, id DESC LIMIT 0,1000) as karma
WHERE karma.event IN ($events) AND karma.delete=0
LIMIT $page,$pagesize
What this does is limit the base set to search to the latest 1000 entries only, hoping and guessing that there are 30 entries to be found for the filters I pass in. It's not very robust though. It will not work for rarer events, and it brings pagination issues.
Therefore, I'd first like to get to the root cause of why my initial query is slow, against my expectation.
Edit: I was asked to share the execution plan. Here's the test query:
EXPLAIN SELECT * FROM karmalog
WHERE event IN ('FAV_IMG_ADD','FOLLOW','COM_POST','IMG_VOTE','LIST_VOTE','JOIN','CLASS_UP','LIST_CREATE','FORUM_REPLY','FORUM_CREATE','FORUM_SUBSCRIBE','IMG_GEO','IMG_ADDSPECIE','SPECIE_ADDVIDEO','EARN_MEDAL') AND karmalog.delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,36
Execution plan:
id = 1
select_type = SIMPLE
table = karmalog
type = range
possible_keys = event
key = event
key_len = 1
ref = NULL
rows = 80519
Extra = Using where; Using filesort
I'm not sure how to read the above, but I do know that the sort clause really seems to kill this query. With the sort it takes 4.3 secs; without it, 0.03 secs.
SELECT * sometimes slows down ordered queries by a huge amount, so let's start by refactoring your query as follows:
SELECT k.*
FROM karmalog AS k
JOIN (
SELECT id
FROM karmalog
WHERE event IN ($events)
AND `delete`=0
ORDER BY date_created DESC, id DESC
LIMIT 0,30
) AS m ON k.id = m.id
ORDER BY k.date_created DESC, k.id DESC
This will do your ORDER BY ... LIMIT operation without having to haul the whole table around in the sorting phase. Finally it will look up the appropriate thirty rows from the original table and sort just those again. This might save a whole lot of I/O and in-memory data shuffling.
Second, if id column values are assigned in ascending order as records are inserted, then the use of date_created in your ORDER BY operation is redundant. But MySQL doesn't know that, so leaving it out might help. This will be true if you always use the current date when inserting, and never update the dates.
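For instance, the inner query above could be reduced to this (assuming id order always matches date_created order):
SELECT id
FROM karmalog
WHERE event IN ($events)
AND `delete`=0
ORDER BY id DESC
LIMIT 0,30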
Third, you might be able to use a compound covering index for the selection (inner) query. This is an index that contains all the fields you need. When you use a covering index, the whole query can be satisfied from the index, and there's no need to bounce back to the original table. This saves disk access time.
Try this compound covering index: (delete, event, id). If you decide you can't get rid of the use of date_created in your ordering, try this instead: (delete, event, date_created, id).
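In DDL terms (index names are mine; delete needs backticks because it is a reserved word):
ALTER TABLE karmalog ADD INDEX delete_event_id_idx (`delete`, event, id);
-- or, if date_created stays in the ORDER BY:
ALTER TABLE karmalog ADD INDEX delete_event_date_id_idx (`delete`, event, date_created, id);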
Add a compound index over the two relevant columns. In your table, you can do that by specifying e.g.
KEY `date_created` (`date_created`, `event`)
This key can still be used to satisfy plain old date_created range searching. But in addition to that, the event data is included as well, so the DBS will be able to detect the relevant rows by only looking at the index.
If you want, you can try the other order as well: first event and then date. This might allow some optimization if there are many event types but your filter only contains few. On the other hand, I'm not sure the system will be able to make use of the LIMIT clause in this case, so I'm not certain that this other order will be any help at all.
Edit: I completely missed that your date_event index already has this info. According to your execution plan, though, that one isn't used. Looks like the optimizer is getting things wrong. You could try removing the event index, and perhaps the date index as well, and see what happens then.
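A non-destructive way to test that theory is an index hint, which tells the optimizer to skip the suspect indexes for a single query:
SELECT * FROM karmalog IGNORE INDEX (event, date_created)
WHERE event IN ($events) AND karmalog.delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,30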

Optimization of a MySQL view

I want to join two MySQL tables and store the result as a view, so I can address this view in an application instead of querying two tables. But this view turns out to be extremely slow.
These are my tables:
CREATE TABLE spectrumsets (
setid INT(11) NOT NULL,
timestampdt INT(11) NULL DEFAULT NULL,
timestampd INT(10) UNSIGNED NOT NULL,
timestampt INT(10) UNSIGNED NOT NULL,
device INT(11) NOT NULL,
methodname VARCHAR(50) NOT NULL,
PRIMARY KEY (setid),
UNIQUE INDEX setid_idx (setid),
UNIQUE INDEX timestamp_device_idx (timestampd, timestampt, device),
INDEX device_fk (device),
INDEX timestampd_idx (timestampd),
CONSTRAINT device_fk FOREIGN KEY (device)
REFERENCES spectrumdevices (deviceid)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
CREATE TABLE spectrumdata (
valueid INT(11) NOT NULL AUTO_INCREMENT,
spectrumset INT(11) NOT NULL,
wavelength DOUBLE NULL DEFAULT NULL,
intensity DOUBLE NULL DEFAULT NULL,
PRIMARY KEY (valueid),
INDEX spectrumset_idx (spectrumset),
CONSTRAINT spectrumset_fk FOREIGN KEY (spectrumset)
REFERENCES spectrumsets (setid)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
And this is my view:
SELECT spectrumsets.timestampd,spectrumsets.timestampt,spectrumsets.device,
spectrumdata.wavelength,spectrumdata.intensity
FROM spectrumdata INNER JOIN spectrumsets ON spectrumdata.spectrumset=
spectrumsets.setid
WHERE spectrumdata.wavelength>0
ORDER BY spectrumsets.timestampd,spectrumsets.timestampt,spectrumsets.device,
spectrumdata.wavelength
A SELECT COUNT(*) on my machine takes 385.516 seconds and returns 82,923,705 records, so it's a rather large dataset.
I already found this link but still don't fully understand what's wrong.
UPDATE:
EXPLAIN gives this results:
"id","select_type","table","type","possible_keys","key","key_len","ref","rows","Extra"
"1","SIMPLE","spectrumsets","index","PRIMARY,setid_idx","timestamp_device_idx","12",NULL,"327177","Using index; Using temporary; Using filesort"
"1","SIMPLE","spectrumdata","ref","spectrumset_idx","spectrumset_idx","4","primprod.spectrumsets.setid","130","Using where"
Explain suggests that the query is hitting the indices for the join (which is good), but then using a temporary table and file sort for the rest of the query.
This is for two reasons:
the where clause isn't hitting the index
the order by clause isn't hitting the index
In a comment, you say that removing the where clause has led to a big improvement; that suggests you need a compound index on spectrumset, wavelength, assuming wavelength has a decent number of possible values (if it's just 10 values, an index may not do anything).
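For example (index name is mine):
ALTER TABLE spectrumdata ADD INDEX spectrumset_wavelength_idx (spectrumset, wavelength);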
If you leave the "order by" clause out of your view, it should go a lot faster - and there's a good case for letting sort order be determined by the query extracting data, not the view. I'm guessing most queries will be very selective about the data - limiting to a few timestamps; by embedding the order by in the view, you pay the price for sorting every time.
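A sketch of that split, with a hypothetical view name and the sort moved to the consuming query:
CREATE OR REPLACE VIEW spectrum_view AS
SELECT spectrumsets.timestampd, spectrumsets.timestampt, spectrumsets.device,
spectrumdata.wavelength, spectrumdata.intensity
FROM spectrumdata INNER JOIN spectrumsets ON spectrumdata.spectrumset = spectrumsets.setid
WHERE spectrumdata.wavelength > 0;

-- the application then sorts (and ideally filters) when it reads:
SELECT * FROM spectrum_view
ORDER BY timestampd, timestampt, device, wavelength;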
If you really must have the "order by" in the view, create an index that includes all fields in the order of the "order by", with the join at the front. For instance:
UNIQUE INDEX timestamp_device_idx (setid, timestampd, timestampt, device),

MySQL group-by very slow

I have the following SQL query:
SELECT CustomerID FROM sales WHERE `Date` <= '2012-01-01' GROUP BY CustomerID
The query is executed over 11,400,000 rows and runs very slowly. It takes over 3 minutes to execute. If I remove the GROUP BY part, it runs in under 1 second. Why is that?
MySQL Server version is '5.0.21-community-nt'
Here is the table schema:
CREATE TABLE `sales` (
`ID` int(11) NOT NULL auto_increment,
`DocNo` int(11) default '0',
`CustomerID` int(11) default '0',
`OperatorID` int(11) default '0',
PRIMARY KEY (`ID`),
KEY `ID` (`ID`),
KEY `DocNo` (`DocNo`),
KEY `CustomerID` (`CustomerID`),
KEY `Date` (`Date`)
) ENGINE=MyISAM AUTO_INCREMENT=14946509 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Try putting an index on (Date,CustomerID).
Have a look at the MySQL manual for optimizing GROUP BY queries: Group by optimization.
You can find out how mysql is generating the result if you use EXPLAIN as follows:-
EXPLAIN SELECT CustomerID FROM sales WHERE `Date` <= '2012-01-01' GROUP BY CustomerID
This will tell you which indexes (if any) mysql is using to optimize the query. This is very handy when learning which indexes work for which queries as you can try creating an index and see if mysql uses it. So even if you don't fully understand how mysql calculates aggregate queries you can create a useful index by trial and error.
Without knowing what your table schema looks like, it's difficult to be certain, but it would probably help if you added a multiple-column index on Date and CustomerID. That'd save MySQL the hassle of doing a full table scan for the GROUP BY statement. So try ALTER TABLE sales ADD INDEX (Date,CustomerID).
Try this one:
SELECT distinct CustomerID FROM sales WHERE `Date` <= '2012-01-01'
I had the same problem. I changed the key fields to the same collation and that fixed the problem. The fields used to join the tables had different collation values.
Wouldn't this one be a lot faster and achieve the same?
SELECT DISTINCT CustomerID FROM sales WHERE `Date` <= '2012-01-01'
Make sure to place an index on Date, of course. I'm not entirely sure but indexing CustomerID might also help.

How could I optimise this MySQL query?

I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. Perhaps acceptable now, but I'm worried about scalability as the table swells. It is likely that the production system will have over 10,000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
Here are a few things you could try.
1. Use an INNER JOIN instead of the WHERE ... IN subquery:
SELECT *
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) x
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Have a variable and store the value for NOW() - I am not sure if the DB engine optimizes this call to NOW() as just one call, but if it doesn't, then this might help a bit.
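A sketch of suggestion 2, using a user variable for the cutoff:
SET @now := NOW();
SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < @now
GROUP BY status_pupil_id;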
These are some suggestions; however, you will need to compare the query plans and see if there is any appreciable improvement or not.
Based on your usage of indexes as per the query plan, robob's suggestion above could also come in handy.
Find out how long the query takes when you load the system with 10,000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.