Optimize SQL subquery for statistics - mysql

I created a simple statistics tool for our user PCs. Every 5 minutes it records the state of each PC, and a little frontend renders a usage chart.
Now that the data is growing, the SQL queries are getting slower and slower, and I'm looking for a way to optimize them.
This is the structure. As you can see, the table "usage" contains about 6 million records and uses MySQL InnoDB:
CREATE TABLE IF NOT EXISTS `usage` (
`id` int(11) unsigned NOT NULL,
`host_id` int(10) unsigned NOT NULL,
`time` int(10) unsigned NOT NULL,
`state` enum('LinuxTU','LinuxExt','View','Browser','Idle','Offline') CHARACTER SET latin1 NOT NULL DEFAULT 'Offline'
) ENGINE=InnoDB AUTO_INCREMENT=5963366 DEFAULT CHARSET=utf8;
ALTER TABLE `usage`
ADD PRIMARY KEY (`id`), ADD KEY `host_id` (`host_id`), ADD KEY `time` (`time`);
ALTER TABLE `usage`
MODIFY `id` int(11) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=5963366;
The following query takes about 7 seconds to execute. It is the query that produces the data for the chart:
/* create pivot table */
SELECT `time`,
SUM(IF(state='LinuxTU', statecount, 0)) AS LinuxTU,
SUM(IF(state='LinuxExt', statecount, 0)) AS LinuxExt,
SUM(IF(state='View', statecount, 0)) AS View,
SUM(IF(state='Browser', statecount, 0)) AS Browser
FROM (
/* get data from last 24h grouped by state */
SELECT `time`, `state`, COUNT(`state`) statecount
FROM `usage` u
/* group by time to get every 5 minutes
group by state to get the state counter */
GROUP BY `time`, `state`
HAVING `time` > 1441271078 AND `time` < 1441357478
) AS s
GROUP BY `time`
ORDER BY `time` ASC
I don't know how to optimize it. Is there something I missed? Or do I need to reorganize the structure? Any hints?

In addition to moving the time comparison into a WHERE clause, you can get rid of the subquery entirely. MySQL evaluates a boolean expression as 1 or 0, so SUM(state = 'LinuxTU') counts the rows where the condition holds:
/* create pivot table */
SELECT `time`,
SUM(state = 'LinuxTU') AS LinuxTU,
SUM(state = 'LinuxExt') AS LinuxExt,
SUM(state = 'View') AS View,
SUM(state = 'Browser') AS Browser
FROM usage u
WHERE `time` > 1441271078 AND `time` < 1441357478
GROUP BY `time`
ORDER BY `time` ASC;
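If that is still not fast enough, a composite index covering both columns the query touches may help. This is only a sketch (the index name is illustrative), and whether it pays off depends on your data:
ALTER TABLE `usage` ADD INDEX time_state (`time`, `state`);
Since the query reads nothing but `time` and `state`, such an index can satisfy it without touching the table rows (a "covering" index).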

I think your problem is the final
GROUP BY `time`
ORDER BY `time` ASC
Because of the subquery, your indexes can no longer be used, so you should find a way to eliminate it.
Do you also have the option of doing some processing in the application code? You could run just the inner SELECT (plus the outer columns, without the SUMs), add the ORDER BY, and do the pivoting in your programming language. Or must this be a single query?

I've found the bottleneck: the inner query. HAVING is much slower than WHERE here, because HAVING is applied after the whole table has been grouped, while WHERE filters rows first and can use the index on `time`. I tried some different queries and got this result:
Takes 7 seconds:
SELECT `time`, `state`, COUNT(`state`) statecount
FROM `usage` u
GROUP BY `time`, `state`
HAVING `time` > 1441271078 AND `time` < 1441357478
Takes 0.1 seconds:
SELECT `time`, `state`, COUNT(`state`) `statecount`
FROM `usage` u
WHERE `time` > 1441271078 AND `time` < 1441357478
GROUP BY `time`, `state`
And it gives me the same result. The frontend is now much faster.
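You can confirm what changed with EXPLAIN. Assuming the existing `time` key, the WHERE version should show a range scan over that index, while the HAVING version has to scan and group the entire table first:
EXPLAIN SELECT `time`, `state`, COUNT(`state`) AS `statecount`
FROM `usage` u
WHERE `time` > 1441271078 AND `time` < 1441357478
GROUP BY `time`, `state`;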

Related

Optimize a query

How can I make my response time faster? The average response time is currently about 0.2 s (8039 records in my items table and 81 records in my tracking table).
Query
SELECT a.name, b.cnt
FROM `items` a
LEFT JOIN (
    SELECT guid, COUNT(*) cnt
    FROM tracking
    WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day)
    GROUP BY guid
) b ON a.`id` = b.guid
WHERE a.`type` = 'streaming' AND a.`state` = 1
ORDER BY b.cnt DESC LIMIT 15 OFFSET 75
Tracking table structure
CREATE TABLE `tracking` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`ip` int(11) NOT NULL,
`date` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `i1` (`ip`,`guid`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=4303 DEFAULT CHARSET=latin1;
Items table structure
CREATE TABLE `items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`type` varchar(255) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`embed` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`description` text,
`tags` varchar(255) DEFAULT NULL,
`date` int(11) DEFAULT NULL,
`vote_val_total` float DEFAULT '0',
`vote_total` float(11,0) DEFAULT '0',
`rate` float DEFAULT '0',
`icon` text CHARACTER SET ascii,
`state` int(11) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9258 DEFAULT CHARSET=latin1;
Your query, as written, doesn't make much sense. It produces all possible combinations of rows in your two tables and then groups them.
You may want this:
SELECT a.*, b.cnt
FROM `items` a
LEFT JOIN (
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE `date` > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day)
GROUP BY guid
) b ON a.guid = b.guid
ORDER BY b.cnt DESC
The high-volume data in this query comes from the relatively large tracking table. So you should add a compound index to it on the columns (date, guid). This will allow your query to random-access the index by date and then scan it for guid values.
ALTER TABLE tracking ADD INDEX guid_summary (`date`, guid);
I suppose you'll see a nice performance improvement.
Pro tip: Don't use SELECT *. Instead, give a list of the columns you want in your result set. For example,
SELECT a.guid, a.name, a.description, b.cnt
Why is this important?
First, it makes your software more resilient against somebody adding columns to your tables in the future.
Second, it tells the MySQL server to sling around only the information you want. That can improve performance really dramatically, especially when your tables get big.
Since tracking has significantly fewer rows than items, I will propose the following.
SELECT i.name, c.cnt
FROM
(
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day )
GROUP BY guid
) AS c
JOIN items AS i ON i.id = c.guid
WHERE i.type = 'streaming'
AND i.state = 1
ORDER BY c.cnt DESC
LIMIT 15 OFFSET 75;
It will fail to display any items for which cnt is 0. (Your version displays the items with NULL for the count.)
Composite indexes needed:
items: The PRIMARY KEY(id) is sufficient.
tracking: INDEX(date, guid) -- "covering"
Other issues:
If ip is an IP-address, it needs to be INT UNSIGNED. But that covers only IPv4, not IPv6.
It seems like date is not just a "date", but really a date+time. Please rename it to avoid confusion.
float(11,0) -- Don't use FLOAT for integers. Don't use (m,n) on FLOAT or DOUBLE. INT UNSIGNED makes more sense here.
OFFSET is naughty when it comes to performance -- the server must scan over all the skipped rows. And in your query there is no way to avoid collecting all the candidate rows, sorting them, stepping over 75, and only then delivering 15 rows. (And, with no more than 81 tracking rows, you won't even get a full 15.)
Which MySQL version are you using? There have been important changes to the optimization of LEFT JOIN ( SELECT ... ). Please provide EXPLAIN SELECT output for each query under discussion.

MySQL big table query optimization

I have a chat application with an API that returns the list of users the current user has talked to. But MySQL takes a long time to return the message list once the table reaches 100,000 rows.
This is my messages table
CREATE TABLE IF NOT EXISTS `messages` (
`_id` int(11) NOT NULL AUTO_INCREMENT,
`fromid` int(11) NOT NULL,
`toid` int(11) NOT NULL,
`message` text NOT NULL,
`attachments` text NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '0',
`date` datetime NOT NULL,
`delete` varchar(50) NOT NULL,
`uuid_read` varchar(250) NOT NULL,
PRIMARY KEY (`_id`),
KEY `fromid` (`fromid`,`toid`,`status`,`delete`,`uuid_read`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=118561 ;
and this is my users table (simplified)
CREATE TABLE IF NOT EXISTS `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`login` varchar(50) DEFAULT '',
`sex` tinyint(1) DEFAULT '0',
`status` varchar(255) DEFAULT '',
`avatar` varchar(30) DEFAULT '0',
`last_active` datetime DEFAULT NULL,
`active` tinyint(1) DEFAULT '1',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=15523 ;
And here is my query (for user with id 1930)
select SQL_CALC_FOUND_ROWS `u_id`, `id`, `login`, `sex`, `birthdate`, `avatar`, `online_status`, SUM(`count`) as `count`, SUM(`nr_count`) as `nr_count`, `date`, `last_mesg`
from (
(select `m`.`fromid` as `u_id`, `u`.`id`, `u`.`login`, `u`.`sex`, `u`.`birthdate`, `u`.`avatar`, `u`.`last_active` as online_status, COUNT(`m`.`_id`) as `count`, (COUNT(`m`.`_id`)-SUM(`m`.`status`)) as `nr_count`, `tm`.`date` as `date`, `tm`.`message` as `last_mesg`
from `messages` as m
inner join `messages` as tm on `tm`.`_id` = (select MAX(`_id`) from `messages` as `tmz` where `tmz`.`fromid` = `m`.`fromid`)
left join `users` as u on `u`.`id` = `m`.`fromid`
where `m`.`toid` = 1930 and `m`.`delete` not like '%1930;%'
group by `u`.`id`)
UNION
(select `m`.`toid` as `u_id`, `u`.`id`, `u`.`login`, `u`.`sex`, `u`.`birthdate`, `u`.`avatar`, `u`.`last_active` as online_status, COUNT(`m`.`_id`) as `count`, 0 as `nr_count`, `tm`.`date` as `date`, `tm`.`message` as `last_mesg`
from `messages` as m
inner join `messages` as tm on `tm`.`_id` = (select MAX(`_id`) from `messages` as `tmz` where `tmz`.`toid` = `m`.`toid`)
left join `users` as u on `u`.`id` = `m`.`toid`
where `m`.`fromid` = 1930 and `m`.`delete` not like '%1930;%'
group by `u`.`id`)
order by `date` desc
) as `f`
group by `u_id`
order by `date` desc
limit 0, 10
Please help me optimize this query. What I need:
Who the user talked to (name, sex, etc.)
What the last message was (from me or to me)
Count of all messages
Count of unread messages (only those sent to me)
The query returns the correct result, but it takes too long.
You have some design problems in your query and database.
You should avoid reserved words as column names, such as the delete column or the count alias;
You should avoid selecting columns that are not in the GROUP BY and not wrapped in an aggregate function; although MySQL allows this, it is not standard SQL and you have no control over which row's value is returned;
Your not like construction can misbehave: '%1930;%' also matches '11930;', and 11930 is not equal to 1930;
You should avoid like patterns that both start and end with the % wildcard; they cannot use an index, so the text scan takes longer;
You should design a better way to represent a message deletion, probably a proper flag and/or another table holding any important data related to the action (see the sketch after this list);
Try to limit your result before the join conditions (with a derived table) to perform less processing;
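For example, a minimal sketch of a separate deletion table; the names here are illustrative, not part of the original schema:
CREATE TABLE message_deletions (
  message_id int NOT NULL,
  user_id int NOT NULL,
  PRIMARY KEY (message_id, user_id)
);

-- a message is hidden for user 1930 when a row exists here,
-- replacing the `delete` NOT LIKE '%1930;%' string test:
SELECT m.*
FROM messages m
WHERE NOT EXISTS (
  SELECT 1
  FROM message_deletions d
  WHERE d.message_id = m._id
    AND d.user_id = 1930
);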
I tried to rewrite your query as best I understood it. I executed my version against a messages table with ~200,000 rows and no indexes and it ran in 0.15 seconds. But you should certainly create the right indexes to help it keep performing well as the amount of data increases.
SELECT SQL_CALC_FOUND_ROWS
u.id,
u.login,
u.sex,
u.birthdate,
u.avatar,
u.last_active AS online_status,
g._count,
CASE WHEN m.toid = 1930
THEN g.nr_count
ELSE 0
END AS nr_count,
m.`date`,
m.message AS last_mesg
FROM
(
SELECT
MAX(_id) AS _id,
COUNT(*) AS _count,
COUNT(*) - SUM(m.status) AS nr_count
FROM messages m
WHERE 1=1
AND m.`delete` NOT LIKE '%1930;%'
AND
(0=1
OR m.fromid = 1930
OR m.toid = 1930
)
GROUP BY
CASE WHEN m.fromid = 1930
THEN m.toid
ELSE m.fromid
END
ORDER BY MAX(`date`) DESC
LIMIT 0, 10
) g
INNER JOIN messages AS m ON 1=1
AND m._id = g._id
LEFT JOIN users AS u ON 0=1
OR (m.fromid <> 1930 AND u.id = m.fromid)
OR (m.toid <> 1930 AND u.id = m.toid)
ORDER BY m.`date` DESC
;
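As for those indexes, something along these lines is a plausible starting point, based on the WHERE and GROUP BY columns above; treat it as an assumption to verify with EXPLAIN rather than a tested recommendation:
ALTER TABLE messages
  ADD INDEX fromid_date (fromid, `date`),
  ADD INDEX toid_date (toid, `date`);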

SQL VIEW simplification/solution for faster queries

I'm trying to break down and rewrite a view that was created by a long-gone developer. The query takes well over three minutes to run, I'm assuming because of all the CONCATs.
CREATE VIEW `active_users_over_time` AS
select
`users_activity`.`date` AS `date`,
time_format(
addtime(
concat(`users_activity`.`date`,' ',`users_activity`.`time`),
concat('0 ',sec_to_time(`users_activity`.`duration_checkout`),'.0')
),'%H:%i:%s') AS `time`,
`users_activity`.`username` AS `username`,
count(addtime(concat(`users_activity`.`date`,' ',`users_activity`.`time`),
concat('0 ',sec_to_time(`users_activity`.`duration_checkout`),'.0'))) AS `checkouts`
from `users_activity`
group by
concat(
addtime(
concat(`users_activity`.`date`,' ',`users_activity`.`time`),
concat('0 ',sec_to_time(`users_activity`.`duration_checkout`),'.0')
),
`users_activity`.`username`);
The data comes from the SQL table:
CREATE TABLE `users_activity` (
`id` int(10) unsigned NOT NULL auto_increment,
`featureid` smallint(5) unsigned NOT NULL,
`date` date NOT NULL,
`time` time NOT NULL,
`duration_checkout` int unsigned NOT NULL,
`update_date` date NOT NULL,
`username` varchar(255) NOT NULL,
`checkout` smallint(5) unsigned NOT NULL,
`licid` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `featureid_licid_username` (`featureid`,`licid`,`date`,`time`,`username`),
FOREIGN KEY(featureid) REFERENCES features(id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I'm having a hard time deciphering exactly what is needed and what isn't.
Anyone have any ideas? Thanks.
I think this does everything that the original query did, skipping a bunch of redundant steps:
select `date`
, `time`
, `username`
, count(1) as `checkouts`
from
(
select
`users_activity`.`date` AS `date`
,time_format(
addtime(`users_activity`.`date`,`users_activity`.`time`)
+ interval `users_activity`.`duration_checkout` second
,'%H:%i:%s'
) AS `time`
,`users_activity`.`username` AS `username`
from `users_activity`
) x
group by `username`, `date`, `time`
You may also want to look at what indexes are on the table to see if optimisations can be made elsewhere (e.g. if you don't already have an index on the username and date fields you'd get a lot of benefit for this query by adding one).
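For instance, a sketch of such an index (the name is illustrative):
ALTER TABLE `users_activity` ADD INDEX username_date (`username`, `date`);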
You can start by rewriting the GROUP BY clause from this:
group by
concat(
addtime(
concat(`users_activity`.`date`,' ',`users_activity`.`time`),
concat('0 ',sec_to_time(`users_activity`.`duration_checkout`),'.0')
),
`users_activity`.`username`);
to this one:
GROUP BY `users_activity`.`date`,
`users_activity`.`time`,
`users_activity`.`duration_checkout`,
`users_activity`.`username`
This change should give some slight savings on converting dates to strings and concatenating them, and the result of the query shouldn't change.
Then you may consider creating a composite index on GROUP BY columns.
According to this link: http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html
The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index
It means, that if we create the following index:
CREATE INDEX idx_name ON `users_activity`(
`date`,`time`,`duration_checkout`,`username`
);
then MySQL might use it to optimize GROUP BY (but there is no guarantee).

MySQL innoDB: Long time of query execution

I'm having trouble running this SQL (shown below under "Query").
I think it's an index problem, but I don't know, because I didn't make this database and I'm just a simple programmer.
The problem is that the table has 64,260 records, so the query goes crazy when executing; I have to stop MySQL and restart it because the computer freezes.
Thanks.
EDIT: table Schema
CREATE TABLE IF NOT EXISTS `value_magnitudes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`value` float DEFAULT NULL,
`magnitude_id` int(11) DEFAULT NULL,
`sdi_belongs_id` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`reading_date` datetime DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1118402 ;
Query
select * from value_magnitudes
where id in
(
SELECT min(id)
FROM value_magnitudes
WHERE magnitude_id = 234
and date(reading_date) >= '2013-04-01'
group by date(reading_date)
)
First, add an index on (magnitude_id, reading_date):
ALTER TABLE value_magnitudes
ADD INDEX magnitude_id__reading_date__IX -- just a name for the index
(magnitude_id, reading_date) ;
Then try this variation:
SELECT vm.*
FROM value_magnitudes AS vm
JOIN
( SELECT MIN(id) AS id
FROM value_magnitudes
WHERE magnitude_id = 234
AND reading_date >= '2013-04-01' -- changed so index is used
GROUP BY DATE(reading_date)
) AS vi
ON vi.id = vm.id ;
The GROUP BY DATE(reading_date) will still need to apply the function to all the selected (through the index) rows, and that cannot be improved unless you follow @jurgen's advice and split the column into separate date and time columns.
Since you want results for every day, you need to extract the date from a datetime column with the DATE() function. That makes indexes useless.
You can split the reading_date column into separate reading_date and reading_time columns. Then you can run the query without the function, and indexes will work (see the sketch after the query below).
Additionally you can change the query into a join
select *
from value_magnitudes v
inner join
(
SELECT min(id) as id
FROM value_magnitudes
WHERE magnitude_id = 234
and reading_date >= '2013-04-01'
group by reading_date
) x on x.id = v.id
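A sketch of what that column split could look like; the column and index names are illustrative, and on a large table you would want to backfill in batches:
ALTER TABLE value_magnitudes
  ADD COLUMN reading_day date,
  ADD COLUMN reading_time time;

UPDATE value_magnitudes
  SET reading_day  = DATE(reading_date),
      reading_time = TIME(reading_date);

ALTER TABLE value_magnitudes
  ADD INDEX magnitude_day (magnitude_id, reading_day);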
For starters, I would change your query to:
select * from value_magnitudes where id = (
select min(id) from value_magnitudes
where magnitude_id = 234
and DATE(reading_date) >= '2013-04-01'
)
You don't need to use the IN clause when the subquery is only going to return one record.
Then, I would make sure you have an index on magnitude_id and reading_date (probably a two field index) as that's what you are querying against in the subquery. Without that index, you are scanning the table each time.
Also if possible change magnitude_id and reading_date to non null. Null values and indexes are not great fits.
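A sketch of both suggestions combined (assuming the two columns contain no NULLs yet; otherwise backfill them first):
ALTER TABLE value_magnitudes
  MODIFY magnitude_id int(11) NOT NULL,
  MODIFY reading_date datetime NOT NULL,
  ADD INDEX magnitude_reading (magnitude_id, reading_date);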

speeding up mysql queries / mysql views in django

I use the following code to select popular news entries (by date) from the database:
popular = Entry.objects.filter(type='A', is_public=True).extra(select = {'dpub': 'date(dt_published)'}).order_by('-dpub', '-views', '-dt_written', 'headline')[0:5]
To compare the execution speeds of a normal query and this one I ran the following mysql queries:
SELECT *, date(dt_published) as dpub FROM `news_entry` order by dpub DESC LIMIT 500
# Showing rows 0 - 29 (500 total, Query took 0.1386 sec)
-
SELECT * , DATE( dt_published ) AS dpub FROM `news_entry` ORDER BY id DESC LIMIT 500
# Showing rows 0 - 29 (500 total, Query took 0.0021 sec) [id: 58079 - 57580]
As you can see the normal query is much faster. Is there a way to speed this up?
Is it possible to use mysql views with django?
I realize I could just split the datetime field into two fields (date and time), but I'm curious.
Structure:
CREATE TABLE IF NOT EXISTS `news_entry` (
`id` int(11) NOT NULL DEFAULT '0',
`views` int(11) NOT NULL,
`user_views` int(11) NOT NULL,
`old_id` int(11) DEFAULT NULL,
`type` varchar(1) NOT NULL,
`headline` varchar(256) NOT NULL,
`subheadline` varchar(256) NOT NULL,
`slug` varchar(50) NOT NULL,
`category_id` int(11) DEFAULT NULL,
`is_public` tinyint(1) NOT NULL,
`is_featured` tinyint(1) NOT NULL,
`dt_written` datetime DEFAULT NULL,
`dt_modified` datetime DEFAULT NULL,
`dt_published` datetime DEFAULT NULL,
`author_id` int(11) DEFAULT NULL,
`author_alt` varchar(256) NOT NULL,
`email_alt` varchar(256) NOT NULL,
`tags` varchar(255) NOT NULL,
`content` longtext NOT NULL
) ENGINE=MyISAM;
SELECT *, date(dt_published) as dpub FROM `news_entry` order by dpub DESC LIMIT 500
This query orders on dpub, while this one:
SELECT * , DATE( dt_published ) AS dpub FROM `news_entry` ORDER BY id DESC LIMIT 500
orders on id.
Since id is most probably the PRIMARY KEY of your table, and every PRIMARY KEY is backed by an implicit index, that ORDER BY does not need to sort.
dpub is a computed field, and MySQL does not support indexes on computed fields. However, since DATE() is monotonic, ORDER BY dt_published yields the same order as ORDER BY dpub.
You need to change your query to this:
SELECT *, date(dt_published) as dpub FROM `news_entry` order by dt_published DESC LIMIT 500
and create an index on news_entry (dt_published).
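For example (the index name is illustrative):
ALTER TABLE news_entry ADD INDEX ix_dt_published (dt_published);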
Update:
Since DATE is a monotonic function, you may employ this trick:
SELECT *, DATE(dt_published) AS dpub
FROM news_entry
WHERE dt_published >=
(
SELECT md
FROM (
SELECT DATE(dt_published) AS md
FROM news_entry
ORDER BY
dt_published DESC
LIMIT 499, 1
) q
UNION ALL
SELECT DATE(MIN(dt_published))
FROM news_entry
LIMIT 1
)
ORDER BY
dpub DESC, views DESC, dt_written DESC, headline
LIMIT 500
This query does the following:
Selects the 500th record in dt_published DESC order, or the earliest record posted should there be fewer than 500 records in the table.
Fetches all records posted on or after the date of that record. Since DATE(x) is always less than or equal to x, there can be more than 500 such records, but still far fewer than the whole table.
Orders and limits these records as appropriate.
You may find this article interesting, since it covers a similar problem:
Things SQL needs: sargability of monotonic functions
You may need an index on dt_published. Could you post the query plans for the two queries?
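To get those plans, prefix each query with EXPLAIN, e.g.:
EXPLAIN SELECT *, DATE(dt_published) AS dpub FROM `news_entry` ORDER BY dpub DESC LIMIT 500;
EXPLAIN SELECT *, DATE(dt_published) AS dpub FROM `news_entry` ORDER BY id DESC LIMIT 500;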