MySQL query to identify concurrent access/usage

I am trying to get a rough idea of a server's past load; I would like to utilize a log table to estimate concurrent access, but I'm getting stuck on building the query. The table is as follows:
FYI: I didn't design/create this table, took over an ugly database :(
CREATE TABLE `user_access_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`oem_id` varchar(50) NOT NULL,
`add_date` datetime NOT NULL,
`username` varchar(36) NOT NULL,
`site_id` int(5) NOT NULL,
`card_id` int(6) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
Table data:
---------------------------------------------------------------------------
| id | add_date | username | site_id | card_id |
---------------------------------------------------------------------------
| 185818901 | 2012-12-25 03:01:04 | 1944E9745A9CE91 | 4 | 27273 |
| 185818900 | 2012-12-25 03:01:04 | EA5902C8115C9FF | 1 | 27238 |
---------------------------------------------------------------------------
This may be far-fetched, or perhaps illogical... but I am trying to count unique usernames for each second of add_date (keeping in mind the data spans months).
Any thoughts?
Thanks,
Kate

Sounds like you just need to use COUNT and DATE_FORMAT:
select DATE_FORMAT(add_date, '%Y-%m-%d %H:%i:%s'),
count(distinct username) usercount
from user_access_log
group by DATE_FORMAT(add_date, '%Y-%m-%d %H:%i:%s')
SQL Fiddle Demo
Depending on how your data is stored, you may not need DATE_FORMAT at all. But this will return a count of distinct usernames grouped by the date, hour, minute and second, which I assume is what you're trying to accomplish. If you are instead trying to figure out which second or hour has the most traffic, you can group by SECOND(add_date) or HOUR(add_date), for example. It depends on your exact needs (the Fiddle includes both examples).
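For instance, a sketch of the second variant (using the question's user_access_log table; the Fiddle takes the same approach) to find the busiest hours of the day:

-- which hour of the day carries the most distinct users
SELECT HOUR(add_date) AS hr,
       COUNT(DISTINCT username) AS usercount
FROM user_access_log
GROUP BY HOUR(add_date)
ORDER BY usercount DESC;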

Related

Optimize Mysql query with two joins with explain

I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optimize it, reading how to improve things by looking at the EXPLAIN output.
One of the tables, person_deliveries, is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.
Here is the query:
SELECT
DATE(pdel.date) AS date,
pdel.ip_address AS ip_address,
pdel.sending_campaigns_id AS campaigns_id,
(substring_index(pe.email, '#', -1)) AS recipient_domain,
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints
FROM
person_deliveries pdel
LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
WHERE
(pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
AND pe.id IS NOT NULL
AND pdel.ip_address is NOT NULL
GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id
ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;
Here is the output of EXPLAIN:
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | pdel | NULL | range | person_campaign_date,ip_address,date,emails_id | date | 5 | NULL | 2333678 | 50.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | pe | NULL | eq_ref | PRIMARY | PRIMARY | 4 | subscriber.pdel.emails_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | pc | NULL | ref | ip_address | ip_address | 18 | subscriber.pdel.ip_address | 128 | 100.00 | NULL |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.
I know I am missing something here, either an index or a function I'm using that is causing it to be slow, but from everything I have read I haven't figured it out yet.
UPDATE:
I neglected to include table info, so I am providing that to be more helpful.
person_deliveries:
CREATE TABLE `person_deliveries` (
`emails_id` int unsigned NOT NULL,
`sending_campaigns_id` int NOT NULL,
`date` datetime NOT NULL,
`vmta` varchar(255) DEFAULT NULL,
`ip_address` varchar(15) DEFAULT NULL,
`sending_domain` varchar(255) DEFAULT NULL,
UNIQUE KEY `person_campaign_date` (`emails_id`,`sending_campaigns_id`,`date`),
KEY `ip_address` (`ip_address`),
KEY `sending_domain` (`sending_domain`),
KEY `sending_campaigns_id` (`sending_campaigns_id`),
KEY `date` (`date`),
KEY `emails_id` (`emails_id`)
);
person_complaints:
CREATE TABLE `person_complaints` (
`emails_id` int unsigned NOT NULL,
`campaigns_id` int unsigned NOT NULL,
`complaint_datetime` datetime DEFAULT NULL,
`added_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`ip_address` varchar(15) DEFAULT NULL,
`networks_id` int DEFAULT NULL,
`mailing_domains_id` int DEFAULT NULL,
UNIQUE KEY `email_campaign_date` (`emails_id`,`campaigns_id`,`complaint_datetime`),
KEY `ip_address` (`ip_address`)
);
person_emails:
CREATE TABLE `person_emails` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`data_providers_id` tinyint unsigned DEFAULT NULL,
`email` varchar(255) NOT NULL,
`email_md5` varchar(255) DEFAULT NULL,
`original_import` timestamp NULL DEFAULT NULL,
`last_import` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`),
KEY `data_providers_id` (`data_providers_id`),
KEY `email_md5` (`email_md5`)
);
Hopefully this extra info helps.
Too many questions to fit in comments, so answering here.
It appears that for the date criteria you are only pulling a SINGLE date. Is this always the case, or just this sample? And is pdel.date stored as a date or as a date/time? Your query is doing >= '2022-03-11' AND <= '2022-03-12'. Is this because you are trying to get up to and including 2022-03-11 at 11:59:59pm, and if so, should it be LESS THAN 03-12?
If your counts are done on a single-day basis, and this data is rather fixed -- that is, you are not going to be changing deliveries, etc. on a day that has already passed -- this might be a candidate for a stored aggregate table that is updated on a daily basis. That way, when you are looking for activity patterns, the non-changing aggregates are already computed and you just query those; if you need the details, go back to the raw data.
These indexes are "covering", which should help some:
pdel: INDEX(date, ip_address, sending_campaigns_id, emails_id)
pc: INDEX(ip_address, campaigns_id, emails_id)
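As ALTER statements, those might look like this (index names are my own placeholders):

ALTER TABLE person_deliveries
    ADD INDEX ix_date_ip_camp_email (date, ip_address, sending_campaigns_id, emails_id);
ALTER TABLE person_complaints
    ADD INDEX ix_ip_camp_email (ip_address, campaigns_id, emails_id);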
Assuming date is a DATETIME, this contains an extra midnight:
WHERE pdel.date >= '2022-03-11'
AND pdel.date <= '2022-03-12'
I prefer the pattern:
WHERE pdel.date >= '2022-03-11'
AND pdel.date < '2022-03-11' + INTERVAL 1 DAY
When the GROUP BY and ORDER BY are different, an extra sort is (usually) required. So, write the GROUP BY to be just like the ORDER BY (after removing "ASC").
A minor simplification (and speedup):
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
-->
COUNT(DISTINCT pdel.emails_id, pdel.date) AS deliveries,
Consider storing the numeric version of the IPv4 in INT UNSIGNED (only 4 bytes) instead of a VARCHAR. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA in the SELECT.
The COUNT(CASE ... ) can be simplified to
SUM( pdel.ip_address = pc.ip_address
AND pdel.sending_campaigns_id = pc.campaigns_id
AND pdel.emails_id = pc.emails_id ) AS complaints
In
(substring_index(pe.email, '#', -1)) AS recipient_domain,
I think it should be 1, not -1, or the alias is 'wrong'.
Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL to the equivalent, but simpler, JOIN pe ... without the null test.
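That is, roughly:

FROM person_deliveries pdel
LEFT JOIN person_complaints pc ON pc.ip_address = pdel.ip_address
JOIN person_emails pe ON pe.id = pdel.emails_id
-- ...and drop "AND pe.id IS NOT NULL" from the WHERE clause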
Sorry, but those changes will not provide a huge performance improvement. The next step would be to build and maintain a Summary Table and use it to generate the desired 'report'. (See DRapp's Answer.)
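A minimal sketch of such a daily summary table, assuming the report keys on day, IP and campaign (table and column names are mine, not from the question):

CREATE TABLE deliveries_daily (
    dy DATE NOT NULL,
    ip_address VARCHAR(15) NOT NULL,
    campaigns_id INT NOT NULL,
    deliveries INT UNSIGNED NOT NULL,
    PRIMARY KEY (dy, ip_address, campaigns_id)
);
-- refresh once per day, for the day just ended
INSERT INTO deliveries_daily
SELECT DATE(date), ip_address, sending_campaigns_id,
       COUNT(DISTINCT emails_id, date)
FROM person_deliveries
WHERE date >= CURRENT_DATE - INTERVAL 1 DAY
  AND date <  CURRENT_DATE
  AND ip_address IS NOT NULL
GROUP BY DATE(date), ip_address, sending_campaigns_id;

The report queries then read this small table instead of scanning millions of raw rows.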

How do I speed up 'where MATCH AGAINST and DATETIME' search in MySQL?

I have a rather large database where I would like to search/filter on a MEDIUMTEXT (tags), DATETIME (created_time) and a BIT (include) column.
Let's say the database looks like this:
+------+-----------------------+--------------------------+---------+
| id | created_time | tags | include |
|(INT) | (DATETIME) | (MEDIUMTEXT) | (BIT) |
+------+-----------------------+--------------------------+---------+
| 1 | '2017-02-20 08:58:06' | 'client 1' | 1 |
| 2 | '2017-03-01 18:12:00' | 'client 1 and client 2' | 0 |
| 3 | '2017-03-02 02:52:35' | 'client 3 plus client 1' | 0 |
| 4 | '2017-03-03 12:41:58' | 'client 1' | 1 |
| 5 | '2017-03-05 18:03:12' | 'client 2, client 3' | 1 |
| 6 | '2017-03-06 20:25:45' | 'client 1 and client 3' | 0 |
| 7 | '2017-03-08 22:51:22' | 'client 1' | 1 |
+------+-----------------------+--------------------------+---------+
I have indexed the DATETIME and BIT columns and I have used a FULLTEXT index on the MEDIUMTEXT column.
If I run this statement:
select statement 1
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE))
AND created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 14 sec. to run and returns 6700 rows.
However, if I run:
select statement 2
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE));
It takes 0.4 sec. to run and returns 145000 rows, and if I run:
select statement 3
------------------
SELECT COUNT(*)
FROM database
WHERE created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 0.5 sec. to run and returns 25000 rows.
Now my question is: how do I make 'select statement 1' run faster? Do I need to first run 'select statement 2' and then run 'select statement 3' on the results? If so, how? Does anyone have experience with UNION, and can I use it here? Or is there a way to create a multiple-column index combining a plain INDEX and a FULLTEXT index?
Added info on the actual table (not the example above), with special thanks to @rick-james:
Query 1:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00'
AND MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 2:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 3:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00';
EXPLAIN for the 3 queries:
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 1 | SIMPLE | Twitter_tweet | fulltext | created_time_INDEX,SELECT_tags_INDEX,tags_FULLTEXT | tags_FULLTEXT | 0 | const | 1 | 50.00 | Using where; Ft_hints: no_ranking |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 2 | SIMPLE | | | | | | | | | Select tables optimized away |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 3 | SIMPLE | Twitter_tweet | range | created_time_INDEX,SELECT_tags_INDEX | created_time_INDEX | 6 | | 572286 | 100.00 | Using where; Using index |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
SHOW CREATE TABLE:
CREATE TABLE `Twitter_tweet` (
`post_id` bigint(20) unsigned NOT NULL,
`from_user_id` bigint(20) unsigned NOT NULL,
`from_user_username` tinytext,
`from_user_fullname` tinytext,
`message` mediumtext,
`created_time` datetime DEFAULT NULL,
`quoted_post_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_username` tinytext,
`quoted_user_fullname` tinytext,
`to_post_id` bigint(20) unsigned DEFAULT NULL,
`to_user_id` bigint(20) unsigned DEFAULT NULL,
`to_user_username` tinytext,
`truncated` bit(1) DEFAULT NULL,
`is_retweet` bit(1) DEFAULT NULL,
`retweeting_post_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_username` tinytext,
`retweeting_user_fullname` tinytext,
`tags` text,
`mentions_user_id` text,
`mentions_user_username` text,
`mentions_user_fullname` text,
`post_urls` text,
`count_favourite` int(11) DEFAULT NULL,
`count_retweet` int(11) DEFAULT NULL,
`lang` tinytext,
`location_longitude` float(13,10) DEFAULT NULL,
`location_latitude` float(13,10) DEFAULT NULL,
`place_id` tinytext,
`place_fullname` tinytext,
`source` tinytext,
`fetchtime` datetime DEFAULT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `post_id_UNIQUE` (`post_id`),
KEY `from_user_id_INDEX` (`from_user_id`),
KEY `quoted_user_id_INDEX` (`quoted_user_id`),
KEY `to_user_id_INDEX` (`to_user_id`),
KEY `retweeting_user_id_INDEX` (`retweeting_user_id`),
KEY `created_time_INDEX` (`created_time`),
KEY `retweeting_post_id_INDEX` (`retweeting_post_id`),
KEY `post_all_id_INDEX` (`post_id`,`retweeting_post_id`,`to_post_id`,`quoted_post_id`),
KEY `quoted_post_id_INDEX` (`quoted_post_id`),
KEY `to_post_id_INDEX` (`to_post_id`),
KEY `is_retweet_INDEX` (`is_retweet`),
KEY `SELECT_tags_INDEX` (`created_time`,`is_retweet`,`post_id`),
FULLTEXT KEY `tags_FULLTEXT` (`tags`),
FULLTEXT KEY `mentions_user_id_FULLTEXT` (`mentions_user_id`),
FULLTEXT KEY `message_FULLTEXT` (`message`),
FULLTEXT KEY `content_select` (`tags`,`message`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
When timing, do two things:
Turn off the Query cache (or use SELECT SQL_NO_CACHE ...).
Run the query twice.
When a query is run, these happen:
Check the QC to see if exactly the same query was recently run; if so, return the result from that run. This usually takes ~1ms. (This is not what happened in the examples you gave.)
Perform the query. Now there are multiple sub-cases:
If the "buffer pool" is 'cold', this is likely to involve lots of I/O. I/O is slow. This may explain your 14 second run.
If the desired data is cached in RAM, then it will run faster. This probably explains why the other two runs were a lot faster.
If, after compensating for these, you still have issues, please provide SHOW CREATE TABLE and EXPLAIN SELECT ... for the cases. (There could be other factors involved.)
Schema critique
One way to improve performance (some) is to shrink the data.
lang tinytext, -- there is a 5 char standard
BIGINT takes 8 bytes. A 4-byte INT is enough for half the people in the world. (But first verify that your AUTO_INCREMENTs are not burning a lot of ids.)
For subtle reasons, VARCHAR(255) is better than TINYTEXT, even though they seem equivalent. Whenever practical, use something less than 255.
FLOAT(13,10) has some issues; I recommend DECIMAL(8,6)/(9,6) as sufficient for distinguishing two tweeters sitting next to each other (not that GPS is that precise).
A PRIMARY KEY is a UNIQUE key; get rid of the redundant UNIQUE.
With INDEX(a, b), you don't also need INDEX(a). (at least 2 cases of such)
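Pulling that critique together, a sketch of the corresponding ALTERs (assumptions: your language codes fit CHAR(5), and DECIMAL(9,6)/(8,6) is enough precision for the coordinates):

ALTER TABLE Twitter_tweet
    MODIFY lang CHAR(5),
    MODIFY location_longitude DECIMAL(9,6),
    MODIFY location_latitude DECIMAL(8,6),
    DROP INDEX post_id_UNIQUE,        -- redundant with the PRIMARY KEY
    DROP INDEX created_time_INDEX;    -- prefix of SELECT_tags_INDEX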
Bulk
What will you do with 6700 or 25000 rows in the resultset? I ask because the effort of returning lots of rows is part of the performance problem. If your next step is to further whittle down the output, then it may be better to do the whittling in SQL.
Analysis
Looking at the second set of Queries:
FT + date range. This first did the FT search, then further filtered by date.
FT, count results, quit. Note that all of that was done in the EXPLAIN, hence "Select tables optimized away" -- and the EXPLAIN time is the same as the SELECT time.
Scan one index for an estimated 572K rows -- done entirely in the index. This cannot be improved. However, it can be made severely worse -- such as by adding a seemingly innocuous AND include = 0. In this case it would not be able to use just the index, but instead would have to bounce between the index and the data -- a lot more costly. A cure for this case: INDEX(include, created_time), which would run faster.
COUNT(*) is potentially cheap -- no need to return lots of data, often can be completed within an index, etc.
SELECT col1, col2 is faster than SELECT * -- especially because of TEXT columns.

How can I optimize a query which depends on both COUNT and GROUP BY?

I have a query whose purpose is to generate statistics for how many musical works (tracks) have been downloaded from a site in different periods (by month, by quarter, by year, etc.). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to a specific album, I would do the following query:
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't', as entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album was downloaded, and entityusage_file would then hold all the tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferably finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
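Expressed as DDL, that suggestion might look like this (index names are mine):

ALTER TABLE track ADD INDEX ix_album_track (album_id, id);
ALTER TABLE entityusage_file ADD INDEX ix_track_usage (track_id, entityusage_id);
ALTER TABLE entityusage ADD INDEX ix_id_type_action (id, entitytype, action);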
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall pole" is the eu table: chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT eu.entityid AS track_id   -- the posted schema has entityid, not track_id; for entitytype 't' it holds the track id
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1, 1, 0)) AS cnt
FROM entityusage eu
WHERE eu.entitytype = 't'
  AND eu.updated IS NOT NULL
GROUP BY eu.entityid
       , DATE(eu.updated);
An index ON entityusage (entitytype, entityid, updated, action) may benefit performance.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
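For example, a sketch of the report query against the precomputed table (assuming the definition above, plus the original track table):

SELECT utbd.updated_dt AS p, SUM(utbd.cnt) AS c
FROM usage_track_by_day AS utbd
JOIN track AS t ON t.id = utbd.track_id
WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
GROUP BY utbd.updated_dt;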
Can you try this? I can't really test it without some sample data from you.
In this case the query looks first at the track table and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND eu.entitytype = 't'
AND eu.action = 1
GROUP BY date_format(eu.updated, '%Y%m%d');

MySQL InnoDB is very slow on SELECT query

I have a mysql table with following structure:
mysql> show create table logs \G;
Create Table: CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`request` text,
`response` longtext,
`msisdn` varchar(255) DEFAULT NULL,
`username` varchar(255) DEFAULT NULL,
`shortcode` varchar(255) DEFAULT NULL,
`response_code` varchar(255) DEFAULT NULL,
`response_description` text,
`transaction_name` varchar(250) DEFAULT NULL,
`system_owner` varchar(250) DEFAULT NULL,
`request_date_time` datetime DEFAULT NULL,
`response_date_time` datetime DEFAULT NULL,
`comments` text,
`user_type` varchar(255) DEFAULT NULL,
`channel` varchar(20) DEFAULT 'WEB',
/**
other columns here....
other 18 columns here, with Type varchar and Text
**/
PRIMARY KEY (`id`),
KEY `transaction_name` (`transaction_name`) USING BTREE,
KEY `msisdn` (`msisdn`) USING BTREE,
KEY `username` (`username`) USING BTREE,
KEY `request_date_time` (`request_date_time`) USING BTREE,
KEY `system_owner` (`system_owner`) USING BTREE,
KEY `shortcode` (`shortcode`) USING BTREE,
KEY `response_code` (`response_code`) USING BTREE,
KEY `channel` (`channel`) USING BTREE,
KEY `request_date_time_2` (`request_date_time`),
KEY `response_date_time` (`response_date_time`)
) ENGINE=InnoDB AUTO_INCREMENT=59582405 DEFAULT CHARSET=utf8
and it has more than 30000000 records in it.
mysql> select count(*) from logs;
+----------+
| count(*) |
+----------+
| 38962312 |
+----------+
1 row in set (1 min 17.77 sec)
Now the problem is that it is very slow; a SELECT takes ages to fetch records from the table.
The following subquery-based query takes almost 30 minutes to fetch the records for one day:
SELECT
COUNT(sub.id) AS count,
DATE(sub.REQUEST_DATE_TIME) AS transaction_date,
sub.SYSTEM_OWNER,
sub.transaction_name,
sub.response,
MIN(sub.response_time),
MAX(sub.response_time),
AVG(sub.response_time),
sub.channel
FROM
(SELECT
id,
REQUEST_DATE_TIME,
RESPONSE_DATE_TIME,
TIMESTAMPDIFF(SECOND, REQUEST_DATE_TIME, RESPONSE_DATE_TIME) AS response_time,
SYSTEM_OWNER,
transaction_name,
(CASE
WHEN response_code IN ('0' , '00', 'EIL000') THEN 'Success'
ELSE 'Failure'
END) AS response,
channel
FROM
logs
WHERE
response_code != ''
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00' AND '2016-10-27 00:00:00'
AND SYSTEM_OWNER != '') sub
GROUP BY DATE(sub.REQUEST_DATE_TIME) , sub.channel , sub.SYSTEM_OWNER , sub.transaction_name , sub.response
ORDER BY DATE(sub.REQUEST_DATE_TIME) DESC , sub.SYSTEM_OWNER , sub.transaction_name , sub.response DESC;
I have also added indexes to my table, but it is still very slow.
Any help on how I can make it fast?
EDIT:
Ran the above query using EXPLAIN
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 16053297 | Using temporary; Using filesort |
| 2 | DERIVED | logs | ALL | system_owner,response_code | NULL | NULL | NULL | 32106592 | Using where |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
As it stands, the query must scan the entire table.
But first, let's air a possible bug:
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00'
AND '2016-10-27 00:00:00'
This gives you the logs for two days -- all of the 26th and all of the 27th. Or is that what you really wanted? (BETWEEN is inclusive.)
But the performance problem is that the index will not be used because request_date_time is hiding inside a function (DATE).
Jump forward to a better way to phrase it:
AND REQUEST_DATE_TIME >= '2016-10-26'
AND REQUEST_DATE_TIME < '2016-10-26' + INTERVAL 1 DAY
A DATETIME can be compared against a date.
Midnight of the morning of the 26th is included, but midnight of the 27th is not.
You can easily change 1 to however many days you wish -- without having to deal with leap days, etc.
This formulation allows the use of the index on request_date_time, thereby cutting back severely on amount of data to be scanned.
As for other tempting areas:
!= does not optimize well, so no 'composite' index is likely to be beneficial.
Since we can't really get past the WHERE, no index is useful for GROUP BY or ORDER BY.
My comments about DATE() in WHERE do not apply to GROUP BY; no change needed.
Why have the subquery? I think it can be done in a single layer. This will eliminate a rather large temp table. (Yeah, it means 3 uses of TIMESTAMPDIFF(), but that is probably a lot cheaper than the temp table.)
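Here is a sketch of that single-layer rewrite (untested; it also folds in the date-range fix from above, and the output aliases are mine):

SELECT COUNT(*) AS count,
       DATE(request_date_time) AS transaction_date,
       system_owner,
       transaction_name,
       CASE WHEN response_code IN ('0' , '00', 'EIL000') THEN 'Success'
            ELSE 'Failure' END AS response,
       MIN(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS min_response_time,
       MAX(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS max_response_time,
       AVG(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS avg_response_time,
       channel
FROM logs
WHERE response_code != ''
  AND system_owner != ''
  AND request_date_time >= '2016-10-26'
  AND request_date_time <  '2016-10-26' + INTERVAL 1 DAY
GROUP BY transaction_date, channel, system_owner, transaction_name, response
ORDER BY transaction_date DESC, system_owner, transaction_name, response DESC;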
How much RAM? What is the value of innodb_buffer_pool_size?
If my comments are not enough, and if you frequently run a query like this (over a day or over a date range), then we can talk about building and maintaining a Summary table, which might give you a 10x speedup.

Getting the total consumption

If I have the following tables:
create table rar (
rar_id int(11) not null auto_increment primary key,
rar_name varchar (20));
create table data_link(
id int(11) not null auto_increment primary key,
rar_id int(11) not null,
foreign key(rar_id) references rar(rar_id));
create table consumption (
id int(11) not null,
foreign key(id) references data_link(id),
consumption int(11) not null,
total_consumption int(11) not null,
date_time datetime not null);
I want the total_consumption field to be all the consumption values added up. Is there a way to accomplish this through triggers? Or do I need to read all the previous values plus the latest value each time, sum them up, and then update the table? Is there a better way to do this?
--------------------------------------------------
id | consumption | total_consumption | date_time |
==================================================|
1 | 5 | 5 | 09/09/2013 |
2 | 5 | 10 | 10/09/2013 |
3 | 7 | 17 | 11/09/2013 |
4 | 3 | 20 | 11/09/2013 |
--------------------------------------------------
Just wondering if there is a cleaner, faster way of getting the total each time a new entry is added.
Or perhaps this is bad design? Would it be better to have something like:
SELECT SUM(consumption) FROM consumption WHERE date_time BETWEEN '2013-09-09' AND '2013-09-11'
to get this type of information? The only problem I see with this is that the same command would be re-run multiple times, and the result would be recomputed on every request rather than stored; that could be inefficient when re-generating the same report several times over for viewing purposes. If the total is already calculated, all you have to do is read the data rather than compute it again and again. Thoughts?
Any help would be appreciated.
If you've got an index on total_consumption, it won't noticeably slow the query down to have a nested select of MAX(total_consumption) as part of the insert, since the max value will already be stored in the index.
e.g.:
INSERT INTO `consumption` (consumption, total_consumption)
VALUES (8,
    -- the derived table sidesteps MySQL's "can't specify target table" error;
    -- COALESCE covers the first row, when MAX() over an empty table is NULL
    consumption + (SELECT COALESCE(mx, 0)
                   FROM (SELECT MAX(total_consumption) AS mx FROM consumption) AS t)
);
I'm not sure how you're using the id column but you can easily add criteria to the nested select to control this.
If you do need to put a WHERE on the nested select, make sure you have an index across the fields you use and then the total_consumption column. For example, if you make it ... WHERE id = x, you'll need an index on (id, total_consumption) for it to work efficiently.
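For instance (a sketch; the index name is mine):

ALTER TABLE consumption ADD INDEX ix_id_total (id, total_consumption);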
If you MUST have a trigger, it should be like this:
DELIMITER $$
CREATE TRIGGER `chg_consumption` BEFORE INSERT ON `consumption`
FOR EACH ROW BEGIN
    -- IFNULL covers the very first row, when the table is empty and MAX() is NULL
    SET NEW.total_consumption = (SELECT IFNULL(MAX(total_consumption), 0)
                                 FROM consumption) + NEW.consumption;
END;
$$
DELIMITER ;
p.s. Either keep the IFNULL(..., 0) above, or make total_consumption nullable or give it a DEFAULT 0, so the very first insert (when MAX() is NULL) doesn't violate NOT NULL.
EDIT:
Changed SUM(total_consumption) to MAX(total_consumption), as @calcinai suggested.