MySQL Query Performance - Query/Schema/Indexes? - mysql

I'm having performance issues with queries, mainly against my largest table, which holds call data.
The main query contains quite a few left joins and sub-selects. When I run it in a scenario where I expect about 1.3M calls back, it never finishes; having to stop it after 7 minutes means there's definitely a problem somewhere.
I've narrowed down the main query and tested the simplest sub-select join on its own, which is
SELECT
DateStart,
ID,
NumbID,
EffectiveFlag,
OrigNumber
FROM calls
WHERE
DateStart <= '2013-12-31'
AND DateStart >= '2013-01-01'
AND CallLength >= '00:00:00'
AND Direction = '1'
AND CustID IN (474,482,250,268,197,604,132,359,279,441,118,448,152,133,380,162,249,679,226,259,2450,2408,2451,2453,2439,2454,2444,2445,2452)
Even that query takes 4.5s, so when it's a sub-select in a query with other joins and sub-selects, I can see why the query as a whole is unusable.
The explain statement for the above query is
+----+-------------+-------+-------+-------------------------------------------------------------------------------------------------------+----------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-------------------------------------------------------------------------------------------------------+----------------------+---------+------+---------+-------------+
| 1 | SIMPLE | calls | range | idx_CustID,idx_DateStart,idx_CustID_DateStart,idx_CustID_TermNumber,idx_Direction | idx_CustID_DateStart | 7 | NULL | 1660009 | Using where |
+----+-------------+-------+-------+-------------------------------------------------------------------------------------------------------+----------------------+---------+------+---------+-------------+
The database schema of the calls table is
+-------------------+-------------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------------------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| CustID | int(11) | NO | MUL | 0 | |
| CarrID | int(11) | NO | MUL | NULL | |
| TariID | int(11) | NO | MUL | 0 | |
| CarrierRef | varchar(30) | NO | MUL | | |
| NumbID | int(11) | NO | MUL | 0 | |
| VlviID | int(11) | NO | MUL | NULL | |
| VcamID | int(11) | NO | MUL | NULL | |
| SomeID | int(11) | NO | MUL | NULL | |
| VlnsID | int(11) | NO | MUL | NULL | |
| NGNumber | varchar(12) | NO | | | |
| OrigNumber | varchar(16) | NO | MUL | NULL | |
| CLIRestrictedFlag | int(2) | NO | | NULL | |
| OrigLocality | varchar(11) | NO | MUL | | |
| OrigAreaCode | varchar(11) | NO | MUL | | |
| TermNumber | varchar(16) | NO | MUL | NULL | |
| BatchNumber | varchar(10) | NO | MUL | | |
| DateStart | date | NO | MUL | 0000-00-00 | |
| DateClear | date | NO | | 0000-00-00 | |
| TimeStart | time | NO | | 00:00:00 | |
| TimeClear | time | NO | | 00:00:00 | |
| CallLength | time | NO | | 00:00:00 | |
| RingLength | time | NO | | 00:00:00 | |
| EffectiveFlag | smallint(1) | NO | MUL | NULL | |
| UnansweredFlag | smallint(1) | NO | MUL | NULL | |
| EngagedFlag | smallint(1) | NO | | NULL | |
| RecID | int(11) | NO | MUL | NULL | |
| CreatedUserID | int(11) | NO | | 0 | |
| CreatedDatetime | datetime | NO | MUL | 0000-00-00 00:00:00 | |
| Direction | int(1) | NO | MUL | NULL | |
+-------------------+-------------+------+-----+---------------------+----------------+
The indexes on the calls table are
+-------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| calls | 0 | PRIMARY | 1 | ID | A | 23905312 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CustID | 1 | CustID | A | 1685 | NULL | NULL | | BTREE | |
| calls | 1 | idx_NumbID | 1 | NumbID | A | 37765 | NULL | NULL | | BTREE | |
| calls | 1 | idx_OrigNumber | 1 | OrigNumber | A | 5976328 | NULL | NULL | | BTREE | |
| calls | 1 | idx_OrigLocality | 1 | OrigLocality | A | 45019 | NULL | NULL | | BTREE | |
| calls | 1 | idx_OrigAreaCode | 1 | OrigAreaCode | A | 846 | NULL | NULL | | BTREE | |
| calls | 1 | idx_TermNumber | 1 | TermNumber | A | 232090 | NULL | NULL | | BTREE | |
| calls | 1 | idx_DateStart | 1 | DateStart | A | 4596 | NULL | NULL | | BTREE | |
| calls | 1 | idx_EffectiveFlag | 1 | EffectiveFlag | A | 2 | NULL | NULL | | BTREE | |
| calls | 1 | idx_UnansweredFlag | 1 | UnansweredFlag | A | 2 | NULL | NULL | | BTREE | |
| calls | 1 | idx_EngagedFlag | 1 | UnansweredFlag | A | 2 | NULL | NULL | | BTREE | |
| calls | 1 | idx_TariID | 1 | TariID | A | 110 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CustID_DateStart | 1 | CustID | A | 1685 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CustID_DateStart | 2 | DateStart | A | 919435 | NULL | NULL | | BTREE | |
| calls | 1 | idx_NumbID_DateStart | 1 | NumbID | A | 37765 | NULL | NULL | | BTREE | |
| calls | 1 | idx_NumbID_DateStart | 2 | DateStart | A | 5976328 | NULL | NULL | | BTREE | |
| calls | 1 | idx_RecID | 1 | RecID | A | 288015 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CarrierRef | 1 | CarrierRef | A | 7968437 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CustID_CallTermNumber | 1 | CustID | A | 1685 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CustID_CallTermNumber | 2 | TermNumber | A | 246446 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CreatedDatetime | 1 | CreatedDatetime | A | 771139 | NULL | NULL | | BTREE | |
| calls | 1 | idx_Direction | 1 | Direction | A | 2 | NULL | NULL | | BTREE | |
| calls | 1 | idx_VlviID | 1 | VlviID | A | 50539 | NULL | NULL | | BTREE | |
| calls | 1 | idx_SomeID | 1 | SomeID | A | 30 | NULL | NULL | | BTREE | |
| calls | 1 | idx_VcamID | 1 | VcamID | A | 64 | NULL | NULL | | BTREE | |
| calls | 1 | idx_VlnsID | 1 | VlnsID | A | 191 | NULL | NULL | | BTREE | |
| calls | 1 | idx_CarrID | 1 | CarrID | A | 4 | NULL | NULL | | BTREE | |
| calls | 1 | idx_BatchNumber | 1 | BatchNumber | A | 271651 | NULL | NULL | | BTREE | |
+-------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
One thing I understand may be hurting performance is the indexes on columns with low cardinality. I know a column such as Direction, which has a cardinality of 2, is probably worse off with an index, but that alone shouldn't make the statement this slow.
On the cardinality needed for a worthwhile index: is there a general ratio of cardinality to total table rows above which an index improves performance and below which it hurts?
I understand that no one is going to be able to throw an answer at me that will change the query time from 4.5s to 0.01s, but any advice on the query itself, the table schema, the indexes, or the hardware would be greatly appreciated.
Update:
@Sebas: "could you please rerun the query AND explain plan without the part: AND CallLength >= '00:00:00' AND Direction = '1' please?"
+----+-------------+-------+-------+---------------------------------------------------------------------+----------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------------------------------------------------------+----------------------+---------+------+--------+-------------+
| 1 | SIMPLE | calls | range | idx_CustID,idx_DateStart,idx_CustID_DateStart,idx_CustID_TermNumber | idx_CustID_DateStart | 7 | NULL | 724813 | Using where |
+----+-------------+-------+-------+---------------------------------------------------------------------+----------------------+---------+------+--------+-------------+

Is your "DateStart" a truncated datetime, i.e. does it keep the date only? If not, you may want to build a column with the truncated value (by day or by hour) and store it as an int, which makes the index much smaller and the query faster.
Or, another way to optimize (golden rule #1: don't do it; #2: don't do it now):
If and only if your dates and PK values increase in the same sequence, you can build an external index that maps each StartDate to a range of IDs (PK),
and use the pattern below:
SELECT @start := ID_START FROM ANOTHER_TABLE WHERE StartDate = '2013-01-01';
SELECT @end := ID_END FROM ANOTHER_TABLE WHERE StartDate = '2013-12-31';
SELECT * FROM calls WHERE ID BETWEEN @start AND @end AND CustId IN (xxxxx) ....
With this pattern, MySQL knows it has to scan only a segment of the table.

Like Darhazer said, you have way too many indexes. Start by removing all of them and build them up again based on your needs.
For this specific query, create one INDEX with these fields in it:
DateStart
CallLength
Direction
CustID
Change AND Direction = '1' to AND Direction = 1 (remove the quotes, you're comparing an integer, not a string)
And see what this does to your query time. If this goes well, add the subquery, check it again with EXPLAIN, add the needed indexes and so on.

The best index that your query should be hitting is idx_CustID_DateStart. The IN statement is preventing that from happening. If the CustID list is from a table, I suggest JOIN it in, instead of enumerating.

I am not sure the original query, the one that takes more than 7 minutes, is written correctly if you are worried about a subquery that takes 5 seconds (hopefully it is not executed for each row). But anyway, if you want to speed this one up, you should read up on how indexes work. I would recommend this article to begin with.
Basically you have conditions on 4 fields, and on two of those fields they are range conditions. If you have read the article, you know that an index is used effectively only up to the first range condition. The rest of the data in the index can still be used for index scanning, though. Thus, you need to choose which condition narrows the result set more: the one on DateStart or the one on CallLength.
Anyway, you need a composite index that starts with (CustID, Direction .... My feeling is that the condition on DateStart is the better one. So I would start with (CustID, Direction, DateStart, CallLength) and compare it with (CustID, Direction, DateStart), because the last field might not give a sufficient performance gain but will take memory resources.
I still think, though, that one should be sure the rest of the query is written correctly before concentrating on a subquery. There might be a better way to organize the query, so that this optimization turns out to be irrelevant.
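As a rough sketch of the comparison described above, the two candidate indexes could be added and then checked one at a time with EXPLAIN. The index names idx_cust_dir_date_len and idx_cust_dir_date are made up for illustration, and the shortened CustID list stands in for the full list in the question. Which index wins depends on the actual data, and building each index on a table of this size will take a while.

```sql
-- Candidate 1: equality/IN columns first, then both range columns.
-- (Index names here are hypothetical.)
ALTER TABLE calls
  ADD INDEX idx_cust_dir_date_len (CustID, Direction, DateStart, CallLength);

-- Candidate 2: the same index without the trailing CallLength column.
ALTER TABLE calls
  ADD INDEX idx_cust_dir_date (CustID, Direction, DateStart);

-- Compare the plans (and timings) by forcing each index in turn.
EXPLAIN
SELECT DateStart, ID, NumbID, EffectiveFlag, OrigNumber
FROM calls FORCE INDEX (idx_cust_dir_date)
WHERE DateStart >= '2013-01-01'
  AND DateStart <= '2013-12-31'
  AND CallLength >= '00:00:00'
  AND Direction = 1
  AND CustID IN (474, 482, 250);  -- shortened list, for illustration only
```

Once the comparison is done, the losing index can be dropped again so it does not add to the write overhead.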

4.5s is not much for 1.6M rows returned; I am pretty sure it's all spent on I/O operations, so there is hardly any room left for optimisation. You'd better show us your original query; maybe we can help more with that.
What percentage of the total do those 1.6M rows make up? Indexes are good when they're used to return a small part of the dataset, but since their data access pattern with MRR is random reads, a full table scan is sometimes more efficient. Of course, that depends on how the data was added to the table and how space was allocated on disk.
You might also find it useful to monitor performance with the MySQL Performance Schema; look here for details.
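For a rough sense of that percentage, the figures already posted in the question can be plugged in: the first EXPLAIN estimates 1,660,009 rows examined, and SHOW INDEX reports a PRIMARY cardinality of 23,905,312. Both numbers are optimizer estimates rather than exact counts, so the result is only approximate.

```sql
-- Estimated share of the table touched by the range scan,
-- using the estimates from EXPLAIN and SHOW INDEX in the question.
SELECT ROUND(1660009 / 23905312 * 100, 1) AS pct_of_table;
-- roughly 6.9 (percent)
```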

You have too many indexes. For example, you don't need a separate CustID index, because CustID is the left-most column in the (CustID, DateStart) index. You have 2 indexes on UnansweredFlag. And do you really need all of those indexes? They not only slow down inserts and updates, they also slow down the optimizer and may trick it into choosing a not-so-good index.
Now, on the specific query. You need to see which field or combination limits the query the most (right now it scans 1.6M rows!) and force it to use that index. So run SELECT COUNT(*) queries for each of the WHERE conditions (direction, call length) with the DateStart range specified (you'll always want to limit based on this). Maybe you just need to add the direction to the index.
Also, before MySQL 5.6, subqueries in the WHERE clause are not optimized, so maybe you should rewrite the entire query to use joins instead of sub-selects rather than optimize this particular query.
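The suggestions above could be tried along these lines. This is only a sketch: it assumes nothing else relies on the two dropped indexes, and it treats idx_EngagedFlag as the duplicate because the SHOW INDEX output in the question lists it on the UnansweredFlag column.

```sql
-- idx_CustID is a left-most prefix of idx_CustID_DateStart, and
-- idx_EngagedFlag (per the SHOW INDEX output) duplicates idx_UnansweredFlag.
ALTER TABLE calls
  DROP INDEX idx_CustID,
  DROP INDEX idx_EngagedFlag;

-- How many rows does each condition leave within the date range?
SELECT COUNT(*) FROM calls
WHERE DateStart BETWEEN '2013-01-01' AND '2013-12-31';

SELECT COUNT(*) FROM calls
WHERE DateStart BETWEEN '2013-01-01' AND '2013-12-31'
  AND Direction = 1;

SELECT COUNT(*) FROM calls
WHERE DateStart BETWEEN '2013-01-01' AND '2013-12-31'
  AND CallLength >= '00:00:00';
```

If a count barely changes when a condition is added, that column is unlikely to be worth a place in the index.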

Related

How to pull a large amount of data from mariadb in a fast manner when workload is IO bound

Important to know information:
We have a database table 'cdrs'
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | bigint(12) | NO | PRI | NULL | auto_increment |
| server_id | tinyint(2) | NO | | 0 | |
| cdr_id | bigint(13) | NO | MUL | 0 | |
| user_id | int(11) | NO | MUL | 0 | |
| transaction_id | int(11) | YES | MUL | NULL | |
| sip_id | int(11) | NO | MUL | 0 | |
| call_type | tinyint(2) | YES | MUL | NULL | |
| did_from | char(24) | YES | | NULL | |
| did_from_alias | bigint(18) | YES | MUL | 0 | |
| did_to | char(24) | YES | | NULL | |
| did_to_alias | bigint(18) | YES | MUL | 0 | |
| call_status | char(12) | YES | MUL | NULL | |
| start_time | int(11) | NO | PRI | 0 | |
| duration | decimal(13,3) | YES | | NULL | |
| billed_duration | decimal(13,3) | NO | | 0.000 | |
| rate | decimal(10,4) | YES | | NULL | |
| amount | decimal(10,4) | YES | | NULL | |
| usf | decimal(10,4) | YES | | NULL | |
| total | decimal(10,4) | YES | MUL | NULL | |
| country | varchar(96) | YES | | NULL | |
| country_id | int(8) | NO | | 0 | |
| code | varchar(8) | NO | | | |
+-----------------+---------------+------+-----+---------+----------------+
With the following indexes
+---------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+---------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cdrs | 0 | PRIMARY | 1 | id | A | 573346816 | NULL | NULL | | BTREE | | | NO |
| cdrs | 0 | PRIMARY | 2 | start_time | A | 573346816 | NULL | NULL | | BTREE | | | NO |
| cdrs | 1 | i_cdr_id | 1 | cdr_id | A | 573346816 | NULL | NULL | | BTREE | | | NO |
| cdrs | 1 | i_user_id | 1 | user_id | A | 158909 | NULL | NULL | | BTREE | | | NO |
| cdrs | 1 | i_call_type | 1 | call_type | A | 19887 | NULL | NULL | YES | BTREE | | | NO |
| cdrs | 1 | i_transaction_id | 1 | transaction_id | A | 143336704 | NULL | NULL | YES | BTREE | | | NO |
| cdrs | 1 | i_total | 1 | total | A | 163953 | NULL | NULL | YES | BTREE | | | NO |
| cdrs | 1 | i_start_time | 1 | start_time | A | 24928122 | NULL | NULL | | BTREE | | | NO |
| cdrs | 1 | i_call_status | 1 | call_status | A | 9877 | NULL | NULL | YES | BTREE | | | NO |
| cdrs | 1 | i_sip_id | 1 | sip_id | A | 353481 | NULL | NULL | | BTREE | | | NO |
| cdrs | 1 | i_did_to | 1 | did_to_alias | A | 286673408 | NULL | NULL | YES | BTREE | | | NO |
| cdrs | 1 | i_did_from | 1 | did_from_alias | A | 57334681 | NULL | NULL | YES | BTREE | | | NO |
+---------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
The Problem:
Currently in production this table has more rows than the max value of int (hence why the id is bigint).
This table grows by GBs every day, and as the company expects to grow, those GBs will turn into tens of GB and eventually hundreds of GB.
Right now the database is running on RAID 10 SSDs.
Taking the following scenario
It's the first of the month and a customer wants to pull his records for last month so he can bill his customer.
This query will be directly I/O bound, because none of the data will be in the InnoDB buffer pool the first time it is queried.
So let's say our query looks similar to
SELECT * FROM cdrs WHERE user_id='<SOME_USER_ID>' AND start_time>=1654099200 AND start_time<1656691200
So we are grabbing all the call records for this user from the beginning of the month to the end of the month.
In a test this resulted in 52,627,431 rows for one client, and it took 50 seconds just to run the query
SELECT Count(*) FROM cdrs WHERE user_id='<SOME_USER_ID>' AND start_time>=1654099200 AND start_time<1656691200;
I have looked at different databases and I just don't know which direction to go in, but we need to be able to pull these records much faster. Currently, just pulling an hour of records takes 30 minutes for some higher-volume customers.
Possible solutions that we have tried
Use a summary table
I know one suggestion will be to use a summary table. We already use one, but it does not help in this instance because customers need the actual data contained in this table.
Partition the table
We partition the table by month and the results are still too slow.
I have tried a columnstore engine in MariaDB, and it gave us significant performance improvements in some areas, but not specifically for getting the CDRs to our customers.
I wish I could use Redis for this because Redis is lightning fast, but the tables currently take up 1.1T of data and will only continue to grow.
Eager and open for a solution.
PRIMARY KEY
WHERE user_id='<SOME_USER_ID>'
AND start_time>=1654099200
AND start_time< 1656691200;
is best handled by
PRIMARY KEY(user_id, start_time) -- in this order.
If a "user" can have two rows with exactly the same start_time, then
PRIMARY KEY(user_id, start_time, id) -- in this order.
INDEX(id)
Otherwise, consider getting rid of id.
Having the PK start with user_id puts all the rows for a given user clustered together, making that (and similar) queries faster.
Having the second part of the PK be start_time significantly helps with that range test on start_time.
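A sketch of what that key change might look like, assuming rows for one user can share a start_time, so id stays in the key. The index name i_id is made up. On a table of this size the ALTER rebuilds the whole table and could run for a very long time, so it would need to be tried on a copy first.

```sql
-- Re-cluster the table by user, then time.
ALTER TABLE cdrs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (user_id, start_time, id),
  ADD INDEX i_id (id);  -- an AUTO_INCREMENT column must lead some index

-- The monthly pull then reads one contiguous slice of the clustered index.
SELECT *
FROM cdrs
WHERE user_id = '<SOME_USER_ID>'
  AND start_time >= 1654099200
  AND start_time <  1656691200;
```

With the primary key starting with user_id, the existing single-column i_user_id index would likely become redundant.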
PARTITION -- There are many "wrong" ways to use PARTITIONing. To discuss that, please provide more details. In general, Partitioning does not provide any speedup.
The main reason I see for using PARTITIONing is if you need to delete "old" data; DROP PARTITION is immensely faster than DELETEing millions of rows.
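To illustrate that difference, here is a sketch that assumes the table is range-partitioned by month on start_time and that a partition named p2022_01 exists. Both the partition name and the cutoff value are hypothetical.

```sql
-- Purging a month by dropping its partition is close to a metadata operation.
ALTER TABLE cdrs DROP PARTITION p2022_01;  -- hypothetical partition name

-- The equivalent DELETE has to find and remove every row individually.
DELETE FROM cdrs
WHERE start_time < 1643673600;  -- hypothetical cutoff (a Unix timestamp)
```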
Data size
Many of the comments I will make here relate to how much disk space the table is/will take.
By "clustering" the data properly, I/O is decreased when the table is much bigger than RAM. What is the setting of innodb_buffer_pool_size? How much RAM do you have? How many GB will the table have before you start purging it?
Clustering may give you 10x speedup.
CHAR -- This is a fixed length datatype; it should be used only if the data is 'always' the length specified. Otherwise, it wastes space.
INDEXes -- Don't blindly add indexes for lots of columns. Only have the ones you need. "Need" comes from the queries you will be running. Consider using "composite" indexes where appropriate.
It seems like most SELECTs would include WHERE user_id = ...; if that is correct, I would expect many of the indexes to be "composite", starting with user_id.
Country -- Don't include a 4-byte INT and a multibyte string; simply have a CHAR(2) for the standard "country_code" and, if needed, a lookup table to map to strings.
Midnight -- When building a daily or monthly report, what should happen to a call that spills across midnight?
Summary tables -- A summary table should have worked quite well for your data. Please explain what data the report needs; I'll help redesign the summary table.
If the summary is all details for all calls for the month, then that is not a "summary"; the clustered PK may give you a 10x improvement.
If "summary" means daily subtotals, then let's see the details.
Numeric types -- INT takes 4 bytes; BIGINT takes 8 bytes. The "13" in bigint(13) has no meaning. If the value is really limited to 13 digits, then consider DECIMAL(13,0), which takes 6 bytes. See also MEDIUMINT and SMALLINT. DECIMAL(13,3) takes 7 bytes.
ENGINE -- Do use InnoDB.

MySQL query optimization with compound index and persistent column

The following query is being run on MariaDB 10.0.28 and takes ~17 seconds. I'm looking to speed it up substantially.
select series_id,delivery_date,delivery_he,forecast_date,forecast_he,value
from forecast where forecast_he=8
AND series_id in (12142594,20735627,632287496,1146453088,1206342447,1154376340,2095084238,2445233529,2495523920,2541234725,2904312523,3564421486)
AND delivery_date >= '2016-07-13'
AND delivery_date < '2018-06-27'
and DATEDIFF(delivery_date,forecast_date)=1
The first attempt to speed it up was to create a persistent column as (datediff(delivery_date,forecast_date)), rebuild the index using the persistent column, and modify the query, replacing the datediff calc with forecast_delivery_delta=1
> describe forecast;
+-------------------------+------------------+------+-----+---------+------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------+------------------+------+-----+---------+------------+
| series_id | int(10) unsigned | NO | PRI | 0 | |
| delivery_date | date | NO | PRI | NULL | |
| delivery_he | int(11) | NO | PRI | NULL | |
| forecast_date | date | NO | PRI | NULL | |
| forecast_he | int(11) | NO | PRI | NULL | |
| value | float | NO | | NULL | |
| forecast_delivery_delta | tinyint(4) | YES | | NULL | PERSISTENT |
+-------------------------+------------------+------+-----+---------+------------+
> show index from forecast;
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| forecast | 0 | PRIMARY | 1 | series_id | A | 35081 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 2 | delivery_date | A | 130472 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 3 | delivery_he | A | 1290223 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 4 | forecast_date | A | 2322401 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 5 | forecast_he | A | 23224016 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 1 | forecast_he | A | 29812 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 2 | series_id | A | 74198 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 3 | delivery_date | A | 774133 | NULL | NULL | | BTREE | | |
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
This seems to have taken ~2 seconds off the runtime, but I'm wondering if there are better ways to speed this up substantially. I looked into adjusting the buffer size but it seemed not to be wildly misconfigured.
>show variables like '%innodb_buffer_pool_size%';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
Total table size:
+----------+------------+
| Table | Size in MB |
+----------+------------+
| forecast | 1547.00 |
+----------+------------+
EXPLAIN:
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
| 1 | SIMPLE | forecast | range | PRIMARY,he_series_delta_date | he_series_delta_date | 11 | NULL | 832016 | Using index condition |
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
If you are going to say
AND forecast_delivery_delta=1
then the optimal index is one starting with the two = columns:
(forecast_he, forecast_delivery_delta, -- in either order
series_id, -- an IN might work ok next
delivery_date) -- finally a range
It is generally useless to put a column (delivery_date) tested via a range anywhere other than last.
But note, that index will not work very well if you say forecast_delivery_delta <= 2. Now it is a "range", and nothing after it in the index will be used for filtering. Still it may be worth it to have a small number of different indexes, just in case you turn = into a range or vice versa.
And increase innodb_buffer_pool_size to be about 70% of RAM (assuming you have over 4GB of RAM).
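The index described above might be created as follows. This is a sketch rather than a tested change: the index name he_delta_series_date is made up, and the series_id list is shortened for readability.

```sql
-- Equality columns first (either order), then the IN column, then the range.
-- The index name is hypothetical.
ALTER TABLE forecast
  ADD INDEX he_delta_series_date
    (forecast_he, forecast_delivery_delta, series_id, delivery_date);

-- The query rewritten against the persistent column:
SELECT series_id, delivery_date, delivery_he, forecast_date, forecast_he, value
FROM forecast
WHERE forecast_he = 8
  AND forecast_delivery_delta = 1
  AND series_id IN (12142594, 20735627, 632287496)  -- shortened for illustration
  AND delivery_date >= '2016-07-13'
  AND delivery_date <  '2018-06-27';

-- Current buffer pool size in MB (134217728 bytes is 128 MB,
-- well under the 1547 MB table).
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
```

As far as I know, on a MariaDB version this old the buffer pool size cannot be changed at runtime, so the new value would go into the server configuration file and take effect after a restart.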

Query optimization using any method

I need to speed up this query. What can I do?
select i.resp_id as id from int_result i, response_set rs, cx_store_child cbu
where rs.survey_id IN(5550512,5550516,5550521,5550520,5590351,5590384,5679615,5679646,5691634,5699259,5699266,5699270)
and i.q_id IN(52603091,52251250,52250724,52251333,52919541,52920117,54409178,54409806,54625102,54738933,54739117,54739221)
and rs.t >= '2017-08-30 00:00:00' and rs.t <= '2017-09-30 00:00:00'
and i.response_set_id = rs.id and rs.cx_business_unit_id = cbu.child_bu_id
and cbu.business_unit_id = 30850
group by rs.cx_business_unit_id, i.a_id, extract(day from rs.t)
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| int_result | 0 | PRIMARY | 1 | id | A | 240843099 | NULL | NULL | | BTREE | | |
| int_result | 1 | q_id | 1 | q_id | A | 1442174 | NULL | NULL | | BTREE | | |
| int_result | 1 | a_id | 1 | a_id | A | 20070258 | NULL | NULL | | BTREE | | |
| int_result | 1 | resp_id | 1 | resp_id | A | 120421549 | NULL | NULL | | BTREE | | |
| int_result | 1 | response_set_id | 1 | response_set_id | A | 26760344 | NULL | NULL | | BTREE | | |
| int_result | 1 | survey_id | 1 | survey_id | A | 503855 | NULL | NULL | YES | BTREE | | |
| int_result | 1 | survey_id_2 | 1 | survey_id | A | 1459655 | NULL | NULL | YES | BTREE | | |
| int_result | 1 | survey_id_2 | 2 | q_id | A | 2736853 | NULL | NULL | | BTREE | | |
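+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+--------------+------------+----------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+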
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| response_set | 0 | PRIMARY | 1 | id | A | 14307454 | NULL | NULL | | BTREE | | |
| response_set | 1 | survey_id | 1 | survey_id | A | 223553 | NULL | NULL | | BTREE | | |
| response_set | 1 | id | 1 | id | A | 14307454 | NULL | NULL | | BTREE | | |
| response_set | 1 | external_id | 1 | external_id | A | 2921 | NULL | NULL | YES | BTREE | | |
| response_set | 1 | panel_member_id | 1 | panel_member_id | A | 357686 | NULL | NULL | YES | BTREE | | |
| response_set | 1 | email_group | 1 | email_group | A | 21259 | NULL | NULL | YES | BTREE | | |
| response_set | 1 | survey_timestamp_idx | 1 | survey_id | A | 433559 | NULL | NULL | | BTREE | | |
| response_set | 1 | survey_timestamp_idx | 2 | t | A | 14307454 | NULL | NULL | YES | BTREE | | |
| response_set | 1 | bu_id | 1 | cx_business_unit_id | A | 2246 | NULL | NULL | YES | BTREE | | |
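+--------------+------------+----------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
+----------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+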
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cx_store_child | 0 | PRIMARY | 1 | id | A | 13667 | NULL | NULL | | BTREE | | |
| cx_store_child | 0 | bu_child_ref | 1 | business_unit_id | A | 13667 | NULL | NULL | YES | BTREE | | |
| cx_store_child | 0 | bu_child_ref | 2 | child_bu_id | A | 13667 | NULL | NULL | YES | BTREE | | |
| cx_store_child | 1 | cx_feedback_id | 1 | cx_feedback_id | A | 506 | NULL | NULL | YES | BTREE | | |
| cx_store_child | 1 | business_unit_id | 1 | business_unit_id | A | 13667 | NULL | NULL | YES | BTREE | | |
| cx_store_child | 1 | child_bu_id | 1 | child_bu_id | A | 13667 | NULL | NULL | YES | BTREE | | |
+----------------+------------+------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
You seem to have an index on response_set(survey_id, t).
Try creating a so-called compound covering index on
response_set(t, survey_id, cx_business_unit_id)
This may help optimize the part of your query using that table. Why? Your query calls for a range scan on t, and a column used in a range scan must be first in its compound index.
Similarly, an index on int_result (q_id, resp_id, response_set_id) may help extract the data you need from that table.
Some notes:
It's hard to tell what your query does. Maybe some explanation will help you get better results here?
and rs.t >= '2017-08-30 00:00:00' and rs.t <= '2017-09-30 00:00:00' is probably incorrect as to the end of the time range. It probably contains an off-by-one error. Do you want < in place of <= ? What you have given includes the records with timestamps precisely at midnight on 30-Sep-2017, but no records on that day after midnight.
You have one index on int_result(survey_id, q_id) and another on int_result(survey_id). The latter index is entirely redundant with the former and you can drop it.
You seem to have lots of single-column indexes. Pro tip: don't add such indexes unless you know you need them. They rarely help speed up arbitrary queries and always slow down insertions and updates. Why might you need them? If you have a query you know needs them, or you need to enforce uniqueness. Drop the indexes you don't need.
Use 21st-century JOIN syntax instead of the old-timey comma-join syntax as follows. It's easier to read.
from int_result i
join response_set rs on i.response_set_id = rs.id
join cx_store_child cbu on rs.cx_business_unit_id = cbu.child_bu_id
Read this. You're maintaining a large database and it's worth your time to learn a lot about indexing. http://use-the-index-luke.com/
There are many ways that the Optimizer may attempt to perform the query. The following indexes give it some flexibility to find the optimal order of hitting the tables:
cbu: INDEX(business_unit_id, child_bu_id)
rs: INDEX(t, cx_business_unit_id, survey_id)
rs: INDEX(survey_id, t, cx_business_unit_id)
rs: INDEX(cx_business_unit_id, survey_id, t)
i: INDEX(response_set_id, q_id)
i: INDEX(q_id, response_set_id)
I arranged for rs and cbu to have "covering" indexes in all cases; this helps some.
(Yes, you should change to JOIN...ON as O. Jones suggests. And the rest of his suggestions.)
Before further discussion, please provide SHOW CREATE TABLE; there could be datatype issues, too.
A PRIMARY KEY is a UNIQUE key is an INDEX -- so INDEX(id) in rs is redundant.

mysql need to run optimize every day after batch jobs

I have a table with about 4M rows. Every night, about 15 batch jobs run on the data, with a few hundred thousand inserts and updates. The problem is, when I run a simple count query such as
select count(*) from items;
I have to wait for about 15 minutes for it to return. After researching on SO, I see that
optimize table items;
does seem to fix the problem: after running it, the above query returns instantly. The problem is, it takes 17 hours to run. Any suggestions on what to look for to figure out why this is happening and how to fix it?
Thanks for any help,
Kevin
UPDATE:
Here's what happens when I optimize:
mysql> optimize table items;
+------------------------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+------------------------+----------+----------+-------------------------------------------------------------------+
| g_production.items | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| g_production.items | optimize | status | OK |
+------------------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (9 hours 20 min 48.36 sec)
Also, strangely, the select is not using the primary index, ID:
explain select count(id) from items;
+----+-------------+-------+-------+---------------+--------------------------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------------------+---------+------+----------+-------------+
| 1 | SIMPLE | items | index | NULL | index_items_on_real_sale | 2 | NULL | 45152757 | Using index |
+----+-------------+-------+-------+---------------+--------------------------+---------+------+----------+-------------+
1 row in set (0.10 sec)
And finally, here are all the indexes on the table:
+-------+------------+---------------------------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+---------------------------------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| items | 0 | PRIMARY | 1 | id | A | 47144790 | NULL | NULL | | BTREE | | |
| items | 1 | index_items_on_affiliate_id | 1 | affiliate_id | A | 47144790 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_brand_id | 1 | brand_id | A | 1024886 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_real_sale | 1 | real_sale | A | 18 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_retailer_id_and_affiliate_id | 1 | retailer_id | A | 18 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_retailer_id_and_affiliate_id | 2 | affiliate_id | A | 47144790 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_retailer_id | 1 | retailer_id | A | 40021 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_shopzilla_id | 1 | shopzilla_id | A | 457716 | NULL | NULL | YES | BTREE | | |
| items | 1 | index_items_on_updated_at | 1 | updated_at | A | 6734970 | NULL | NULL | | BTREE | | |
Note the cardinality of the index that EXPLAIN reports. I have 4M rows, but EXPLAIN says it's using index_items_on_real_sale, which the show indexes command reveals has a cardinality of 18. Could this be the problem?
It could be quite a few things, but I'm wondering if it's indexed properly. Also, try to run the query with explain, like so:
EXPLAIN SELECT a, b, c FROM items WHERE ....
Look at the output and see how many rows it's reading to process the query, the type of index it uses, etc.
Definitely need more information in order to help out, I'm just guessing based on the limited information you provided.
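For the items table in the question, gathering that information might look like the sketch below. The SHOW TABLE STATUS call is an addition here, an assumption about what would be useful to post; it reports the storage engine and an estimated row count.

```sql
-- Which index does the optimizer pick, and how many rows does it expect to read?
EXPLAIN SELECT COUNT(*) FROM items;

-- Storage engine, estimated row count, and data/index sizes for the table.
SHOW TABLE STATUS LIKE 'items';

-- Full index list with cardinality estimates.
SHOW INDEX FROM items;
```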

Simple MySQL query with performance issues

I have the following simple MySQL query:
SELECT SQL_NO_CACHE mainID
FROM tableName
WHERE otherID3=19
AND dateStartCol >= '2012-08-01'
AND dateStartCol <= '2012-08-31';
When I run this it takes 0.29 seconds to bring back 36074 results. When I increase my date period to bring back more results (65703) it runs in 0.56. When I run other similar SQL queries on the same server but on different tables (some tables are larger) the results come back in approximately 0.01 seconds.
Although 0.29 isn't slow - this is a basic part for a complex query and this timing means that it is not scalable.
See below for the table definition and indexes.
I know it's not server load as I have the same issue on a development server which has very little usage.
+---------------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------------+--------------+------+-----+---------+----------------+
| mainID | int(11) | NO | PRI | NULL | auto_increment |
| otherID1 | int(11) | NO | MUL | NULL | |
| otherID2 | int(11) | NO | MUL | NULL | |
| otherID3 | int(11) | NO | MUL | NULL | |
| keyword | varchar(200) | NO | MUL | NULL | |
| dateStartCol | date | NO | MUL | NULL | |
| timeStartCol | time | NO | MUL | NULL | |
| dateEndCol | date | NO | MUL | NULL | |
| timeEndCol | time | NO | MUL | NULL | |
| statusCode | int(1) | NO | MUL | NULL | |
| uRL | text | NO | | NULL | |
| hostname | varchar(200) | YES | MUL | NULL | |
| IPAddress | varchar(25) | YES | | NULL | |
| cookieVal | varchar(100) | NO | | NULL | |
| keywordVal | varchar(60) | NO | | NULL | |
| dateTimeCol | datetime | NO | MUL | NULL | |
+---------------------------+--------------+------+-----+---------+----------------+
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
| tableName | 0 | PRIMARY | 1 | mainID | A | 661990 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID1 | 1 | otherID1 | A | 330995 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID2 | 1 | otherID2 | A | 25 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID3 | 1 | otherID3 | A | 48 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_dateStartCol | 1 | dateStartCol | A | 187 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_timeStartCol | 1 | timeStartCol | A | 73554 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_dateEndCol | 1 | dateEndCol | A | 188 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_timeEndCol | 1 | timeEndCol | A | 73554 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_keyword | 1 | keyword | A | 82748 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_hostname | 1 | hostname | A | 2955 | NULL | NULL | YES | BTREE | |
| tableName | 1 | idx_dateTimeCol | 1 | dateTimeCol | A | 220663 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_statusCode | 1 | statusCode | A | 2 | NULL | NULL | | BTREE | |
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
Explain Output:
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | tableName | range | idx_otherID3,idx_dateStartCol | idx_dateStartCol | 3 | NULL | 66875 | 75.00 | Using where |
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
If that is really your query (and not a simplified version of same), then this ought to achieve best results:
CREATE INDEX table_ndx on tableName( otherID3, dateStartCol, mainID);
The first index entry means that the first match in the WHERE is very fast; the same also applies with dateStartCol. The third field is very small and does not slow the index appreciably, but allows for the datum you require to be found immediately in the index with no table access at all.
It is important that both columns are in the same index. In the EXPLAIN you posted, each column is in an index of its own, so even if MySQL chooses the best index, performance will not be optimal. I'd also try to use fewer indexes, for they have a cost too (shameless plug: Can Indices actually decrease SELECT performance?).
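To check whether the new index is actually picked up, one could rerun EXPLAIN after creating it. This is a sketch that assumes the index exists under the name table_ndx used above. If the optimizer uses it as a covering index, the key column should name it and the Extra column should include "Using index".

```sql
-- Same query as in the question, prefixed with EXPLAIN.
EXPLAIN
SELECT mainID
FROM tableName
WHERE otherID3 = 19
  AND dateStartCol >= '2012-08-01'
  AND dateStartCol <= '2012-08-31';
```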
First, try adding the right key. It seems that dateStartCol is more selective than otherID3:
ALTER TABLE tableName ADD KEY idx_dates (dateStartCol, otherID3);
Second, please make sure you select only the rows you need by adding a LIMIT clause to the SELECT. This should speed up the query. Try it like this:
SELECT SQL_NO_CACHE mainID
FROM tableName
WHERE otherID3=19
AND dateStartCol >= '2012-08-01'
AND dateStartCol <= '2012-08-31'
LIMIT 10;
Please also make sure that your MySQL server is tuned properly. You may want to check key_buffer_size and innodb_buffer_pool_size as described in http://astellar.com/2011/12/why-is-stock-mysql-slow/
If this is a recurrent or important query then create a multiple column index:
CREATE INDEX index_name ON tableName (otherID3, dateStartCol)
Drop the unused indexes, as they make table changes more expensive.
BTW you don't need two separate columns for date and time. You can combine them in a datetime or timestamp type. One less column and one less index.
The explain output shows it chose the dateStartCol index, so you could try the opposite of what I suggested above:
CREATE INDEX index_name ON tableName (dateStartCol, otherID3)
Notice that the query's dateStartCol condition will still get 75% of the rows so not much improvement, if any, in using that single index.
How unique is otherID3? If there are not many repeated otherID3 you can hint the engine to use it.
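A sketch of such a hint, using MySQL's index hint syntax and the existing idx_otherID3 index from the SHOW INDEX output above. Whether it helps depends on the data, so it is worth comparing timings with and without the hint.

```sql
-- USE INDEX asks the optimizer to consider only the listed index;
-- FORCE INDEX is the stronger variant if the hint is ignored.
SELECT SQL_NO_CACHE mainID
FROM tableName USE INDEX (idx_otherID3)
WHERE otherID3 = 19
  AND dateStartCol >= '2012-08-01'
  AND dateStartCol <= '2012-08-31';
```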