I have a table with over 250 million records. Our reporting server regularly runs queries like the following against that table.
SELECT
COUNT(*),
DATE(updated_at) AS date,
COUNT(DISTINCT INT_FIELD)
FROM
TABLE_WITH_250_Million
WHERE
Field1 = 'value in CHAR'
AND field2 = 'VALUE in CHAR'
AND updated_at > '2012-04-27'
AND updated_at < '2012-04-28 00:00:00'
GROUP BY
Field2,
DATE(updated_at)
ORDER BY
date DESC
I have tried creating a BTREE index on the table covering Field1, Field2, Field3 DESC in that order, but it is not giving me the right result.
Can anyone help me optimize it? My problem is that I can't change the query, since I don't have access to the code the reporting server executes it from.
Any help would be really appreciated.
Thanks
Here's my table:
CREATE TABLE backup_jobs (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
backup_profile_id int(11) DEFAULT NULL,
state varchar(32) DEFAULT NULL,
`limit` int(11) DEFAULT NULL,
file_count int(11) DEFAULT NULL,
byte_count bigint(20) DEFAULT NULL,
created_at datetime DEFAULT NULL,
updated_at datetime DEFAULT NULL,
status_type varchar(32) DEFAULT NULL,
status_param_1 varchar(255) DEFAULT NULL,
status_param_2 varchar(255) DEFAULT NULL,
status_param_3 varchar(255) DEFAULT NULL,
started_at datetime DEFAULT NULL,
PRIMARY KEY (id),
KEY index_backup_jobs_on_state (state),
KEY index_backup_jobs_on_backup_profile_id (backup_profile_id),
KEY index_backup_jobs_created_at (created_at),
KEY idx_backup_jobs_state_updated_at (state,updated_at) USING BTREE,
KEY idx_backup_jobs_state_status_param_1_updated_at (state,status_param_1,updated_at) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=508748682 DEFAULT CHARSET=utf8;
Add the int_field into the index:
CREATE INDEX idx_backup_jobs_state_status_param_1_updated_at_profile_id ON backup_jobs (state, status_param_1, updated_at, backup_profile_id)
to make it cover all fields.
This way, table lookups go away (you will see Using index in the plan), which should make your query some 10x faster (your mileage may vary).
Also note that (at least for the single-date range provided) GROUP BY DATE(updated_at) and ORDER BY date DESC are redundant and will only make the query use a temporary table and a filesort to no real purpose. Not that you can do much about it, though, if you cannot change the query.
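If you can run ad-hoc SQL against that server, a quick way to confirm the new index is picked up and covering is to EXPLAIN an equivalent query and look for the new index under key and Using index under Extra. This sketch assumes Field1, Field2 and INT_FIELD correspond to state, status_param_1 and backup_profile_id, as the covering index above suggests:
EXPLAIN
SELECT COUNT(*), DATE(updated_at) AS date, COUNT(DISTINCT backup_profile_id)
FROM backup_jobs
WHERE state = 'value in CHAR'
  AND status_param_1 = 'VALUE in CHAR'
  AND updated_at > '2012-04-27'
  AND updated_at < '2012-04-28 00:00:00'
GROUP BY status_param_1, DATE(updated_at)
ORDER BY date DESC;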
I'm sure that all 250M rows didn't occur in the date range of interest.
The problem is that the between nature of the date check forces a table scan, because you can't know where the date falls.
I'd recommend that you partition the 250M-row table by week, month, quarter, or year, so that only the partitions needed for a given date range have to be scanned. That'll help matters.
If you go down the partition road, you'll need to talk to a MySQL DBA, preferably someone who's familiar with partitioning. It's not for the faint of heart.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
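For illustration only, a minimal sketch of monthly RANGE partitioning on this table. MySQL requires every unique key on a partitioned table to include the partitioning column, so the sketch also widens the primary key; the partition names and boundaries are made up, and an ALTER like this rewrites all 250M rows, so plan it with a DBA as suggested above.
ALTER TABLE backup_jobs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, updated_at);
ALTER TABLE backup_jobs
  PARTITION BY RANGE (TO_DAYS(updated_at)) (
    PARTITION p2012_03 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION p2012_04 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );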
Per your query, you'll have to take the lead here on granularity. We have no idea how frequent the activity is, what the Field1 and Field2 status values are, how far back your data goes, or how many entries would be normal on a given single date. All that said, I would build the index based on the smallest granularity that closely matches your query criteria.
For example: if Field1 has a dozen possible CHAR values, you were applying an IN clause, and Field1 were first in the index, the scan would hit each value for every date and Field2 combination; on 250 million records that could force a lot of index paging activity, especially with long history. Likewise with Field2. However, because of your GROUP BY on Field2 and DATE(updated_at), I would put one of those in the first or second position of the index. Based on the historical data, I would even tend toward the following index, with the date as the primary basis and the other criteria within it:
index ( Updated_At, Field2, Field1, INT_FIELD )
This way, your entire query can be answered from the index alone and never needs to touch the raw row data; all the fields are right there in the index. You have a finite date range, so updated_at is qualified right away and is already in order for the GROUP BY. From there, the CHAR values in Field2 neatly finish the GROUP BY, Field1 qualifies your third criterion (the IN list of CHAR values), and finally INT_FIELD serves the COUNT(DISTINCT).
I don't know how long the index will take to build on 250 million rows, but that is where I would start.
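Translated onto the backup_jobs table shown above (assuming, as my guess only, that Field1 maps to state, Field2 to status_param_1, and INT_FIELD to backup_profile_id), that suggestion would look something like:
CREATE INDEX idx_bj_updated_sp1_state_profile
    ON backup_jobs (updated_at, status_param_1, state, backup_profile_id);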
I encountered a very puzzling optimization case. I'm no SQL expert but still this case seems to defy my understanding of clustered key principles.
I have the below table schema:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`chargeQuote` tinyint(1) NOT NULL,
`features` int(11) NOT NULL,
`sequenceIndex` int(11) NOT NULL,
`createdAt` bigint(20) NOT NULL,
`previousSeqId` bigint(20) NOT NULL,
`refOrderId` bigint(20) NOT NULL,
`refSeqId` bigint(20) NOT NULL,
`seqId` bigint(20) NOT NULL,
`updatedAt` bigint(20) NOT NULL,
`userId` bigint(20) NOT NULL,
`version` bigint(20) NOT NULL,
`amount` decimal(36,18) NOT NULL,
`fee` decimal(36,18) NOT NULL,
`filledAmount` decimal(36,18) NOT NULL,
`makerFeeRate` decimal(36,18) NOT NULL,
`price` decimal(36,18) NOT NULL,
`takerFeeRate` decimal(36,18) NOT NULL,
`triggerOn` decimal(36,18) NOT NULL,
`source` varchar(32) NOT NULL,
`status` varchar(50) NOT NULL,
`symbol` varchar(32) NOT NULL,
`type` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_STATUS` (`status`) USING BTREE,
KEY `IDX_USERID_SYMBOL_STATUS_TYPE` (`userId`,`symbol`,`status`,`type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=7937243 DEFAULT CHARSET=utf8mb4;
This is a big table: 100 million rows. It's already sharded by createdAt, so 100 million rows is one month's worth of orders.
I have the below slow query. The query is pretty straightforward:
select id,chargeQuote,features,sequenceIndex,createdAt,previousSeqId,refOrderId,refSeqId,seqId,updatedAt,userId,version,amount,fee,filledAmount,makerFeeRate,price,takerFeeRate,triggerOn,source,`status`,symbol,type
from orders where 1=1
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and symbol in ( 'BTC_USDT' )
and status in ( 'FULLY_FILLED' , 'PARTIAL_CANCELLED' , 'FULLY_CANCELLED' )
and type in ( 'BUY_LIMIT' , 'BUY_MARKET' , 'SELL_LIMIT' , 'SELL_MARKET' )
order by id desc limit 0,20;
This query takes 24 seconds. The number of rows that satisfy userId=100000 is very small, around 100. And the number of rows that satisfy this entire where clause is 0.
But when I did a small tweak, that is, I changed the order by clause:
order by id desc limit 0,20; -- before
order by createdAt desc, id desc limit 0,20; -- after
It became very fast, 0.03 seconds.
I can see it made a big difference inside the MySQL engine, because EXPLAIN shows that before the change it was using key: PRIMARY, and after the change it finally uses key: IDX_USERID_SYMBOL_STATUS_TYPE, as expected, which I guess is why it is so fast. Here's the explain plan:
select_type table partitions type possible_keys key key_len ref rows filtered Extra
SIMPLE orders index IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE PRIMARY 8 20360 0.02 Using where
SIMPLE orders range IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE IDX_USERID_SYMBOL_STATUS_TYPE 542 26220 11.11 Using index condition; Using where; Using filesort
So what gives? Actually, I was very surprised that it was not naturally sorted by id (which is the PRIMARY KEY). Isn't that the clustered key in MySQL? And why did it choose not to use the index when sorting by id?
I'm very puzzled because a more demanding query (sorting by two columns) is super fast, but a more lenient one is slow.
And no, I tried ANALYZE TABLE orders; and nothing happened.
MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:
Read all qualifying rows, sort them, and pick the n top rows.
Read the rows in sorted order and stop when n qualifying rows have been found.
In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straightforward, especially for columns that are not indexed, or for columns where values are correlated. In your case, the MySQL optimizer evidently thinks that the second strategy is the best. In other words, it does not see that the WHERE clause will not be satisfied by any rows, but thinks that 2% of the rows will satisfy the WHERE clause, and that it will be able to find 20 rows by only scanning part of the table backwards in PRIMARY key order.
How the filtering effect of a WHERE clause is estimated varies quite a bit between 5.6, 5.7, and 8.0. If you are using MySQL 8.0, you can try to create histograms for the columns involved to see if that can improve the estimation. If not, I think your only option is to use a FORCE INDEX hint to make the optimizer choose the desired index.
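For example, here is a sketch of the FORCE INDEX variant of your original query (only the hint is added), followed by the MySQL 8.0 histogram statement; the 64-bucket figure is just an illustrative choice:
SELECT id, chargeQuote, features, sequenceIndex, createdAt, previousSeqId, refOrderId, refSeqId, seqId,
       updatedAt, userId, version, amount, fee, filledAmount, makerFeeRate, price, takerFeeRate,
       triggerOn, source, `status`, symbol, type
FROM orders FORCE INDEX (IDX_USERID_SYMBOL_STATUS_TYPE)
WHERE userId = 100000
  AND createdAt >= '1567775174000' AND createdAt <= '1567947974000'
  AND symbol IN ('BTC_USDT')
  AND status IN ('FULLY_FILLED', 'PARTIAL_CANCELLED', 'FULLY_CANCELLED')
  AND type IN ('BUY_LIMIT', 'BUY_MARKET', 'SELL_LIMIT', 'SELL_MARKET')
ORDER BY id DESC
LIMIT 0, 20;
-- MySQL 8.0 only: histograms on the filtered columns can improve the row estimates
ANALYZE TABLE orders UPDATE HISTOGRAM ON createdAt, symbol, status, type WITH 64 BUCKETS;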
For your fast query, the second strategy is not an option since there is no index on createdAt that can be used to avoid sorting.
Update:
Reading Rick's answer, I realized that an index on only userId should speed up your ORDER BY id query. In such an index, the entries for a given userId will be sorted on primary key. Hence, using this index will both make it possible to only access the rows of the requested userId, and access the rows in the requested sort order (by id).
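In other words, a plain single-column index may be worth trying (the index name here is my own):
ALTER TABLE orders ADD INDEX idx_orders_userid (userId);
Within that index, the entries for userId = 100000 are stored in id order, so MySQL can read the newest 20 matching rows directly, without a filesort.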
The main filters work well with the cardinality estimator. When ORDER BY is combined with LIMIT, that effectively becomes another filter, since the data has to be filtered further. This can push the cardinality estimator toward an inaccurate estimate, which eventually results in a poor plan being selected. To prove this, run the 24-second query without the LIMIT clause; it should respond about as fast as your ORDER BY trick does.
To work around this, if performance is already good with just the main filters, run that selection first, and then apply the ORDER BY and LIMIT in a second step, where the result set is already much smaller than the whole table. Use something like:
SELECT * FROM (SELECT ... main select statement ...) AS t
ORDER BY x LIMIT y;
-- or --
INSERT INTO temp SELECT ... main select statement ...;
SELECT * FROM temp ORDER BY x LIMIT y;
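Applied to the slow query above, the derived-table variant might look like the sketch below. Note that recent MySQL versions may merge a simple derived table back into the outer query, in which case the temporary-table variant is the more reliable of the two.
SELECT *
FROM (
    SELECT id, chargeQuote, features, sequenceIndex, createdAt, previousSeqId, refOrderId, refSeqId, seqId,
           updatedAt, userId, version, amount, fee, filledAmount, makerFeeRate, price, takerFeeRate,
           triggerOn, source, `status`, symbol, type
    FROM orders
    WHERE userId = 100000
      AND createdAt >= '1567775174000' AND createdAt <= '1567947974000'
      AND symbol IN ('BTC_USDT')
      AND status IN ('FULLY_FILLED', 'PARTIAL_CANCELLED', 'FULLY_CANCELLED')
      AND type IN ('BUY_LIMIT', 'BUY_MARKET', 'SELL_LIMIT', 'SELL_MARKET')
) AS t
ORDER BY id DESC
LIMIT 0, 20;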
Given
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and ... -- I am not making use of the other items
order by createdAt DESC, id desc -- I am assuming this change
limit 0,20;
I would try
INDEX(userId, createdAt, id) -- in this order
1. userId, tested with =, comes first; it narrows down the part of the index to look at.
2. Leave out the columns tested with IN: if an IN list has multiple values, we can't make use of step 4 below.
3. createdAt filters further by range.
4. createdAt and id are compared in the same direction (DESC). (Yes, I know 8.0 has an improvement, but I don't think you wanted (ASC, DESC).)
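In DDL form (the index name is made up):
ALTER TABLE orders ADD INDEX idx_userid_createdat_id (userId, createdAt, id);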
Below are my table structure and the indexes I have created. This table has 160+ million rows.
create table test
(
client_id varchar(100),
user_id varchar(100),
ad_id varchar(100),
attr0 varchar(250),
scp_id varchar(250),
attr1 datetime null default null,
attr2 datetime null default null,
attr3 datetime null default null,
attr4 datetime null default null,
sent_date date null default null,
channel varchar(100)
)ENGINE=InnoDB;
CREATE INDEX idx_test_cid_sd ON test (client_id,sent_date);
CREATE INDEX idx_test_uid ON test (user_id);
CREATE INDEX idx_test_aid ON test (ad_id);
Below is the query I am running:
select
count(distinct user_id) as users
, count(distinct ad_id) as ads
, count(attr1) as attr1
, count(attr2) as attr2
, count(attr3) as attr3
, count(attr4) as attr4
from test
where client_id = 'abcxyz'
and sent_date >= '2017-01-01' and sent_date < '2017-02-01';
Issue: The above query takes a very long time, more than 1 hour, to return the result. When I looked at the explain plan, it is using the index and scanning only 8 million records, but the weird problem is that it still takes more than an hour to return the results.
Can anyone tell me what is going wrong here, or offer any suggestions on the optimization?
Shrink the table to decrease the need for I/O. This includes normalizing (where practical) and using AUTO_INCREMENT ids of a reasonable size for the various ids instead of VARCHAR. If you could explain those varchars, I could assess whether this is practical and how much benefit you might get.
Have a PRIMARY KEY. InnoDB does not like not having one. (This will not help this particular problem.) If some combination of columns is UNIQUE, then make that the PK. If not, use id INT UNSIGNED AUTO_INCREMENT; it won't run out of ids until after 4 billion.
Change the PRIMARY KEY to make the query run faster. (Though perhaps not faster than Simulant's "covering" index.) But it would be less bulky:
Assuming you add id .. AUTO_INCREMENT, then:
PRIMARY KEY(client_id, sent_date, id),
INDEX(id)
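A sketch of that change on the test table, assuming client_id and sent_date contain no NULLs (primary key columns must be NOT NULL) and keeping in mind that rebuilding a 160M-row table this way takes a long time and plenty of disk:
ALTER TABLE test
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (client_id, sent_date, id),
    ADD INDEX (id);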
How big (GB) is the data? The indexes? You may be at the cusp of "too big to cache", and paying for more RAM may help.
Summary tables are great for COUNT, but not so great for COUNT(DISTINCT ...). That is, the plain counts could be done in seconds. For the unique counts, see my blog. Alas, it is rather sketchy; ask for help. It provides a way to roll up COUNT(DISTINCT ...) as efficiently as COUNT, but with a 1-2% error.
The gist of a summary table: PRIMARY KEY(client_id, day), with columns for each day's counts. Getting the values for a month is then a matter of SUMming the counts over up to 31 days. Very fast. More on Summary Tables.
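A minimal sketch of such a summary table for the plain counts; the table name, column names, and maintenance schedule are all my own assumptions, and the COUNT(DISTINCT ...) columns are deliberately left out, per the caveat above.
CREATE TABLE test_daily_counts (
    client_id varchar(100) NOT NULL,
    day date NOT NULL,
    attr1_cnt int unsigned NOT NULL,
    attr2_cnt int unsigned NOT NULL,
    attr3_cnt int unsigned NOT NULL,
    attr4_cnt int unsigned NOT NULL,
    PRIMARY KEY (client_id, day)
) ENGINE=InnoDB;
-- nightly maintenance: roll up yesterday's rows into one row per client
INSERT INTO test_daily_counts
SELECT client_id, sent_date,
       COUNT(attr1), COUNT(attr2), COUNT(attr3), COUNT(attr4)
FROM test
WHERE sent_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY client_id, sent_date;
-- reporting: a month is then a SUM over at most 31 summary rows
SELECT SUM(attr1_cnt), SUM(attr2_cnt), SUM(attr3_cnt), SUM(attr4_cnt)
FROM test_daily_counts
WHERE client_id = 'abcxyz'
  AND day >= '2017-01-01' AND day < '2017-02-01';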
You could add a covering index containing not only the columns of the WHERE clause but also the selected columns of the result. That way the query can read everything it needs from the index and does not have to read a single table row. The columns used in the WHERE clause need to stay as the first columns of the index so it can still be used for the WHERE restriction.
CREATE INDEX idx_test_cid_sd_cover_all ON test
(client_id, sent_date, user_id, ad_id, attr1, attr2, attr3, attr4);
However, this index will be bigger than your existing indexes, because nearly all the data of the table will exist as a copy inside the index.
Ok, I have the following MySQL table structure:
CREATE TABLE `creditlog` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`memberId` int(10) unsigned NOT NULL,
`quantity` decimal(10,2) unsigned DEFAULT NULL,
`timeAdded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`reference` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `memberId` (`memberId`),
KEY `timeAdded` (`timeAdded`));
And I'm querying it like this:
SELECT SUM(quantity) FROM creditlog where timeAdded>'2016-09-01' AND timeAdded<'2016-10-01' AND memberId IN (3,6,8,9,11)
Now, I also use USE INDEX (timeAdded) because, given the number of entries, it is more convenient. Explaining the above query shows:
type -> range,
key -> timeAdded,
rows -> 921294
extra -> using where
Meanwhile if I use the memberId INDEX it shows:
type -> range,
key -> memberId,
rows -> 1707849
extra -> using where
Now, my question is: is it possible to combine these 2 indexes somehow so they are used together and reduce the surface of the query, since I'll also need to add more conditions (on other columns)?
MySQL almost never uses two indexes in a single query; it is just not cost effective. However, composite indexes are often very efficient. You need this order: INDEX(memberId, timeAdded).
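A sketch of that composite index (the index name is mine):
ALTER TABLE creditlog ADD INDEX idx_member_time (memberId, timeAdded);
With it, MySQL can jump to each memberId in the IN list and range-scan timeAdded within it, rather than range-scanning one big single-column index and re-checking the other condition row by row.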
Build the index this way...
1. First include column(s) that are in the WHERE clause tested with =. (None, in your case.)
2. Then any column(s) tested with IN.
3. Then one 'range', such as <, BETWEEN, etc.
4. Then move on to all the fields of the GROUP BY or ORDER BY. (Not relevant here.)
There are a lot of exceptions and caveats. Some are given in my cookbook.
(Contrary to popular opinion, cardinality is almost never relevant in designing an index.)
Here is a way to compare two indexes (even with a table that is too small to get reliable timings):
FLUSH STATUS;
SELECT SQL_NO_CACHE ...;
SHOW SESSION STATUS LIKE 'Handler%';
(repeat for other query/index)
Smaller numbers almost always indicate better.
"timeAdded>'2016-09-01' AND timeAdded<'2016-10-01'" -- That excludes midnight on the first day. I recommend this pattern:
timeAdded >= '2016-09-01'
AND timeAdded < '2016-09-01' + INTERVAL 1 MONTH
That also avoids computing dates.
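Applied to your query, the pattern would look like this (same member list, September 2016):
SELECT SUM(quantity)
FROM creditlog
WHERE memberId IN (3, 6, 8, 9, 11)
  AND timeAdded >= '2016-09-01'
  AND timeAdded <  '2016-09-01' + INTERVAL 1 MONTH;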
That smells like a common query. Have you considered building and maintaining summary tables? The equivalent query would probably run 10 times as fast.
Here is a question that I should be able to answer myself, but I can't, and I also can't find an answer on Google:
I have a table that contains 5 million rows with this structure:
CREATE TABLE IF NOT EXISTS `files_history2` (
`FILES_ID` int(10) unsigned DEFAULT NULL,
`DATE_FROM` date DEFAULT NULL,
`DATE_TO` date DEFAULT NULL,
`CAMPAIGN_ID` int(10) unsigned DEFAULT NULL,
`CAMPAIGN_STATUS_ID` int(10) unsigned DEFAULT NULL,
`ON_HOLD` decimal(1,0) DEFAULT NULL,
`DIVISION_ID` int(11) DEFAULT NULL,
KEY `DATE_FROM` (`DATE_FROM`),
KEY `FILES_ID` (`FILES_ID`),
KEY `CAMPAIGN_ID` (`CAMPAIGN_ID`),
KEY `CAMP_DATE` (`CAMPAIGN_ID`,`DATE_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When I execute
SELECT files_id, min( date_from )
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
the query sits in the "Sending data" state for more than eight hours (at which point I killed the process).
Here the explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE files_history2 ALL CAMPAIGN_ID,CAMP_DATE NULL NULL NULL 5073254 Using where; Using temporary; Using filesort
I assume that I have created the necessary keys, but then the query shouldn't take that long, should it?
I would suggest a different index... Index on (Files_ID, Date_From, Campaign_ID)...
Since your GROUP BY is on Files_ID, you want those grouped first. Then the MIN(Date_From), so that goes in second position... Then, finally, Campaign_ID to qualify the NOT NULL test, and here's why...
If you put the campaign IDs first, great, you get all the NULLs out of the way... but if you have 1,000 campaigns, and each Files_ID spans MANY campaigns, which in turn span many dates, you are going to choke.
With the index I'm proposing, with Files_ID first, you have each files_id already ordered to match your GROUP BY. Then, within that, all the earliest dates are at the top of the indexed list... great, almost there... then comes the Campaign_ID: skip over whatever NULLs may be there and you are done, on to the next Files_ID.
Hope this makes sense -- unless you have tons of entries with NULL campaigns.
Also, since all 3 parts of the index match the criteria and output columns of your query, it never has to go back to the raw data file for the data; it gets it all directly from the index.
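In DDL form (the index name is illustrative):
CREATE INDEX idx_fh2_files_date_campaign
    ON files_history2 (FILES_ID, DATE_FROM, CAMPAIGN_ID);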
I'd create a covering index (CAMPAIGN_ID, files_id, date_from) and check the performance. I suspect your issue is due to the grouping and date_from not being able to use the same index.
CREATE INDEX your_index_name ON files_history2 (CAMPAIGN_ID, files_id, date_from);
If this works, you could drop the single-column index on CAMPAIGN_ID, as it is covered by the composite index.
Well, the query is slow due to the aggregation (the MIN function) combined with the grouping.
One solution is to alter your query so that the aggregation is done in a derived table in the FROM clause, which can be a lot faster than the approach you are using.
Try the following:
SELECT f.files_id, f1.datefrom
FROM files_history2 AS f
JOIN (
    SELECT files_id, MIN(date_from) AS datefrom
    FROM files_history2
    WHERE campaign_id IS NOT NULL
    GROUP BY files_id
) AS f1 ON f.files_id = f1.files_id AND f.date_from = f1.datefrom;
This should perform a lot better; if it doesn't, a temporary table would be the only choice to go with.
I have several MySQL tables in which chronological data is stored. I added a covering index to these tables with the date field at the end. In my queries I select data for some period using a BETWEEN condition on the date field, so my WHERE clause consists of all the fields of the covering index.
When I run EXPLAIN, the Extra column shows "Using where" - so, as I understand it, that means the date field is not searched in the index. When I select data for a single date, I use "=" instead of BETWEEN, "Using where" does not appear, and everything is searched in the index.
What can I do so that my whole WHERE clause, including the BETWEEN condition, is resolved through the index?
UPDATE:
table structure:
CREATE TABLE phones_stat (
id_site int(10) unsigned NOT NULL,
`group` smallint(5) unsigned NOT NULL,
day date NOT NULL,
id_phone mediumint(8) unsigned NOT NULL,
sessions int(10) unsigned NOT NULL,
PRIMARY KEY (id_site,`group`,day,id_phone) USING BTREE
) ;
query:
SELECT id_phone,
SUM(sessions) AS cnt
FROM phones_stat
WHERE id_site = 25
AND `group` = 1
AND day BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY id_phone
ORDER BY cnt DESC
How many rows do you have? Sometimes an index is not used if the optimizer deems it unnecessary (for instance, if the number of rows in your table(s) is very small). Could you give us an idea of what your SQL looks like?
You could try hinting your index usage and seeing what you get in EXPLAIN, just to confirm that your index is being overlooked, e.g.
http://dev.mysql.com/doc/refman/5.1/en/optimizer-issues.html
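For instance, a sketch of forcing the only composite index you have (the primary key), just to compare the EXPLAIN output with and without the hint:
EXPLAIN
SELECT id_phone, SUM(sessions) AS cnt
FROM phones_stat FORCE INDEX (PRIMARY)
WHERE id_site = 25
  AND `group` = 1
  AND day BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY id_phone
ORDER BY cnt DESC;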
If you're GROUPing by id_phone, then a more useful index will be one which starts with that i.e.
... PRIMARY KEY (id_phone, id_site, `group`, day) USING BTREE
If you change the index to that and rerun the query, does it help?
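A sketch of trying that suggestion; replacing the primary key rebuilds the whole table, so the second, less invasive form (a secondary index, with a name I made up) may be the safer experiment:
ALTER TABLE phones_stat
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id_phone, id_site, `group`, day) USING BTREE;
-- less invasive alternative: keep the existing primary key and add a secondary index instead
ALTER TABLE phones_stat
    ADD INDEX idx_phone_site_group_day (id_phone, id_site, `group`, day);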