Below are my table structure and the indexes I have created. This table has 160+ million rows.
create table test
(
client_id varchar(100),
user_id varchar(100),
ad_id varchar(100),
attr0 varchar(250),
scp_id varchar(250),
attr1 datetime null default null,
attr2 datetime null default null,
attr3 datetime null default null,
attr4 datetime null default null,
sent_date date null default null,
channel varchar(100)
)ENGINE=InnoDB;
CREATE INDEX idx_test_cid_sd ON test (client_id,sent_date);
CREATE INDEX idx_test_uid ON test (user_id);
CREATE INDEX idx_test_aid ON test (ad_id);
Below is the query I am running:
select
count(distinct user_id) as users
, count(distinct ad_id) as ads
, count(attr1) as attr1
, count(attr2) as attr2
, count(attr3) as attr3
, count(attr4) as attr4
from test
where client_id = 'abcxyz'
and sent_date >= '2017-01-01' and sent_date < '2017-02-01';
Issue: The above query takes a long time, more than 1 hour, to return the result. When I looked at the explain plan, it is using the index and scanning only about 8 million records, but the weird thing is that it still takes more than 1 hour to return the results.
Can anyone tell me what is going wrong here, or give any suggestions on the optimization part?
Shrink the table to decrease the need for I/O. This includes normalizing (where practical) and using AUTO_INCREMENT ids of a reasonable size for the various ids instead of VARCHAR. If you could explain those varchars, I could assess whether this is practical and how much benefit you might get.
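A minimal sketch of what such normalization could look like, assuming user_id values repeat often (the table and column names here are illustrative only):
CREATE TABLE users (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (user_id)
) ENGINE=InnoDB;
-- test would then store the 4-byte users.id instead of the VARCHAR(100) value,
-- shrinking both the table and every index that references it.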
Have a PRIMARY KEY. InnoDB does not like not having one. (This will not help the particular problem.) If some combination of columns is UNIQUE, then make that the PK. If not, use id INT UNSIGNED AUTO_INCREMENT; it won't run out of ids until after 4 billion rows.
Change the PRIMARY KEY to make the query run faster. (Though perhaps not faster than Simulant's "covering" index.) But it would be less bulky:
Assuming you add id .. AUTO_INCREMENT, then:
PRIMARY KEY(client_id, sent_date, id),
INDEX(id)
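As a sketch, assuming the id column does not exist yet and that client_id and sent_date contain no NULLs (PRIMARY KEY columns must be NOT NULL), the whole change might look like this; rebuilding a 160M-row table is a long, locking operation, so test it on a copy first:
ALTER TABLE test
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (client_id, sent_date, id),
    ADD INDEX (id);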
How big (GB) is the data? The indexes? You may be at the cusp of "too big to cache", and paying for more RAM may help.
Summary tables are great for COUNT, but not great for COUNT(DISTINCT ...). That is, the counts could be done in seconds. For Uniques, see my blog. Alas it is rather sketchy; ask for help. It provides rolling up COUNT(DISTINCT...) as efficiently as COUNT, but with a 1-2% error.
The gist of a Summary table: PRIMARY KEY(client_id, day) with columns for each day's counts. Then getting the values for a month is SUMming the counts for 31 days. Very fast. More on Summary Tables.
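A minimal sketch of such a summary table, assuming one row per client per day (the table and column names are hypothetical). It handles the plain COUNTs; the COUNT(DISTINCT ...) parts still need the approximate-uniques technique mentioned above:
CREATE TABLE test_daily_summary (
    client_id VARCHAR(100) NOT NULL,
    day DATE NOT NULL,
    attr1_count INT UNSIGNED NOT NULL,
    attr2_count INT UNSIGNED NOT NULL,
    attr3_count INT UNSIGNED NOT NULL,
    attr4_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (client_id, day)
) ENGINE=InnoDB;

-- A month's totals then come from summing at most 31 small rows:
SELECT SUM(attr1_count), SUM(attr2_count), SUM(attr3_count), SUM(attr4_count)
FROM test_daily_summary
WHERE client_id = 'abcxyz'
  AND day >= '2017-01-01' AND day < '2017-02-01';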
You could add a covering index containing not only the columns of the WHERE clause but also the selected columns needed for the result. That way the query can read the whole result from the index and does not have to read a single row from the table. The columns used in the WHERE clause need to stay as the first columns of the index so it can still be used for the WHERE restriction.
CREATE INDEX idx_test_cid_sd_cover_all ON test
(client_id, sent_date, user_id, ad_id, attr1, attr2, attr3, attr4);
However, this index will be bigger than your existing indexes, because nearly all the data of the table will exist as a copy in the index.
Related
We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always filter on serial_number and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application will at all times access the data based on the device first.
We have checked that there is no query that accesses the data without specifying the device serial number first.
The table for each device will be relatively small (9000 rows per day).
A few challenges that we think we will face is:
We have a lot of devices. This means that there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not limit the number of tables in a database. Will this impact performance versus keeping them in one table?
How will this impact us later on when we want to scale MySQL (e.g. using master / slave, etc.)?
Are there other alternatives / solutions for resolving this?
Update. Below is the show create table result from our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&A on stackoverflow shouting not to do such.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)
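Note that the TO_DAYS() form applies to DATE/DATETIME columns; since dt here is a TIMESTAMP, MySQL only permits UNIX_TIMESTAMP() as the partitioning function. A sketch of the purge setup might then look like this, assuming the PRIMARY KEY has first been changed to include dt (every unique key of a partitioned table must include the partitioning column):
ALTER TABLE test_udp_new
    PARTITION BY RANGE (UNIX_TIMESTAMP(dt)) (
        PARTITION p201801 VALUES LESS THAN (UNIX_TIMESTAMP('2018-02-01 00:00:00')),
        PARTITION p201802 VALUES LESS THAN (UNIX_TIMESTAMP('2018-03-01 00:00:00')),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );
-- Purging a month is then a near-instant metadata operation instead of a huge DELETE:
ALTER TABLE test_udp_new DROP PARTITION p201801;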
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id),
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefits, which makes the query with WHERE device_sn = .. AND dt BETWEEN ... fast.
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lng decimal(10,5) -- You don't need 5 digits to the left of the decimal point; latitude needs only 2 and longitude only 3. So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
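Taken together, the suggested index changes might be applied with something like this sketch (the existing `device_sn_2` index on (`dt`,`device_sn`) is kept under its old name; rebuilding a table of this size is a long operation, so test on a copy first):
ALTER TABLE test_udp_new
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (device_sn, dt, id),
    ADD INDEX (id),                                   -- keeps AUTO_INCREMENT happy
    DROP INDEX dt,                                    -- redundant with (`dt`,`device_sn`)
    DROP INDEX test_udp_new_device_sn_dt_index,       -- covered by the new PK
    DROP INDEX test_udp_new_device_sn_data_dt_index;  -- probably helped enough by the new PK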
I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows added every minute. Each row belongs to a monitor and contains a value and a time (unix timestamp).
I have three issues:
1.
Suddenly, after a number of months in dev, the table became very slow. Queries that previously finished in under a second can now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN of the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation for this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
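Applied to the table in question, that might look like this sketch (AVG is only an assumption about how the duplicates should be combined):
SELECT monitor_data_time, AVG(monitor_data_value) AS avg_value
FROM monitor_data
WHERE monitor_id = 165
  AND monitor_data_time >= 1484076760
  AND monitor_data_time <= 1487271199
GROUP BY monitor_data_time
ORDER BY monitor_data_time;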
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index on (time, id), it would do the scan over the dates fine, but it would have to skip over any entries where id != 165, and it would have to read all those rows to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
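For example, something along these lines (the index name is just illustrative):
CREATE INDEX monitor_id_time_value
    ON monitor_data (monitor_id, monitor_data_time, monitor_data_value);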
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
or, equivalently, specify PRIMARY KEY in place of the UNIQUE KEY and remove the identifier:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation for this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
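With one of the primary keys above in place, that might reduce to a sketch like:
SELECT monitor_data_time, monitor_data_value
FROM monitor_data
WHERE monitor_id = 165
  AND monitor_data_time >= 1484076760
  AND monitor_data_time <= 1487271199
ORDER BY monitor_data_time ASC;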
Also, the secondary index on monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of that index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)
I have the following table with millions of rows:
CREATE TABLE `points` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`DateNumber` int(10) unsigned DEFAULT NULL,
`Count` int(10) unsigned DEFAULT NULL,
`FPTKeyId` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`),
KEY `index3` (`FPTKeyId`,`DateNumber`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=16755134 DEFAULT CHARSET=utf8$$
As you can see, I have created indexes. I don't know whether I did it right; maybe not.
The problem is that queries execute super slowly.
Let's take a simple query:
SELECT fptkeyid, count FROM points group by fptkeyid
I can't get the result because the query is aborted by a timeout (10 min). What am I doing wrong?
Beware MySQL's stupid behaviour: GROUP BY implicitly executes an ORDER BY.
To prevent this, explicitly add ORDER BY NULL, which prevents the unnecessary ordering.
http://dev.mysql.com/doc/refman/5.0/en/select.html says:
If you use GROUP BY, output rows are sorted according to the GROUP BY
columns as if you had an ORDER BY for the same columns. To avoid the
overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
+
http://dev.mysql.com/doc/refman/5.6/en/group-by-optimization.html says:
The most important preconditions for using indexes for GROUP BY are
that all GROUP BY columns reference attributes from the same index,
and that the index stores its keys in order (for example, this is a
BTREE index and not a HASH index).
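Putting the ORDER BY NULL suggestion into the query from the question might look like this sketch (COUNT(*) is only an assumption about the intended aggregate):
SELECT FPTKeyId, COUNT(*) AS cnt
FROM points
GROUP BY FPTKeyId
ORDER BY NULL;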
Your query does not make sense:
SELECT fptkeyid, count FROM points group by fptkeyid
You group by fptkeyid, so count is not useful here. There should be an aggregate function, not a bare count column. Also, COUNT is a MySQL function, which makes it inadvisable to use the same name for a field.
Don't you need something like:
SELECT fptkeyid, SUM(`count`) FROM points group by fptkeyid
If not please explain what result you expect from the query.
I created a database with test data, half a million records, to see if I could find something similar to your issue. This is what the explain tells me:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 433756
And on the SUM query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 491781
Both queries are done on a laptop (MacBook Air) within a second; nothing takes long. Inserting, though, took some time, a few minutes to get half a million records, but retrieving and calculating does not.
We need more information to answer your question completely. Maybe the configuration of the database is wrong, for example almost no memory allocated?
I would personally start with your AUTO_INCREMENT value. You have set it to increase by 16,755,134 for each new record. Your field is set to INT UNSIGNED, which means the range of values is 0 to 4,294,967,295 (or almost 4.3 billion). This means that you would have only 256 values before the field goes beyond the data type limits, thereby compromising the purpose of the PRIMARY KEY index.
You could change the data type to BIGINT UNSIGNED and you would have a value range of 0 to 18,446,744,073,709,551,615 (or slightly more than 18.4 quintillion), which would allow you to have up to 1,100,960,700,983 (or slightly more than 1.1 trillion) unique values with this AUTO_INCREMENT value.
I would first ask if you really need your AUTO_INCREMENT value set to such a large number, and if not I would suggest changing it to 1 (or at least some lower number), as storing the field values as INT vs BIGINT will save considerable disk space in larger tables such as this. Either way, you should get a more stable PRIMARY KEY index, which should help improve queries.
I think the problem is your server bandwidth. Returning a million rows would probably need at least high-megabyte bandwidth.
I have a table with over 250 million records. Our reporting server regularly queries that table using this kind of query:
SELECT
COUNT(*),
DATE(updated_at) AS date,
COUNT(DISTINCT INT_FIELD)
FROM
TABLE_WITH_250_Million
WHERE
Field1 = 'value in CHAR'
AND field2 = 'VALUE in CHAR'
AND updated_at > '2012-04-27'
AND updated_at < '2012-04-28 00:00:00'
GROUP BY
Field2,
DATE(updated_at)
ORDER BY
date DESC
I have tried to create a BTREE index on the table including Field1, Field2, Field3 DESC in the same order, but it's not giving me the right result.
Can anyone help me optimize it? My problem is that I can't change the query, as I don't have the code from which this reporting server executes the query.
Any help would be really appreciated.
Thanks
Here's my table:
CREATE TABLE backup_jobs (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
backup_profile_id int(11) DEFAULT NULL,
state varchar(32) DEFAULT NULL,
limit int(11) DEFAULT NULL,
file_count int(11) DEFAULT NULL,
byte_count bigint(20) DEFAULT NULL,
created_at datetime DEFAULT NULL,
updated_at datetime DEFAULT NULL,
status_type varchar(32) DEFAULT NULL,
status_param_1 varchar(255) DEFAULT NULL,
status_param_2 varchar(255) DEFAULT NULL,
status_param_3 varchar(255) DEFAULT NULL,
started_at datetime DEFAULT NULL,
PRIMARY KEY (id),
KEY index_backup_jobs_on_state (state),
KEY index_backup_jobs_on_backup_profile_id (backup_profile_id),
KEY index_backup_jobs_created_at (created_at),
KEY idx_backup_jobs_state_updated_at (state,updated_at) USING BTREE,
KEY idx_backup_jobs_state_status_param_1_updated_at (state,status_param_1,updated_at) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=508748682 DEFAULT CHARSET=utf8;
Add the int_field into the index:
CREATE INDEX idx_backup_jobs_state_status_param_1_updated_at_backup_profile_id ON backup_jobs (state, status_param_1, updated_at, backup_profile_id)
to make it cover all fields.
This way, the table lookups go away (you will see Using index in the plan), which will make your query some 10x faster (your mileage may vary).
Also note that (at least for the single-date range provided) GROUP BY DATE(updated_at) and ORDER BY date DESC are redundant and will only make the query use temporary and filesort without any real purpose. Not that you can do much about it, though, if you cannot change the query.
I'm sure that all 250M rows didn't occur in the date range of interest.
The problem is that the between nature of the date check forces a table scan, because you can't know where the date falls.
I'd recommend that you partition the 250M row table by weeks, months, quarters, or years, so that only the partitions within a given date range have to be scanned. That'll help matters.
If you go down the partition road, you'll need to talk to a MySQL DBA, preferably someone who's familiar with partitioning. It's not for the faint of heart.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Per your query, you'll have to take the lead here -- smallest granularity. We have no idea what the frequency of activity is, what the Field1 and Field2 status entries are, how far back your data goes, or how many entries would be normal on a given SINGLE DATE. All that said, I would build my indexes based on the smallest granularity that closely matches your querying criteria.
Example: if your Field1 has a dozen possible CHAR values and you are applying an IN clause, and Field1 is first in your index, it will hit each char value for each date and Field2 value. 250 million records could force a lot of index paging activity, especially based on history. Likewise with your Field2. However, due to your GROUP BY clause on Field2 and DATE(updated_at), I would put one of those in the first/second position of the index. Based on historical data, I would even tend to shoot for the following index, with dates as the primary basis and the secondary criteria within that:
index ( Updated_At, Field2, Field1, INT_FIELD )
This way, your entire query can be done from the index alone and does not need to read the raw data of the actual records. All the fields are right there in the index to pull from. You have a finite date range, so updated_at is qualified right away and in order for the GROUP BY. From there, the CHAR values in Field2 nicely finish off your GROUP BY. Field1 qualifies your third criterion, the IN char list, and finally your INT_FIELD serves the COUNT(DISTINCT).
Don't know how long the index will take to build on 250 million, but that is where I would start.
I have several tables in MySQL in which chronological data is stored. I added a covering index for these tables with the date field at the end. In my queries I select data for some period using a BETWEEN operation on the date field, so my WHERE statement consists of all the fields of the covering index.
When I execute EXPLAIN, the Extra column shows "Using where" - so, as I understand it, that means the date field is not searched in the index. When I select data for a single date using "=" instead of BETWEEN, "Using where" does not appear - everything is searched in the index.
What can I do so that my whole WHERE statement, including the BETWEEN operation, is searched in the index?
UPDATE:
table structure:
CREATE TABLE phones_stat (
id_site int(10) unsigned NOT NULL,
`group` smallint(5) unsigned NOT NULL,
day date NOT NULL,
id_phone mediumint(8) unsigned NOT NULL,
sessions int(10) unsigned NOT NULL,
PRIMARY KEY (id_site, `group`, day, id_phone) USING BTREE
) ;
query:
SELECT id_phone,
SUM(sessions) AS cnt
FROM phones_stat
WHERE id_site = 25
AND `group` = 1
AND day BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY id_phone
ORDER BY cnt DESC
How many rows do you have? Sometimes an index is not used if the optimizer deems it unnecessary (for instance, if the number of rows in your table(s) is very small). Could you give us an idea of what your SQL looks like?
You could try hinting your index usage and seeing what you get in EXPLAIN, just to confirm that your index is being overlooked, e.g.
http://dev.mysql.com/doc/refman/5.1/en/optimizer-issues.html
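For example, a sketch using the table from the update (FORCE INDEX is only a diagnostic here, to see whether the plan and the timing change):
SELECT id_phone, SUM(sessions) AS cnt
FROM phones_stat FORCE INDEX (PRIMARY)
WHERE id_site = 25
  AND `group` = 1
  AND day BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY id_phone
ORDER BY cnt DESC;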
If you're GROUPing by id_phone, then a more useful index will be one which starts with that i.e.
... PRIMARY KEY (id_phone, id_site, `group`, day) USING BTREE
If you change the index to that and rerun the query, does it help?
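One way to try that suggestion (a sketch only; changing the PRIMARY KEY rebuilds the whole table, so test it on a copy first):
ALTER TABLE phones_stat
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id_phone, id_site, `group`, day) USING BTREE;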