I'm running a table that has built up to 600 million rows and is rapidly growing, which has been slowing down queries that need to run as quickly as possible. Current schema is:
CREATE TABLE `user_history` (
`userId` int(11) NOT NULL,
`asin` varchar(10) COLLATE utf8_unicode_ci NOT NULL,
`dateSent` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY `userId` (`userId`,`asin`,`dateSent`),
KEY `dateSent` (`dateSent`,`asin`),
KEY `asin` (`asin`,`dateSent`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Everything I've read about partitioning suggested that this was a prime candidate for partitioning by date range. We only tend to use the last 14 days data, but the client doesn't want to delete old data. The new schema looks like:
CREATE TABLE `user_history_partitioned` (
`userId` int(11) NOT NULL,
`asin` varchar(10) COLLATE utf8_unicode_ci NOT NULL,
`dateSent` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`dateSent`,`asin`,`userId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY RANGE ( UNIX_TIMESTAMP(dateSent) ) (
PARTITION Apr2013 VALUES LESS THAN (UNIX_TIMESTAMP('2013-05-01')),
etc...
PARTITION Mar2014 VALUES LESS THAN (UNIX_TIMESTAMP('2014-04-01')),
PARTITION Apr2014 VALUES LESS THAN (UNIX_TIMESTAMP('2014-05-01')),
PARTITION May2014 VALUES LESS THAN (UNIX_TIMESTAMP('2014-06-01')),
PARTITION Future VALUES LESS THAN MAXVALUE);
The idea of the Future partition is that a REORGANIZE PARTITION run on a populated partition was taking a long time to complete, so Future will always be empty and can be reorganized into new partitions instantly. Other queries using this table have been reordered to use the primary key only, to reduce the number of indexes on the table.
The time-critical query is along the lines of:
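A minimal sketch of that reorganize step, assuming the next month to split off is June 2014 (the partition name Jun2014 is hypothetical and only follows the naming pattern above):

```sql
-- Split the empty Future partition into one new monthly partition plus a
-- fresh, empty Future partition. Future holds no rows, so nothing is copied.
ALTER TABLE user_history_partitioned
REORGANIZE PARTITION Future INTO (
    PARTITION Jun2014 VALUES LESS THAN (UNIX_TIMESTAMP('2014-07-01')),
    PARTITION Future  VALUES LESS THAN MAXVALUE
);
```

This has to run before any rows dated in the new month arrive. Otherwise those rows land in Future and the reorganize has to move data again.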
SELECT SQL_NO_CACHE *
FROM books B
WHERE (non-relevant stuff deleted)
AND NOT EXISTS
(
SELECT 1 FROM user_history H
WHERE
H.userId=$userId
AND H.asin=B.ASIN
AND dateSent > DATE_SUB(NOW(), INTERVAL 14 DAY)
)
AND (non-relevant stuff deleted)
LIMIT 1
So we're avoiding duplicates that have already been selected for the same user in the last 14 days. This currently returns in < 0.1 secs, which is okay but slower than it used to be on the current schema.
For the new schema, the inner SELECT has been reordered to:
SELECT 1 FROM user_history_partitioned H
WHERE dateSent > DATE_SUB(NOW(), INTERVAL 14 DAY)
AND H.asin=B.ASIN
AND H.userId=$userId
And it's taking 5 minutes per query, and I can't see why. The idea is that the current partition and its indexes should be in memory (or maybe the previous month's too, at some times of the month), and the primary index covers the WHERE clause. But from the time it's taking, it could be performing a full table scan on asin or userId, which is difficult to identify from EXPLAIN because it's an inner query.
What am I missing? Do I need another combined index for (asin, userId)? If so, why?
Thanks,
PS: I tried wrapping the DATE_SUB(...) as UNIX_TIMESTAMP(DATE_SUB(...)) just in case it was a type conversion issue, but it made no difference.
Related
I have two tables with 65.5 million rows:
1)
CREATE TABLE RawData1 (
cdasite varchar(45) COLLATE utf8_unicode_ci NOT NULL,
id int(20) NOT NULL DEFAULT '0',
timedate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
type int(11) NOT NULL DEFAULT '0',
status int(11) NOT NULL DEFAULT '0',
branch_id int(20) DEFAULT NULL,
branch_idString varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (id,cdasite,timedate),
KEY idx_timedate (timedate,cdasite)
) ENGINE=InnoDB;
2)
Same table with partition (call it RawData2)
PARTITION BY RANGE ( TO_DAYS(timedate))
(PARTITION p20140101 VALUES LESS THAN (735599) ENGINE = InnoDB,
PARTITION p20140401 VALUES LESS THAN (735689) ENGINE = InnoDB,
.
.
PARTITION p20201001 VALUES LESS THAN (738064) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
I'm using the same query:
SELECT count(id) FROM RawData1
where timedate BETWEEN DATE_FORMAT(date_sub(now(),INTERVAL 2 YEAR),'%Y-%m-01') AND now();
Two problems:
1. Why does the partitioned table run longer than the regular table?
2. The regular table returns 36380217 in 17.094 sec. Is that normal? All the R&D leaders think it is not fast enough; it needs to return in ~2 sec.
What do I need to check / do / change?
Is it realistic to scan 35732495 rows and retrieve 36380217 in less than 3-4 sec?
You have found one example of why PARTITIONing is not a performance panacea.
Where does id come from?
How many different values are there for cdasite? If thousands, not millions, build a table mapping cdasite <=> id and switch from a bulky VARCHAR(45) to a MEDIUMINT UNSIGNED (or whatever is appropriate). This item may help the most, but perhaps not enough.
Ditto for status, but probably using TINYINT UNSIGNED. Or think about ENUM. Either is 1 byte, not 4.
The (20) on INT(20) means nothing. You get a 4-byte integer with a limit of about 2 billion.
Are you sure there are no duplicate timedates?
branch_id and branch_idString -- this smells like a pair that needs to be in another table, leaving only the id here?
Smaller -> faster.
COUNT(*) is the same as COUNT(id) since id is NOT NULL.
Do not include future partitions before they are needed; it slows things down. (And don't use partitioning at all.)
To get that query even faster, build and maintain a Summary Table. It would have at least a DATE in the PRIMARY KEY and at least COUNT(*) as a column. Then the query would fetch from that table. More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
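A rough sketch of such a summary table, assuming one row per calendar day. The table name RawData1_daily_counts and its columns dy and cnt are hypothetical, and the refresh schedule is left open:

```sql
-- One row per day, holding the pre-computed row count for that day.
CREATE TABLE RawData1_daily_counts (
    dy  DATE         NOT NULL,
    cnt INT UNSIGNED NOT NULL,
    PRIMARY KEY (dy)
) ENGINE=InnoDB;

-- Refresh the count for one day (run per day, e.g. from a scheduled job).
INSERT INTO RawData1_daily_counts (dy, cnt)
SELECT DATE(timedate), COUNT(*)
FROM RawData1
WHERE timedate >= '2014-01-01'
  AND timedate <  '2014-01-02'
GROUP BY DATE(timedate)
ON DUPLICATE KEY UPDATE cnt = VALUES(cnt);

-- The report then sums a few hundred small rows instead of counting
-- tens of millions of raw rows.
SELECT SUM(cnt) AS total
FROM RawData1_daily_counts
WHERE dy >= DATE_FORMAT(NOW() - INTERVAL 2 YEAR, '%Y-%m-01')
  AND dy <= CURDATE();
```

Note that this counts whole days, so the result matches the original query only if the summary is kept current and no rows are dated later than NOW().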
I have a query which is getting slower and slower because there are more and more records in my table. So I'm trying to speed things up.
Database size:
Records: 1,200,000
Data 22,9 MiB
Index 46,8 MiB
Total 69,7 MiB
The purpose of the query is counting the number of records that exist that match the conditions. The conditions are a date (current date) and a status number. See query below:
SELECT
COUNT(id) AS total
FROM
order_process
WHERE
DATE(datetime) = CURDATE() AND
status = '7';
At the moment, this query is taking 800ms. And I need to run this query multiple times with different dates. These are all in the same script so script execution is going over the 3 seconds at the moment. How can I speed this up?
What have I already done:
Created indexes (Index on status and datetime both don't speed up the query).
Tested InnoDB engine (which is slower, mostly reading on this table)
For completeness, the current table setup is below.
CREATE TABLE IF NOT EXISTS `order_process` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`order_id` int(11) NOT NULL,
`status` int(11) NOT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`remark` text NOT NULL,
PRIMARY KEY (`id`),
KEY `orderid` (`order_id`),
KEY `datetime` (`datetime`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When you wrap a timestamp/datetime column in the DATE() function, MySQL cannot use an index on that column, even if one exists.
So you need to construct the query as a range on the bare column:
where
datetime >= concat(CURDATE(),' 00:00:00')
and datetime <= concat(CURDATE(),' 23:59:59')
and status = '7'
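Put together, the full query could look like the sketch below. It uses a half-open range instead of the 23:59:59 upper bound, and it assumes a composite index on (status, datetime) is added; the index name status_datetime is hypothetical and not part of the original schema:

```sql
-- Assumed composite index: equality column first, range column second.
ALTER TABLE order_process ADD INDEX status_datetime (`status`, `datetime`);

-- Count today's rows with status 7 without applying a function to the column.
SELECT COUNT(*) AS total
FROM order_process
WHERE `status` = 7
  AND `datetime` >= CURDATE()
  AND `datetime` <  CURDATE() + INTERVAL 1 DAY;
```

For another date, replace both CURDATE() calls with that date, for example '2014-05-01'.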
I have a table storing weekly viewing statistics for around 40K businesses. The table has passed 2.2M records and is starting to slow things down. I'm looking at partitioning it to speed things up, but I'm not sure how best to do it.
My ORM requires an id field as a primary key, but that field has no relevance to the data; I've been using a unique index on the fields for year, week number and business ID.
As I need the primary key to be involved in the partition map, I'm not sure how best to organise this (I've never used partitioning before).
Currently I have...
CREATE TABLE `weekly_views` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`business_id` int(11) NOT NULL,
`year` smallint(4) UNSIGNED NOT NULL,
`week` tinyint(2) UNSIGNED NOT NULL,
`hits` int(5) NOT NULL,
`created` timestamp NOT NULL ON UPDATE CURRENT_TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
`updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE `search` USING BTREE (business_id, `year`, `week`),
UNIQUE `id` USING BTREE (id, `week`)
) ENGINE=`InnoDB` AUTO_INCREMENT=2287009 DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci ROW_FORMAT=COMPACT CHECKSUM=0 DELAY_KEY_WRITE=0 PARTITION BY LIST(week) PARTITIONS 52 (PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
(5 ... 51)
PARTITION p52 VALUES IN (52) ENGINE = InnoDB);
One partition per week seemed the only logical way to break them up. Am I right that when I search for a record for the current week/business using 'business_id = xx and week = xx and year = xx' it's going to know which partition to use without searching them all? But, when I get the result and save it via the ORM, it's going to use the id field and not know which partition to use?
I guess I could use a custom query to insert or update (I haven't originally done this as the ORM doesn't support it).
Am I going the right way about this, or is there a better way to partition a table like this?
Thanks for your help!
As long as the query has the week column in the WHERE clause, MySQL will look in the correct partition. However, weeks repeat each year, so you'll end up with data from different years in the same partition.
Also, you need 53 partitions, not 52, because a year does not always fit into 52 numbered weeks and some years have a week 53.
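As a sketch of both points, assuming the table from the question (the partition name p53 and the literal values in the WHERE clause are made up for illustration):

```sql
-- Add a partition for week 53. With LIST partitioning, a row whose value
-- is not listed in any partition is rejected on insert.
ALTER TABLE weekly_views ADD PARTITION (PARTITION p53 VALUES IN (53));

-- Check pruning: the "partitions" column of the output should name only
-- the partition for the requested week.
EXPLAIN PARTITIONS
SELECT hits
FROM weekly_views
WHERE business_id = 123
  AND `year` = 2014
  AND `week` = 20;
```

On recent MySQL versions the PARTITIONS keyword is no longer needed (or accepted), and plain EXPLAIN shows the partitions column.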
I want to partition a MySQL table by a datetime column, one partition per day. The create table script looks like this:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY hash (day(ftime)) partitions 31;
But when I select the data for a given day, MySQL cannot locate the partition. The select statement is like this:
explain partitions select * from raw_log_2011_4 where day(ftime) = 30;
When I use another statement, it can locate the partition, but then I cannot select the data for a whole day.
explain partitions select * from raw_log_2011_4 where ftime = '2011-03-30';
Can anyone tell me how I could select the data for a given day and still make use of the partitions? Thanks!
Partitioning by HASH is a very bad idea with datetime columns, because it cannot use partition pruning. From the MySQL docs:
Pruning can be used only on integer columns of tables partitioned by
HASH or KEY. For example, this query on table t4 cannot use pruning
because dob is a DATE column:
SELECT * FROM t4 WHERE dob >= '2001-04-14' AND dob <= '2005-10-15';
However, if the table stores year values in an INT column, then a
query having WHERE year_col >= 2001 AND year_col <= 2005 can be
pruned.
So you can store the value of TO_DAYS(ftime) in an extra INTEGER column and partition on that column to use pruning.
Another option is to use RANGE partitioning:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY RANGE( TO_DAYS(ftime) ) (
PARTITION p20110401 VALUES LESS THAN (TO_DAYS('2011-04-02')),
PARTITION p20110402 VALUES LESS THAN (TO_DAYS('2011-04-03')),
PARTITION p20110403 VALUES LESS THAN (TO_DAYS('2011-04-04')),
PARTITION p20110404 VALUES LESS THAN (TO_DAYS('2011-04-05')),
...
PARTITION p20110426 VALUES LESS THAN (TO_DAYS('2011-04-27')),
PARTITION p20110427 VALUES LESS THAN (TO_DAYS('2011-04-28')),
PARTITION p20110428 VALUES LESS THAN (TO_DAYS('2011-04-29')),
PARTITION p20110429 VALUES LESS THAN (TO_DAYS('2011-04-30')),
PARTITION future VALUES LESS THAN MAXVALUE
);
Now the following query will only use partition p20110403:
SELECT * FROM raw_log_2011_4 WHERE ftime = '2011-04-03';
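Since ftime is a DATETIME, the equality above only matches rows stamped exactly at midnight. A sketch of selecting the whole day with a half-open range, which still leaves the bare column available for pruning:

```sql
-- All rows from 2011-04-03 00:00:00 up to, but not including, 2011-04-04.
-- The "partitions" column of the output shows which partitions are read.
EXPLAIN PARTITIONS
SELECT *
FROM raw_log_2011_4
WHERE ftime >= '2011-04-03'
  AND ftime <  '2011-04-04';
```

The exact partition list reported may include more than the one day's partition, so it is worth checking the output on the real table rather than assuming it.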
Hi. The partitioning clause in your table definition is wrong. The table definition should look like this:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY hash (TO_DAYS(ftime)) partitions 31;
And your select command would be:
explain partitions
select * from raw_log_2011_4 where TO_DAYS(ftime) = TO_DAYS('2011-03-30');
The above command would select all the data required, because TO_DAYS converts a date to a day number:
mysql> SELECT TO_DAYS(950501);
-> 728779
mysql> SELECT TO_DAYS('2007-10-07');
-> 733321
Why use TO_DAYS? Because the MySQL optimizer recognizes two date-based functions for partition pruning purposes:
1. TO_DAYS()
2. YEAR()
and this would solve your problem.
I just recently read a MySQL blog post relating to this, at http://dev.mysql.com/tech-resources/articles/mysql_55_partitioning.html.
Versions earlier than 5.5 required special gymnastics in order to do partitioning based on dates. The link above discusses it and shows examples.
Versions 5.5 and later allow you to partition directly on non-numeric values such as dates and strings.
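A minimal sketch of that direct form, using RANGE COLUMNS on a DATETIME column. The table name raw_log_columns_demo and the reduced column list are hypothetical and only meant to show the syntax:

```sql
-- RANGE COLUMNS compares the column value itself, so no TO_DAYS() wrapper
-- is needed in the partition definitions.
CREATE TABLE raw_log_columns_demo (
    id    BIGINT   NOT NULL,
    ftime DATETIME NOT NULL,
    KEY id (id)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (ftime) (
    PARTITION p20110401 VALUES LESS THAN ('2011-04-02 00:00:00'),
    PARTITION p20110402 VALUES LESS THAN ('2011-04-03 00:00:00'),
    PARTITION future    VALUES LESS THAN (MAXVALUE)
);
```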
Don't use CHAR, use VARCHAR. That will save a lot of space, hence decrease I/O, hence speed up queries. (Exception: If the column is really fixed length, then use CHAR. And it will probably be CHARACTER SET ascii.)
reporterip: (46) is unnecessarily big for an IP address, even IPv6. See my blog for further discussion, including how to shrink it to 16 bytes.
PARTITION BY RANGE(TO_DAYS(...)) as @Steyx suggested, but don't have more than about 50 partitions. The more partitions you have, the slower queries get, in spite of the "pruning". HASH partitioning is essentially useless.
More discussion of partitioning, especially the type you are looking at. That includes code for a sliding set of partitions over time.
I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. Perhaps acceptable now, but when the table swells, I'm worried about scalability. It is likely that the production system will have over 10000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
Here are some things you could try:
1. Use an INNER JOIN instead of the WHERE ... IN
SELECT ps.*
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date) AS status_date
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) x
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Store the value of NOW() in a variable. I am not sure whether the DB engine optimizes NOW() into a single call; if it doesn't, this might help a bit. (In MySQL, NOW() returns the time at which the statement began to execute, so it should not change within one query.)
These are only suggestions; you will need to compare the query plans to see whether there is any appreciable improvement.
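One way to make that comparison is to run EXPLAIN on both forms and look at the access type, key and rows columns for each. A sketch, assuming the pupil_status table from the question:

```sql
-- Plan for the original IN (subquery) form.
EXPLAIN
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
    SELECT status_pupil_id, MAX(status_date)
    FROM pupil_status
    WHERE status_date < NOW()
    GROUP BY status_pupil_id
);

-- Plan for the derived-table JOIN form.
EXPLAIN
SELECT ps.*
FROM pupil_status ps
INNER JOIN (
    SELECT status_pupil_id, MAX(status_date) AS status_date
    FROM pupil_status
    WHERE status_date < NOW()
    GROUP BY status_pupil_id
) x
    ON ps.status_pupil_id = x.status_pupil_id
   AND ps.status_date = x.status_date;
```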
Based on your usage of indexes as per the Query plan, robob's suggestion above could also come in handy
Find out how long the query takes when you load the system with 10000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.