Background
I have a very big table that looks like this:
CREATE TABLE tb_doc (
did mediumint(8) unsigned NOT NULL auto_increment,
title varchar(80) NOT NULL default '',
...,
PRIMARY KEY (did),
KEY title (title)
)
TYPE=MyISAM;
title is a varchar(80). Most of the time it holds a purely numeric string such as '111111', '2222222' or '44444444'; sometimes it holds a UTF-8 string such as '3a', 'a4' or "中国" (Chinese characters).
I've already partitioned by HASH (did), but my SELECT statements always look like this:
SELECT did, title,... FROM tb_doc WHERE title= '1111111';
SELECT did, title,... FROM tb_doc WHERE title= '2222222';
So I want to partition by title instead, hoping it will be faster. That leads to my questions.
Experiment
I used the following statement:
PARTITION BY RANGE COLUMNS (title)(
PARTITION p00 VALUES LESS THAN (1), # not pure number strings
PARTITION p01 VALUES LESS THAN (500000), # pure number strings from 1 to 500k
PARTITION p02 VALUES LESS THAN (1000000), # pure number strings from 500k to 1000k
PARTITION p03 VALUES LESS THAN (1500000), # pure number strings from 1000k to 1500k
.......... # ......
PARTITION pn VALUES LESS THAN (25000000), # the biggest number now
)
;
Similar Questions
I read the following two Q&As:
Partitioning a database table in MySQL
How to Partitioning a table using a LIKE criteria in Mysql
but they deal with English-only data and don't work in my situation.
Questions
Is partitioning by title the better choice?
Can you give me a "utf-8" RANGE example?
I tried '500000', '1000000', ..., but they do not work.
If I use SELECT xxx FROM tb_doc WHERE title='12345', does MySQL fetch data from partition 1 only?
This table is ~50GB; how many partitions would be optimal?
Thank you in advance.
May I note that VARCHAR will have problems storing characters from multiple languages properly; better to use NVARCHAR.
HASH partitioning is used to distribute load evenly over partitions. I would say that you should first partition by something meaningful to a human (columns that often appear in the WHERE clause) and then do HASH sub-partitioning to utilise as many cores as possible at the same time. So the number of HASH sub-partitions in this case will be <= the number of cores.
I would suggest creating a clustered index on the title column. This will speed up your queries.
And in relation to your questions:
Not necessarily. It will speed up queries because of the clustered index, not the partitioning.
Use partitioning to manage the table, e.g. to delete many rows quickly.
If a good proportion of your queries look for many rows (not just one), or title is not a UNIQUE column, then you may consider partitioning.
As an example of a UTF-8 partition boundary I would say: less than ('c')
Depending on how you define the partitioning, it may hit one, several or all partitions.
There is no penalty for having many partitions, but a table in MySQL 5.5 can have up to 1024 partitions, including sub-partitions (the limit was raised to 8192 in MySQL 5.6.7).
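On question 2, one likely reason the boundaries '500000', '1000000', ... are rejected: RANGE COLUMNS requires each VALUES LESS THAN to be strictly greater than the previous one, and a string column compares character by character, not numerically. A small Python sketch shows the mismatch (Python compares by code point, which agrees with MySQL collations for plain ASCII digits but not necessarily for other characters):

```python
# Numeric-looking strings sort lexicographically, not by numeric value.
boundaries = ["500000", "1000000", "1500000", "25000000"]

def is_strictly_increasing(values):
    """True when each value is greater than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

as_strings = is_strictly_increasing(boundaries)
as_numbers = is_strictly_increasing([int(b) for b in boundaries])

print(as_strings)          # False: '500000' > '1000000' because '5' > '1'
print(as_numbers)          # True
print(sorted(boundaries))  # string order differs from numeric order
```

For the same reason a title such as '6' sorts after '500000', so numeric bands cannot be expressed directly as ranges over a string column.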
When you want to partition by a string value, use KEY partitioning as described here: 18.2.5. KEY Partitioning.
Example:
CREATE TABLE tm1 (
s1 CHAR(32) PRIMARY KEY
)
PARTITION BY KEY(s1)
PARTITIONS 10;
To begin with, set the number of partitions to the number of letters in your alphabet (or in all the alphabets you anticipate seeing in the table).
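To illustrate what KEY partitioning does with a lookup like WHERE title = '1111111': the value is hashed and only the partition the hash selects needs to be read, while a range condition on a string generally cannot be narrowed this way. The sketch below uses zlib.crc32 as a stand-in hash; MySQL's internal hashing function for KEY partitioning is different, so the partition numbers are illustrative only.

```python
import zlib

NUM_PARTITIONS = 10  # matches PARTITIONS 10 in the example above

def partition_for(value: str, partitions: int = NUM_PARTITIONS) -> int:
    """Map a string key to a partition number (stand-in hash, not MySQL's)."""
    return zlib.crc32(value.encode("utf-8")) % partitions

def partitions_for_equality(value: str) -> list:
    """An equality test on the partitioning column touches one partition."""
    return [partition_for(value)]

def partitions_for_range(low: str, high: str) -> list:
    """A range cannot be mapped through a hash, so every partition is read."""
    return list(range(NUM_PARTITIONS))

for title in ["1111111", "2222222", "3a", "中国"]:
    print(title, "->", partitions_for_equality(title))
print(len(partitions_for_range("1", "2")), "partitions for a range")
```

Reading one partition instead of ten does not by itself beat a plain index on title; the sketch only shows which partitions are involved.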
Partitioning by title, even if you could do it, will not speed up
SELECT did, title,... FROM tb_doc WHERE title= '1111111';
For a further discussion of the limitations of PARTITIONing, plus the few use cases where it helps, see my blog.
Related
If I have a table partitioned by year, how do I avoid scanning all partitions when I have to look up a row by its ID and can't use partition pruning in the lookup query?
CREATE TABLE part_table (
id bigint NOT NULL auto_increment,
moment datetime NOT NULL,
KEY (id),
KEY (moment)
)-- partitioning information (in years)
PARTITION BY RANGE( YEAR(moment) ) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pFuture VALUES LESS THAN (maxvalue) )
;
With e.g. lookup query:
SELECT * FROM part_table WHERE ID = <nr>
Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
Indexes are partitioned. Each partition is essentially a "table". It has a BTree for the data and PK, and a BTree for each secondary index.
So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
It is inefficient to pre-create future partitions (other than one).
Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:
Dropping 'old' data is much faster than DELETEing it.
WHERE something-else AND moment BETWEEN ..
Some cases
For this discussion, I am assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc).
WHERE id BETWEEN 111 and 222
Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.
WHERE id BETWEEN 111 and 222
AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`
This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to lookup by id.
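The two cases can be sketched as a toy model. It assumes the yearly partitions from the question's definition, and it adds an explicit upper bound of "today" to the moment condition, because an open-ended moment > ... also keeps every later partition in play. partitions_touched and last_month are hypothetical helpers, not MySQL functions.

```python
from datetime import date, timedelta

# Upper bounds (exclusive) of YEAR(moment) per partition, as in the question.
PARTITIONS = [("p2020", 2021), ("p2021", 2022), ("p2022", 2023),
              ("p2023", 2024), ("p2024", 2025), ("p2025", 2026),
              ("pFuture", None)]  # None stands for MAXVALUE

def partition_of(year):
    """Name of the partition that stores rows whose YEAR(moment) is `year`."""
    for name, upper in PARTITIONS:
        if upper is None or year < upper:
            return name

def partitions_touched(start=None, end=None):
    """Partitions to read; with no condition on moment, all of them."""
    if start is None or end is None:
        return [name for name, _ in PARTITIONS]
    names = []
    for year in range(start.year, end.year + 1):
        name = partition_of(year)
        if name not in names:
            names.append(name)
    return names

def last_month(today):
    """Approximate NOW() - INTERVAL 1 MONTH as 31 days for this sketch."""
    return today - timedelta(days=31)

print(partitions_touched())  # id-only lookup: all seven partitions
print(partitions_touched(last_month(date(2024, 6, 15)), date(2024, 6, 15)))  # ['p2024']
print(partitions_touched(last_month(date(2024, 1, 10)), date(2024, 1, 10)))  # ['p2023', 'p2024']
```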
Now let me discuss two flavors of an index starting with id (assuming either of the WHERE clauses above):
PRIMARY KEY(id, moment)
The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.
INDEX(id)
is not "clustered". It is a secondary index. Secondary indexes take extra steps: (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.
DROP PARTITION p2020
is much faster and less invasive than DELETE .. WHERE moment < '2021-01-01'.
More
It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.
Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries. And what datatypes they are.
In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):
If the WHERE references only one column, the PARTITIONing is probably never beneficial.
If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.
What is a good approach for handling a 3-billion-record table with very frequent concurrent reads and writes on the last few days of data?
Linux server, running MySQL v8.0.15.
I have a table that logs device data history. It needs to retain data for one year, possibly two. The growth rate is very high: 8,175,000 rec/day (1 month = 245M rec, 1 year = 2.98B rec). The table is also expected to cope if the number of devices grows.
Reads mostly hit the last few days; for data older than a week the read frequency drops significantly.
Multiple concurrent connections read and write this table, and the rows they target are close to each other, so deadlocks / table locks happen, but these have been taken care of (retries, small transactions).
I am using daily partitioning now, since reads hardly ever span more than one partition. However, there would be too many partitions to retain one year of data. Partitions are created and dropped on a schedule by cron.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Inserts are performed in batches of ~1500:
INSERT INTO table1(group_id, DeviceId, DataTime, first_result)
VALUES(%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=values(first_result);
Selects mostly get counts by DataTime or DeviceId, targeting a specific partition.
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the question:
According to Rick James's blog, it is not a good idea to have more than 50 partitions in a table, but with monthly partitions there would be 245M rec in one partition. What is the best partition range to use here? Does the advice in RJ's blog still apply to current MySQL versions?
Is it a good idea to leave the table unpartitioned? (The indexes are working well at the moment.)
Note: I have read this Stack question; having multiple tables is a pain, so unless it is necessary I would rather not split the table. Also, sharding is currently not possible.
First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc) because we need to discuss batching the input rows, even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
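A sketch of how the cron job could produce the monthly maintenance statements. It assumes the table and catch-all names from the question (table1, p_future) and a p_YYYYMM naming scheme, which is an invented convention; to_days mirrors MySQL's TO_DAYS() by offsetting Python's ordinal day number.

```python
from datetime import date

def to_days(d: date) -> int:
    """Day number as returned by MySQL's TO_DAYS() for the same date."""
    return d.toordinal() + 365

def next_month(d: date) -> date:
    """First day of the month after `d`."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def add_partition_sql(month_start: date) -> str:
    """Split the catch-all so that `month_start`'s month gets its own partition."""
    upper = to_days(next_month(month_start))
    name = f"p_{month_start:%Y%m}"
    return (
        "ALTER TABLE table1 REORGANIZE PARTITION p_future INTO ("
        f"PARTITION {name} VALUES LESS THAN ({upper}), "
        "PARTITION p_future VALUES LESS THAN MAXVALUE)"
    )

def drop_partition_sql(month_start: date) -> str:
    """Discard a whole month that has passed the retention period."""
    return f"ALTER TABLE table1 DROP PARTITION p_{month_start:%Y%m}"

print(to_days(date(2019, 11, 25)))  # 737753, the bound of p_20191124 in the question
print(add_partition_sql(date(2020, 1, 1)))
print(drop_partition_sql(date(2019, 1, 1)))
```

With one year of retention this keeps roughly a dozen partitions plus p_future, and the job runs once a month instead of once a day.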
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1 (group_id, DeviceId, DataTime, first_result)
SELECT group_id, DeviceId, DataTime, first_result
FROM tmp_table
ON DUPLICATE KEY UPDATE
last_log=NOW(), last_res=VALUES(first_result);
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
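The first step can be sketched as building one multi-row INSERT instead of 1500 single-row statements. This assumes a DB-API style driver with %s placeholders (as the question's INSERT suggests) and a temp table named tmp_table with the four columns used in the SELECT; build_bulk_insert is a hypothetical helper and no connection is opened here.

```python
def build_bulk_insert(rows):
    """Return (sql, params) for a single multi-row INSERT into tmp_table."""
    if not rows:
        raise ValueError("need at least one row")
    one_row = "(%s, %s, FROM_UNIXTIME(%s), %s)"
    sql = (
        "INSERT INTO tmp_table (group_id, DeviceId, DataTime, first_result) VALUES "
        + ", ".join([one_row] * len(rows))
    )
    # Flatten the row tuples into one parameter list, in placeholder order.
    params = [value for row in rows for value in row]
    return sql, params

# Made-up sample batch: (group_id, DeviceId, unix_time, result)
batch = [(1, f"dev{i:05d}", 1574726400 + i, 1) for i in range(1500)]
sql, params = build_bulk_insert(batch)
print(sql.count("FROM_UNIXTIME"), len(params))  # 1500 6000
```

With a typical driver the pair would be passed as cursor.execute(sql, params), followed by the single IODKU..SELECT.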
DeviceID
If the DeviceID is always 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
Just for your reference:
I have a large table right now over 30B rows, grows 11M rows daily.
The table is innodb table and is not partitioned.
Data over 7 years is archived to file and purged from the table.
So if your performance is acceptable, partition is not necessary.
From a management perspective, the table is easier to manage with partitions; you might partition the data by week. That gives 52 to 104 partitions if you keep the last one or two years of data online.
I couldn't find an example like mine, so here's the thing:
I have a big data set that I need to aggregate on top of.
We're talking about ~500M rows with a date field ranging from 2 years ago until now.
My first instinct was to partition the table by this field (creating a partition on the date field), which leaves roughly 20M rows per partition.
Then I have indexes on the other fields I will aggregate/group by.
Here's my table definition (simplified for brevity's sake):
create table t1(
date_field datetime not null,
additional_id int not null,
category_id int not null,
value_field1 double,
value_field2 double,
primary key(additional_id,date_field)
)
ENGINE=InnoDB
PARTITION BY RANGE(YEAR(date_field)*100 + MONTH(date_field)) (
PARTITION p_201411 VALUES LESS THAN (201411),
PARTITION p_201412 VALUES LESS THAN (201412),
#all the partitions until the current month...
PARTITION p_201610 VALUES LESS THAN (201610),
PARTITION p_201611 VALUES LESS THAN (201611),
PARTITION p_catchall VALUES LESS THAN MAXVALUE );
If I execute a query that gets a date directly, only the partition for the month is used, based on the output of explain partitions on top of a query such as the following one:
select value_field1 from t1 where additional_id=x and date_field='2014-11-05'
However, if I use a date range (even one inside a single partition), all partitions are scanned:
select value_field1 from t1 where additional_id=x and date_field> '2014-11-05' and date_field <'2014-11-10'
(Same result if I use between).
What am I missing here? Is this really the right way to partition this table?
Thanks in advance
Short answer: Do not use complex expressions for PARTITION BY RANGE.
Long answer: (Aside from criticizing the implementation of BY RANGE with range queries.)
Instead, do this:
PARTITION BY RANGE (TO_DAYS(date_field)) (
PARTITION p_201411 VALUES LESS THAN (TO_DAYS('2014-11-01')),
...
PARTITION p_catchall VALUES LESS THAN MAXVALUE ); -- unchanged
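The elided partition list can be generated rather than typed by hand. This sketch keeps the convention shown above (p_YYYYMM holds rows before the first day of that month) and assumes the range 2014-11 through 2016-11 from the question; month_starts and partition_clause are hypothetical helper names.

```python
from datetime import date

def month_starts(first: date, last: date):
    """Yield the first day of each month from `first` to `last`, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

def partition_clause(first: date, last: date) -> str:
    """Build the PARTITION BY RANGE clause with one partition per month."""
    lines = ["PARTITION BY RANGE (TO_DAYS(date_field)) ("]
    for start in month_starts(first, last):
        lines.append(
            f"  PARTITION p_{start:%Y%m} VALUES LESS THAN (TO_DAYS('{start:%Y-%m-%d}')),"
        )
    lines.append("  PARTITION p_catchall VALUES LESS THAN MAXVALUE );")
    return "\n".join(lines)

clause = partition_clause(date(2014, 11, 1), date(2016, 11, 1))
print(clause)
print(clause.count("TO_DAYS('"))  # 25 (the catch-all is not counted)
```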
Newer versions of MySQL have slightly more friendly expressions you can use.
If this is your typical query:
additional_id=x and date_field> '2014-11-05'
and date_field <'2014-11-10'
then partitioning is no faster than the equivalent non-partitioned table. You even have the perfect index for the non-partitioned version.
If, on the other hand, you are DROPping old partitions when they 'expire', the PARTITIONing is excellent.
25 partitions is good.
More discussion.
A side note: additional_id int is limited to 2 billion, so you are 1/4 of the way to overflowing. INT UNSIGNED would get you to 4 billion; you might consider an ALTER. (Of course, I don't know whether additional_id is unique in this table; so maybe it is not an issue.)
I want to partition a table in MySQL while preserving the table's structure.
I have a column, 'Year', based on which I want to split up the table into different tables for each year respectively. The new tables will have names like 'table_2012', 'table_2013' and so on. The resultant tables need to have all the fields exactly as in the source table.
I have tried the following two pieces of SQL script with no success:
1.
CREATE TABLE all_data_table
( column1 int default NULL,
column2 varchar(30) default NULL,
column3 date default NULL
) ENGINE=InnoDB
PARTITION BY RANGE ((year))
(
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2011) , PARTITION p2 VALUES LESS THAN (2012) ,
PARTITION p3 VALUES LESS THAN (2013), PARTITION p4 VALUES LESS THAN MAXVALUE
);
2.
ALTER TABLE all_data_table PARTITION BY RANGE COLUMNS (`year`) (
PARTITION p0 VALUES LESS THAN (2011),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Any assistance would be appreciated!
This is old, but seeing as it comes up highly ranked in partitioning searches, I figured I'd give some additional details for people who might hit this page. What you are talking about in having a table_2012 and table_2013 is not "MySQL Partitioning" but "Manual Partitioning".
Partitioning means that you have one "logical table" with a single table name, which--behind the scenes--is divided among multiple files. When you have millions to billions of rows, over years, but typically you are only searching a single month, partitioning by Year/Month can have a great performance benefit because MySQL only has to search against the file that contains the Year/Month that you are searching for...so long as you include the partition key in your WHERE.
When you create multiple tables like table_2012 and table_2013, you are MANUALLY partitioning the tables, which you don't do with the MySQL PARTITION configuration. To manually partition the tables, during 2012, you put all data into the 2012 table. When you hit 2013, you start putting all the data into the 2013 table. You have to make sure to create the table before you hit 2013 or it won't have any place to go. Then, when you query across the years (e.g. from Nov 2012 - Jan 2013), you have to do a UNION between table_2012 and table_2013.
SELECT * FROM table_2012 WHERE #...
UNION
SELECT * FROM table_2013 WHERE #...
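The routing that manual partitioning pushes onto the application can be sketched like this. The table_YYYY naming comes from the question; the created_at column and the helper names are invented for illustration. UNION ALL is used on the assumption that rows are distinct across the yearly tables, so the de-duplication pass of a plain UNION is not needed.

```python
from datetime import date

def table_for(d: date) -> str:
    """Yearly table that a row dated `d` must be written to."""
    return f"table_{d.year}"

def select_across_years(start: date, end: date) -> str:
    """Build one query that covers every yearly table in [start, end]."""
    parts = [
        f"SELECT * FROM table_{year} "
        f"WHERE created_at BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
        for year in range(start.year, end.year + 1)
    ]
    return "\nUNION ALL\n".join(parts)

print(table_for(date(2012, 11, 5)))  # table_2012
print(select_across_years(date(2012, 11, 1), date(2013, 1, 31)))
```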
With partitioning, this manual work is not necessary. You do the initial setup of the partitions, then you treat it as a single table. No unions required, no checking the date before you insert, etc. This makes life much easier. MySQL handles figuring out what tables it needs to query. However, you MUST make sure to query against the Year column or it will have to scan ALL files. E.g. SELECT * FROM all_data_table WHERE Month=12 will scan all partitions for Month=12. To ensure you are only scanning the partition files that you need to scan, you want to make sure to include the partition column in every query that you can.
Possible negatives to partitioning...if you have billions of rows and you do an ALTER TABLE on the table to--say--add a column...it's going to have to update every row taking a VERY long time. At the company I currently work for, the boss doesn't think it's worth the time it takes to update the billion rows historically when we are adding a new column for going forward...so this is one of the reasons we do manual partitioning instead of letting MySQL do it.
DISCLAIMER: I am not an expert at partitioning...so if I'm wrong in any of this, please let me know and I'll fix the incorrect parts.
From what I see you want to create many tables from one big table.
I think you should try to create views instead.
From what I have read, partitioning splits the physical storage of the table and stores the pieces separately, but from the outside you still see a single table.
I'm a complete newbie with MySQL indexes. I have several MyISAM tables on MySQL 5.0x having utf8 charsets and collations with 100k+ records each. The primary keys are generally integer. Many columns on each table may have duplicate values.
I need to quickly count, sum, average, or otherwise perform custom calculations on any number of fields in each table or joined on any number of others.
I found this page giving an overview of MySQL index usage: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html, but I'm still not sure I'm using indexes right. Just when I think I've made the perfect index out of a collection of fields I want to calculate against, I get the "index must be under 1000 bytes" error.
Can anyone explain how to most efficiently create and use indexes to speed up queries?
Caveat: upgrading MySQL is not possible in this case. Using Navicat Light for db administration, but this app isn't required.
When you create an index on a column or columns in a MySQL table, the database is creating a data structure called a B-tree (assuming you use the default index setting), for which the key of each record is a concatenation of the values in the indexed columns.
For example, let's say you have a table that is defined like:
CREATE TABLE mytable (
id int unsigned auto_increment,
column_a char(32) not null default '',
column_b int unsigned not null default 0,
column_c varchar(512),
column_d varchar(512),
PRIMARY KEY (id)
) ENGINE=MyISAM;
Then let's give it some data:
INSERT INTO mytable VALUES (1, 'hello', 2, null, null);
INSERT INTO mytable VALUES (2, 'hello', 3, 'hi', 'there');
INSERT INTO mytable VALUES (3, 'how', 4, 'are', 'you?');
INSERT INTO mytable VALUES (4, 'foo', 5, '', 'bar');
Now suppose you decide to add a key to column_a and column_b like:
ALTER TABLE mytable ADD KEY (column_a, column_b);
The database is going to create the aforementioned B-tree, which will have four keys in it, one for each row:
hello-2
hello-3
how-4
foo-5
When you perform a search that references the column_a column, or that references the column_a AND column_b columns, the database will be able to use this index to narrow the record set it has to examine. Let's say you have a query like:
SELECT ... FROM mytable WHERE column_a = 'hello';
Even though the above query does not specify a value for the column_b column, it can still take advantage of our index by looking for all keys that begin with "hello". For the same reason, if you had a query like:
SELECT ... FROM mytable WHERE column_b = '2';
This query would NOT be able to use our index, because it would have to parse the index keys themselves to try to determine which keys' second value matches '2', which is terribly inefficient.
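The leftmost-prefix behavior can be imitated with a sorted list of tuples. This is a simplification of a B-tree, but it keeps the property that matters: entries are ordered by column_a first, then column_b. (In the real index the four keys are stored in sorted order too, so foo-5 comes first.)

```python
from bisect import bisect_left

# Index entries for KEY (column_a, column_b), kept sorted like a B-tree.
index = sorted([("hello", 2), ("hello", 3), ("how", 4), ("foo", 5)])

def lookup_by_a(value):
    """Seek to the first entry with column_a == value, then read until it changes."""
    start = bisect_left(index, (value,))
    matches = []
    for entry in index[start:]:
        if entry[0] != value:
            break
        matches.append(entry)
    return matches

def lookup_by_b(value):
    """No seek is possible on column_b alone: every entry has to be examined."""
    return [entry for entry in index if entry[1] == value]

print(index)                 # [('foo', 5), ('hello', 2), ('hello', 3), ('how', 4)]
print(lookup_by_a("hello"))  # [('hello', 2), ('hello', 3)]
print(lookup_by_b(2))        # [('hello', 2)], found only by scanning all four
```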
Now, let's address your original question of the maximum length. Suppose we try to create an index spanning all four non-PK columns in this table:
ALTER TABLE mytable ADD KEY (column_a, column_b, column_c, column_d);
You will get an error:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
In this case our column lengths are 32, 10, 512, and 512, which in a single-byte-per-character situation is 1066, which is above the limit of 1000. Suppose that it DID work; you would be creating the following keys:
hello-2-
hello-3-hi-there
how-4-are-you?
foo-5--bar
Now, suppose that you had values in column_c and column_d that were very long -- 512 characters each. Even in a basic single-byte character set, your keys would now be over 1000 bytes in length, which is what MySQL is complaining about. It gets even worse with multibyte character sets, where seemingly "small" columns can still push the keys over the limit.
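The arithmetic can be written out. The sketch follows this answer's simplified accounting (declared length times bytes per character, with the integer counted as 10), not MySQL's exact internal key-length calculation; 3 bytes per character is the maximum for MySQL's utf8 charset.

```python
MYISAM_KEY_LIMIT = 1000  # bytes, from the error message above

# (column, declared length, is it a character column?)
columns = [("column_a", 32, True), ("column_b", 10, False),
           ("column_c", 512, True), ("column_d", 512, True)]

def key_bytes(cols, bytes_per_char):
    """Worst-case key size: character columns scale with the charset width."""
    return sum(length * (bytes_per_char if is_char else 1)
               for _, length, is_char in cols)

for charset, width in [("single-byte", 1), ("utf8", 3)]:
    total = key_bytes(columns, width)
    print(charset, total, "over limit" if total > MYISAM_KEY_LIMIT else "ok")

# Even column_a plus one varchar(512) is too long once utf8 is involved.
print(key_bytes(columns[:1] + columns[2:3], 3))  # 1632
```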
If you MUST use a large compound key, one solution is to use InnoDB tables rather than the default MyISAM tables; InnoDB supports a larger key length (3500 bytes internally, which MySQL itself restricts to 3072 bytes). You can do this by using ENGINE=InnoDB instead of ENGINE=MyISAM in the declaration above. However, generally speaking, if you are using long keys there is probably something wrong with your table design.
Remember that single-column indexes often provide more utility than multi-column indexes. You want to use a multi-column index when you are going to often/always take advantage of it by specifying all of the necessary criteria in your queries. Also, as others have mentioned, do NOT index every column of a table, since each index is adding storage overhead to your database. You want to limit your indexes to the columns that are frequently used by queries, and if it seems like you need too many, you should probably think about breaking your tables up into more logical components.
Indexes generally aren't well suited for custom calculations where the user is able to construct their own queries. Typically you choose the indexes to match the specific queries you intend to run, using EXPLAIN to see if the index is being used.
In the case that you have absolutely no idea what queries might be performed it is generally best to create one index per column - and not one index covering all columns.
If you have a good idea of what queries might be run often you could create an extra index for those specific queries. You can also add indexes later if your users complain that certain types of queries run too slow.
Also, indexes generally aren't that useful for calculating counts, sums and averages since these types of calculations require looking at every row.
It sounds like you are trying to put too many fields into your index. The limit is probably the number of bytes it takes to encode all the fields.
The index is used in looking up the records, so you want to choose the fields which you are "WHERE"ing on. In choosing between those fields, you want to choose the ones that will narrow the results the quickest.
As an example, a filter on Male/Female will usually not help much because you are only going to save about 50% of the time. However, a filter on State may be useful because you'll break down into many more categories. However, if almost everybody in the database is in a single state then that won't work.
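A rough way to compare candidate columns is to ask what fraction of the table an equality filter leaves behind. The data below is made up, and rows_remaining is a hypothetical helper; it only illustrates the Male/Female versus State argument, including the skewed case.

```python
from collections import Counter

def rows_remaining(values, wanted):
    """Fraction of rows that still have to be read after `column = wanted`."""
    counts = Counter(values)
    return counts[wanted] / len(values)

gender = ["M", "F"] * 500                             # 1000 rows, evenly split
state_even = [f"S{i % 50:02d}" for i in range(1000)]  # 50 states, 20 rows each
state_skewed = ["S00"] * 950 + state_even[:50]        # almost everybody in one state

print(rows_remaining(gender, "F"))          # half the table is still read
print(rows_remaining(state_even, "S07"))    # one fiftieth of the table
print(rows_remaining(state_skewed, "S00"))  # nearly the whole table
```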
Remember that indexes are for sorting and finding rows.
The error message you got sounds like it is talking about the 1000 byte Prefix Limit for MyISAM table indexes. From http://dev.mysql.com/doc/refman/5.0/en/create-index.html:
The statement shown here creates an index using the first 10 characters of the name column:
CREATE INDEX part_of_name ON customer (name(10));
If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
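Whether a prefix is long enough can be estimated by comparing the number of distinct prefixes with the number of distinct full values. The names below are invented sample data and prefix_selectivity is a hypothetical helper; in MySQL the same check is usually done with COUNT(DISTINCT LEFT(name, n)).

```python
def prefix_selectivity(values, length):
    """Distinct prefixes divided by distinct full values (1.0 means no loss)."""
    distinct_full = len(set(values))
    distinct_prefix = len({v[:length] for v in values})
    return distinct_prefix / distinct_full

names = ["Anderson, Maria", "Anderson, Mark", "Andersson, Nils",
         "Baker, Tom", "Bakerfield, Ann", "Chen, Li"]

for n in (3, 10, 14):
    print(n, round(prefix_selectivity(names, n), 2))
# 3 0.5
# 10 0.83
# 14 1.0
```

A prefix whose ratio is close to 1.0 narrows the search almost as well as the full column while keeping the key short.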
Maybe you can try a FULLTEXT index for problematic columns.