MySQL select query optimization of partitioned table in non-cluster environment

I have a select query on a partitioned table with 123 million records which takes more than 10 minutes to fetch data. My query looks like: select * from tableName where column1='1.1.1.1' order by timestamp desc;
Table is already indexed on column1.
Any help appreciated.
(From comments)
CREATE TABLE mytable (
column1 varchar(256) NOT NULL,
column2 varchar(100) NOT NULL,
column3 smallint(5) unsigned NOT NULL,
column4 smallint(5) unsigned NOT NULL,
timestamp bigint(20) unsigned NOT NULL,
KEY mytable_idx (column2,timestamp,column3,column4),
KEY ip_addr_index (column1),
KEY ts_idx (timestamp)
) /*!50100 PARTITION BY RANGE (`timestamp`)
(PARTITION p1498800000 VALUES LESS THAN (1498800000) ENGINE = InnoDB,
PARTITION p1500000000 VALUES LESS THAN (1500000000) ENGINE = InnoDB,
PARTITION p1501200000 VALUES LESS THAN (1501200000) ENGINE = InnoDB,
PARTITION p1502400000 VALUES LESS THAN (1502400000) ENGINE = InnoDB,
PARTITION p1503600000 VALUES LESS THAN (1503600000) ENGINE = InnoDB,
PARTITION p1504800000 VALUES LESS THAN (1504800000) ENGINE = InnoDB,
PARTITION p1506000000 VALUES LESS THAN (1506000000) ENGINE = InnoDB
) */

For this query:
select *
from tableName
where column1 = '1.1.1.1'
order by timestamp desc;
You want an index on (column1, timestamp desc). Note: The desc may be ignored in earlier versions of MySQL.
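For example (the index name is hypothetical; MySQL 8.0 honors the DESC, while 5.x parses and ignores it but can still scan the index backward):
CREATE INDEX idx_column1_ts ON tableName (column1, `timestamp` DESC);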

PARTITIONing does not intrinsically provide speed. Please provide SHOW CREATE TABLE so we can discuss whether partitioning actually hurts performance in your case.
INDEX(column1, timestamp) -- In this order
is optimal whether the table is partitioned or not. In particular, that index works just as well for a non-partitioned table. (Gordon's comment about DESC has no impact on performance, whether old or new version.)
With 123 million rows, you should keep an eye on datatypes. If you have
column1 VARCHAR(15) CHARACTER SET utf8
then that ipv4_address can be improved from up-to-17 bytes to exactly 4:
BINARY(4)
with suitable conversions on INSERT and SELECT. Making that change would also allow for CIDR and other range tests, which are not possible with VARCHAR. Will you need to handle IPv6? I discuss that here.
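A minimal sketch of those conversions, assuming MySQL 5.6.3+ where INET6_ATON()/INET6_NTOA() are available (they use a 4-byte binary for an IPv4 address):
-- One-time change (existing rows would need converting first, e.g. via a new column):
ALTER TABLE mytable MODIFY column1 BINARY(4) NOT NULL;
-- INSERT side: INSERT ... VALUES (INET6_ATON('1.1.1.1'), ...);
-- SELECT side:
SELECT INET6_NTOA(column1) AS ip
FROM mytable
WHERE column1 = INET6_ATON('1.1.1.1');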
How many rows match 1.1.1.1? Are there any TEXT columns? What is the PRIMARY KEY? Which Engine? Each of those questions may have an impact on the "10 minutes".
It is important to understand when a "composite" index is better than a single-column index. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
After the CREATE TABLE was added:
Replace this
KEY ip_addr_index (column1)
with
KEY ip_addr_index (column1, timestamp)
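A sketch of that change as a single ALTER:
ALTER TABLE mytable
DROP KEY ip_addr_index,
ADD KEY ip_addr_index (column1, `timestamp`);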
Don't create more than one future partition before it is needed. Always have a LESS THAN (MAXVALUE) partition just in case.
IPv4 can live with VARCHAR(15); IPv6 fits in VARCHAR(39) or BINARY(16) after packing.
For that one query, 7 queries must be done (one per partition); the results are put together, then sorted. Without partitioning, it becomes one query with no sort (since the index is already sorted). So, I believe, partitioning slows that query down.
When discussing performance on 123M rows, I need to see all the main queries in one sitting in order to advise. Optimizing for one query is all too likely to de-optimize some other.
There seems to be no reason to use BIGINT for TIMESTAMP. INT UNSIGNED would save 4 bytes per row of data, plus more for the indexes. Perhaps a total savings of 2GB of disk space. That translates into some speedup for some queries.
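A sketch of that change (epoch seconds fit in INT UNSIGNED until the year 2106; expect a long rebuild on 123M rows):
ALTER TABLE mytable MODIFY `timestamp` INT UNSIGNED NOT NULL;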
If timestamp is always used in a "range", then this index (column2,timestamp,column3,column4) is probably in an inefficient order. Please provide the query that benefits from this index so I can further elaborate.

Related

mariadb (mysql) sub partition error (total sub partition count exceeds 64)

Hello
I want to configure monthly partitions with day-by-day subpartitions.
If the total number of subpartitions exceeds 64, the table is not created; it fails with
'(errno: 168 "Unknown (generic) error from engine")'
(Creating fewer than 64 succeeds.)
I know that the maximum number of partitions (including subpartitions) is 8,192, so is there anything I missed?
Below is the log table.
create table detection_log
(
id bigint auto_increment,
detected_time datetime default '1970-01-01' not null,
malware_title varchar(255) null,
malware_category varchar(30) null,
user_name varchar(30) null,
department_path varchar(255) null,
PRIMARY KEY (detected_time, id),
INDEX `detection_log_id_uindex` (id),
INDEX `detection_log_malware_title_index` (malware_title),
INDEX `detection_log_malware_category_index` (malware_category),
INDEX `detection_log_user_name_index` (user_name),
INDEX `detection_log_department_path_index` (department_path)
);
SUBPARTITIONs provide no benefit that I know of.
HASH partitioning either provides no benefit or hurts performance.
So... Explain what you hoped to gain by partitioning; then we can discuss whether any type of partitioning is worth doing. Also, provide the likely SELECTs so we can discuss the optimal INDEXes. If you need a "two-dimensional" index, that might indicate a need for partitioning (but still not subpartitioning).
More
I see PRIMARY KEY(detected_time,id). This provides a very fast way to do
SELECT ...
WHERE detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
In fact, it will probably be faster than if you also partition the table. (As a general rule it is useless to partition on the first part of the PK.)
If you need to do
SELECT ...
WHERE user_name = 'param'
AND detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
Then this is optimal:
INDEX(user_name, detected_time, id)
Again, probably faster than any form of partitioning on any column(s).
And
A "point query" (WHERE key = 123) takes a few milliseconds more in a 1-billion-row table compared to a 1000-row table. Rarely is the difference important. The depth of the BTree (perhaps 5 levels vs 2 levels) is the main difference. If you PARTITION the table, you are removing perhaps 1 or 2 levels of the BTree, but replacing them with code to "prune" down to the desired partition. I claim that this tradeoff does not provide a performance benefit.
A "range query" is very nearly the same speed regardless of the table size. This is because the structure is actually a B+Tree, so it is very efficient to fetch the 'next' row.
Hence, the main goal in optimizing queries on a huge table is to take advantage of the characteristics of the B+Tree.
Pagination
SELECT log.detected_time, log.user_name, log.department_path,
log.malware_category, log.malware_title
FROM detection_log as log
JOIN
(
SELECT id
FROM detection_log
WHERE user_name = 'param'
ORDER BY detected_time DESC
LIMIT 25 OFFSET 1000
) as temp ON temp.id = log.id;
The good part: Finding ids, then fetching the data.
The slow part: Using OFFSET.
Have this composite index: INDEX(user_name, detected_time, id) in that order. Make another index for when you use department_path.
Instead of OFFSET, "remember where you left off". A blog specifically about that: http://mysql.rjweb.org/doc.php/pagination
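A sketch of that technique, sometimes called keyset pagination (the literal values stand in for the last row shown on the previous page):
SELECT id, detected_time, user_name, department_path,
       malware_category, malware_title
FROM detection_log
WHERE user_name = 'param'
  AND ( detected_time < '2021-06-01 12:34:56'
        OR (detected_time = '2021-06-01 12:34:56' AND id < 98765) )
ORDER BY detected_time DESC, id DESC
LIMIT 25;
The composite INDEX(user_name, detected_time, id) above lets this start exactly where the previous page ended, regardless of how deep into the results you are.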
Purging
Deleting after a year is an excellent use of PARTITIONing. Use PARTITION BY RANGE(TO_DAYS(detected_time)) and have either ~55 weekly or 15 monthly partitions. See http://mysql.rjweb.org/doc.php/partitionmaint for details. DROP PARTITION is immensely faster than DELETE. (This partitioning will not speed up SELECT.)
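A sketch of that maintenance scheme (partition names and dates assumed):
ALTER TABLE detection_log
PARTITION BY RANGE (TO_DAYS(detected_time)) (
PARTITION p202101 VALUES LESS THAN (TO_DAYS('2021-02-01')),
PARTITION p202102 VALUES LESS THAN (TO_DAYS('2021-03-01')),
-- ...one partition per month...
PARTITION pFuture VALUES LESS THAN MAXVALUE
);
-- Monthly purge; immensely faster than DELETE:
ALTER TABLE detection_log DROP PARTITION p202101;
This works because detected_time is part of the PRIMARY KEY, satisfying the rule that the partitioning column must appear in every unique key.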

MySQL : optimize partitioning to speed up requests [duplicate]

I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually considering my partitioned by column is a datetime?
As explained by the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is easily possible by hash partitioning of the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this only partitions by month and not by year; also, there are only 6 partitions (so 6 months) in this example.
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions (explicit partition selection takes partition names, not expressions):
SELECT * from ti PARTITION (p0);
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the per-row cost of the partitioning expression when partitioning by hash.
As the docs say:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (p0);
then start the PRIMARY KEY with tr_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE tr_date >= '2019-03-01'
AND tr_date < '2019-03-01' + INTERVAL 1 MONTH;
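A sketch of that layout for the example table (an id column is assumed so the PRIMARY KEY can be unique):
CREATE TABLE ti (
id INT NOT NULL AUTO_INCREMENT,
amount DECIMAL(7,2),
tr_date DATE NOT NULL,
PRIMARY KEY (tr_date, id),  -- clustered by date: the month query above becomes one sequential scan
KEY (id)                    -- AUTO_INCREMENT must lead some index
) ENGINE=InnoDB;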
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)

partitioning mysql table with 3b records per year

What is a good approach to handling a 3-billion-record table with very frequent concurrent reads/writes concentrated in the last few days?
Linux server, running MySQL v8.0.15.
I have this table that will log device data history. The table needs to retain its data for one year, possibly two. The growth rate is very high: 8,175,000 records/day (1 month = 245M records, 1 year = 2.98B records). If the number of devices grows, the table is expected to handle that too.
Reads are frequent for data from the last few days; past a week or so, the read frequency drops significantly.
There are many concurrent connections reading and writing this table, and the reads and writes target rows close to each other, so deadlocks / lock waits happen, but they have been taken care of (retries, small transaction sizes).
I am using daily partitioning now, since a read rarely spans more than one partition. However, that is too many partitions for retaining one year of data. Partitions are created and dropped on a cron schedule.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Inserts are performed in batches of ~1500 rows:
INSERT INTO table1 (group_id, DeviceId, DataTime, first_res)
VALUES (%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=VALUES(first_res);
Selects mostly get counts by DataTime or DeviceId, targeting a specific partition:
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the questions:
According to Rick James's blog, it is not a good idea to have >50 partitions in a table; but with monthly partitions there would be 245M records per partition. What is the best partition range to use here? Does that advice still hold for current MySQL versions?
Is it a good idea to leave the table not partitioned? (The indexes are performing well at the moment.)
Note: I have read this Stack question; having multiple tables is a pain, so if it is not necessary I would rather not split the table. Also, sharding is currently not possible.
First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc) because we need to discuss batching the input rows, even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1 (group_id, DeviceId, DataTime, first_res)
SELECT group_id, DeviceId, DataTime, first_res
FROM tmp_table
ON DUPLICATE KEY UPDATE
last_log = NOW(), last_res = VALUES(first_res);
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
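A sketch of the staging step (tmp_table and the sample values are hypothetical):
CREATE TEMPORARY TABLE tmp_table (
group_id TINYINT NOT NULL,
DeviceId VARCHAR(10) NOT NULL,
DataTime DATETIME NOT NULL,
first_res TINYINT(1) NOT NULL DEFAULT 0
);
-- One multi-row INSERT for the whole batch:
INSERT INTO tmp_table (group_id, DeviceId, DataTime, first_res)
VALUES (1, 'DEV0000001', FROM_UNIXTIME(1574700000), 0),
       (1, 'DEV0000002', FROM_UNIXTIME(1574700001), 1);  -- ...and so on, up to ~1500 rows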
DeviceID
If the DeviceId is always 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
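For example (assumes every DeviceId really is 10 ASCII letters/digits; swap in ascii_general_ci if case folding is wanted):
ALTER TABLE table1
MODIFY DeviceId CHAR(10) CHARACTER SET ascii COLLATE ascii_bin NOT NULL;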
Just for your reference:
I have a large table right now with over 30B rows; it grows by 11M rows daily.
The table is innodb table and is not partitioned.
Data over 7 years is archived to file and purged from the table.
So if your performance is acceptable, partitioning is not necessary.
From a management perspective, though, it is easier to manage the table with partitions; you might partition the data by week. That is 52-104 partitions if you keep the last one or two years of data online.

mySQL query optimisation for browse tracker

I have been reading lots of great answers to different problems over time on this site, but this is the first time I am posting. So, thanks in advance for your help.
Here is my question:
I have a MySQL table that tracks visits to different websites we have. This is the table structure:
create table navigation_base (
uid int(11) NOT NULL,
date datetime not null,
dia date not null,
ip int(4) unsigned not null default 0,
session_id int unsigned not null,
cliente smallint unsigned not null default 0,
campaign mediumint unsigned not null default 0,
trackcookie int unsigned not null,
adgroup int unsigned not null default 0,
PRIMARY KEY (uid)
) ENGINE=MyISAM;
This table has approx. 70 million rows (an average of 110,000 per day).
On that table we have created indexes with following commands:
alter table navigation_base add index dia_cliente_campaign_ip (dia,cliente,campaign,ip);
alter table navigation_base add index dia_cliente_campaign_ip_session (dia,cliente,campaign,ip,session_id);
alter table navigation_base add index dia_cliente_campaign_ip_session_trackcookie (dia,cliente,campaign,ip,session_id,trackcookie);
We then use this table to get visitor statistics grouped by clients, days and campaigns with the following query:
select
dia,
navigation_base.campaign,
navigation_base.cliente,
count(distinct ip) as visitas,
count(ip) as paginas_vistas,
count(distinct session_id) as sesiones,
count(distinct trackcookie) as cookies
from navigation_base where
(dia between '2017-01-01' and '2017-01-31')
group by dia,cliente,campaign order by NULL
Even with those indexes created, response times for a one-month period are relatively slow: about 3 seconds on our server.
Are there some ways of speeding up these queries?
Thanks in advance.
With this much data, indexing alone may not be all that helpful, since there is a lot of similarity in the data. Besides, you have GROUP BY and aggregation on top of that. All these things combined make optimization very hard. Partitioning is the way forward, because:
Some queries can be greatly optimized in virtue of the fact that data
satisfying a given WHERE clause can be stored only on one or more
partitions, which automatically excludes any remaining partitions from
the search. Because partitions can be altered after a partitioned
table has been created, you can reorganize your data to enhance
frequent queries that may not have been often used when the
partitioning scheme was first set up.
And if this doesn't work for you, it's still possible to use explicit partition selection:
In addition, MySQL 5.7 supports explicit partition selection for
queries. For example, SELECT * FROM t PARTITION (p0,p1) WHERE c < 5
selects only those rows in partitions p0 and p1 that match the WHERE
condition.
ALTER TABLE navigation_base
PARTITION BY RANGE (TO_DAYS(dia)) (
PARTITION p0 VALUES LESS THAN (TO_DAYS('2015-12-31')),
PARTITION p1 VALUES LESS THAN (TO_DAYS('2016-12-31')),
PARTITION p2 VALUES LESS THAN (TO_DAYS('2017-12-31')),
PARTITION p3 VALUES LESS THAN (TO_DAYS('2018-12-31')),
..
PARTITION p10 VALUES LESS THAN MAXVALUE);
Use bigger or smaller partitions as you see fit.
The most important factor to keep in mind is that MySQL will generally use only one index per table in a query, so choose your indexes wisely.
If you only do COUNT(DISTINCT ...) at the granularity of a day, then build and incrementally maintain a summary table. It would be augmented each night by a query nearly identical to your SELECT, but fetching only yesterday's data.
Then use this Summary Table for the monthly "report".
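A sketch of such a summary table (names assumed). It keeps one row per (dia, cliente, campaign), the same granularity as the GROUP BY, so the distinct counts stay valid:
CREATE TABLE navigation_summary (
dia date NOT NULL,
cliente smallint unsigned NOT NULL,
campaign mediumint unsigned NOT NULL,
visitas int unsigned NOT NULL,        -- count(distinct ip)
paginas_vistas int unsigned NOT NULL, -- count(ip)
sesiones int unsigned NOT NULL,       -- count(distinct session_id)
cookies int unsigned NOT NULL,        -- count(distinct trackcookie)
PRIMARY KEY (dia, cliente, campaign)
) ENGINE=InnoDB;

-- Nightly job: fold in yesterday's rows.
INSERT INTO navigation_summary
SELECT dia, cliente, campaign,
       count(distinct ip), count(ip),
       count(distinct session_id), count(distinct trackcookie)
FROM navigation_base
WHERE dia = CURDATE() - INTERVAL 1 DAY
GROUP BY dia, cliente, campaign;
The monthly report then reads ~31 pre-aggregated rows per cliente/campaign instead of millions of raw rows.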
More on Summary Tables

MySQL table partition strange behavior (slow query suddenly)

(MySQL version: 5.6.15)
I have a huge table (Table_A) with 10M rows, in entity-attribute-value model.
It has a compound unique key [Field_A + Element + DataTime].
CREATE TABLE TABLE_A
(
`Field_A` varchar(5) NOT NULL,
`Element` varchar(5) NOT NULL,
`DataTime` datetime NOT NULL,
`Value` decimal(10,2) DEFAULT NULL,
UNIQUE KEY `A_ELE_TIME` (`Field_A`,`Element`,`DataTime`),
KEY `DATATIME` (`DataTime`),
KEY `ELE` (`Element`),
KEY `ELE_TIME` (`Element`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Rows are inserted/updated into the table every minute, so the row count for each [DataTime] value (i.e. each minute) is steady, around 3K rows.
I have a "select" query on this table that runs right after the insert/update above.
The query selects one specified element within the most recent 25 hours (around 30K rows). It usually completes within 3 sec.
SELECT
Field_A, Element, DataTime, `Value`
FROM
Table_A
WHERE
Element="XX"
AND DataTime between [time] and [time].
The original housekeeping removed any row older than 3 days, running every 5 minutes.
For better housekeeping, I tried partitioning the table based on [DataTime], every 6 hours (00, 06, 12, 18 local time).
PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime))
(PARTITION p2014103112 VALUES LESS THAN (73590212) ENGINE = InnoDB,
...
PARTITION p2014110506 VALUES LESS THAN (73590706) ENGINE = InnoDB,
PARTITION pFuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
My housekeeping script drops the expired partition and then creates a new one:
ALTER TABLE TABLE_A REORGANIZE PARTITION pFuture INTO (
PARTITION [new_partition_name] VALUES LESS THAN ([bound_value]),
PARTITION pFuture VALUES LESS THAN MAXVALUE
)
The new process seems to run smoothly.
However, the SELECT query suddenly slows down (> 100 sec).
The query stays slow even when all other processing is stopped. It is not fixed until the partitions are analyzed (reading and storing the key distributions for the partitions).
It usually happens once a day.
It does not happen with a non-partitioned table.
Therefore, we think it is caused by corrupted indexing in a huge partitioned MySQL table.
Does anyone have any idea on how to solve it?
Many Thanks!!
If you PARTITION BY RANGE (TO_DAYS(DataTime)*100+HOUR(DataTime)), then when you filter DataTime with a BETWEEN [from] AND [to] condition, MySQL will scan all partitions unless [from] equals [to]. Range pruning only works when the partitioning expression is one MySQL knows to be monotonic, such as TO_DAYS(), TO_SECONDS(), YEAR(), or UNIX_TIMESTAMP(); the mixed days-plus-hours expression above does not qualify.
So it is plausible that this is why your query slows down.
My suggestion is to partition using TO_DAYS(DataTime) without the hour; if you query the most recent 25 hours of data, it will scan at most 2 partitions.
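For example (partition names and dates assumed; one partition per day plus a catch-all):
ALTER TABLE Table_A PARTITION BY RANGE (TO_DAYS(DataTime)) (
PARTITION p20141104 VALUES LESS THAN (TO_DAYS('2014-11-05')),
PARTITION p20141105 VALUES LESS THAN (TO_DAYS('2014-11-06')),
PARTITION p20141106 VALUES LESS THAN (TO_DAYS('2014-11-07')),
PARTITION pFuture VALUES LESS THAN MAXVALUE
);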
I'm not a MySQL expert and can't fully explain the internals; perhaps others can elaborate. But you can use EXPLAIN PARTITIONS to verify the pruning. And here is the SQL Fiddle demo.
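A sketch of that check (MySQL 5.6 syntax; the partitions column of the output lists which partitions will actually be read):
EXPLAIN PARTITIONS
SELECT Field_A, Element, DataTime, `Value`
FROM Table_A
WHERE Element = 'XX'
  AND DataTime BETWEEN '2014-11-04 00:00:00' AND '2014-11-05 01:00:00';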