MySQL - Move data between partitions (aka re-partition)

I have a mysql table whose partitions look as below
p2015h1 - Contains data where date < 2015-07-01 (has data only from 2015-06-01, hence only a month's worth of data)
p2015h2 - Contains data where date < 2016-01-01
p2016h1 - Contains data where date < 2016-07-01
p2016h2 - Contains data where date < 2017-01-01
I'd like the new partitions to be quarterly based as below -
p0 - Contains data where date < 2015-10-01
p1 - Contains data where date < 2016-01-01
p2 - Contains data where date < 2016-04-01
p3 - Contains data where date < 2016-07-01
I started by reorganizing the first partition with the command below. All went well.
alter table `table1` reorganize partition `p2015half1` into (partition `p0` values less than ('2015-10-01'));
Now, since the existing partition p2015h2 includes data up to 2015-10-01, how can I move that part into partition p0? I will need to do the same with the other partitions as I continue building the new ones.
I did try removing partitioning from the table entirely, but the table holds billions of rows, so that operation alone would take days, and rebuilding the partitions afterwards would take days again. Hence I decided to take the approach of splitting partitions instead.
I'm stuck at this point and would appreciate any guidance.

-- Merge p0 and p2015half2, re-splitting at the 2015-07-01 and 2016-01-01 boundaries:
mysql> alter table `table1` reorganize partition p0,p2015half2 into (partition p00 values less than ('2015-07-01'), partition p1 values less than ('2016-01-01'));
-- Rename p00 back to p0 (same boundary, new name):
mysql> alter table `table1` reorganize partition p00 into (partition p0 values less than ('2015-07-01'));
-- Split the two 2016 half-year partitions into quarters, plus a catch-all:
mysql> alter table `table1` reorganize partition p2016half1,p2016half2 into (partition p2 values less than ('2016-04-01'), partition p3 values less than ('2016-07-01'),partition p4 values less than maxvalue);
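The remaining splits repeat the same pattern with different boundaries. As a sketch only, here is a small Python helper (the function name and driver code are my own invention, not part of the question) that generates such REORGANIZE statements from a list of quarterly boundaries:

```python
from datetime import date

def reorganize_stmt(table, old_parts, new_parts):
    """Build an ALTER TABLE ... REORGANIZE PARTITION statement.

    old_parts: existing partition names to merge/split
    new_parts: (name, upper_bound) pairs, ascending by bound
    """
    olds = ", ".join(f"`{p}`" for p in old_parts)
    news = ", ".join(
        f"PARTITION `{name}` VALUES LESS THAN ('{bound}')"
        for name, bound in new_parts
    )
    return f"ALTER TABLE `{table}` REORGANIZE PARTITION {olds} INTO ({news})"

# Split the half-year partition into the two quarterly ones:
stmt = reorganize_stmt(
    "table1",
    ["p0", "p2015half2"],
    [("p00", date(2015, 7, 1)), ("p1", date(2016, 1, 1))],
)
print(stmt)
```

Generating the statements this way makes it easy to review every boundary before running anything against a billions-of-rows table.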

What data type should I use in MySQL to store daily items and plot 1-year charts?

I am developing a web application and I want to plot a 1-year chart with daily data points.
The x-axis is time (date) and the y-axis is of numeric type.
MySQL version: 8.0 (or higher)
The database must store data points for multiple customers.
For each customer I want to show the last 365 data points (1-year data).
Each data point is a tuple: (date, int). For example: (2022/11/10, 35)
The chart displays data for one single customer at a time.
Every day a new data point is calculated and added to the customer dataset.
Each customer's dataset can contain up to 5 years of data points.
The number of customers is 1000.
Assuming customer is a foreign key (FK) to the Customers table, I have considered two options for the dataset.
Option A
Primary Key | Customer(FK) | Date   | Value
1           | Customer 1   | Date 1 | Val1
2           | Customer 1   | Date 2 | Val2
...         | ...          | ...    | ...
N           | Customer 1   | Date N | ValN
N+1         | Customer 2   | Date 1 | ValN+1
...         | ...          | ...    | ...
2N          | Customer 2   | Date N | Val2N
Option B
Use a JSON type for the dataset
Primary Key | Customer(FK) | Dataset
1           | Customer 1   | Dataset 1
2           | Customer 2   | Dataset 2
Where each dataset looks like:
((2022/01/01, 35), (2022/01/02, 17), ...., (2022/12/31, 42))
Comments:
My interest is in plotting the chart as fast as possible. Since insert/update operations only happen once a day (for every customer), my question is:
Which option is better for data retrieval?
Right now I have around 50 customers and a 2-year data history, but I don't know how the database will perform as I increase both the number of customers and the number of years.
Additionally, I am using a JavaScript plotting library in the frontend so I was wondering whether the JSON data type approach could fit better for this purpose.
CREATE TABLE datapoints (
    c_id DATE NOT NULL COMMENT '', -- placeholder removed below
    date DATE NOT NULL,
    PRIMARY KEY(c_id, date)
) ENGINE=InnoDB;
Pick the smallest datatype that is appropriate for your values. For example, SMALLINT UNSIGNED takes only 2 bytes and allows non-negative values up to 65,535. FLOAT is 4 bytes, has a big range, and carries far more significant digits (about 7) than you can reasonably graph.
The main queries. First various ways to do the daily INSERT:
INSERT INTO datapoints (c_id, date, datapoint)
VALUES(?,?,?);
or
INSERT INTO datapoints (c_id, date, datapoint)
VALUES
(?,?,?),
(?,?,?), ...
(?,?,?); -- 1000 rows batched
or
LOAD DATA ...
Fetching for the graph:
SELECT date, datapoint
FROM datapoints
WHERE c_id = ...
AND date >= CURDATE() - INTERVAL 1 YEAR -- or whatever
ORDER BY date;
1.8M rows (probably under 1GB) is not very big. Still, I recommend the PRIMARY KEY be in that order and not involve an AUTO_INCREMENT. The INSERT(s) will poke into the table at 1000 places once a day. The SELECT (for graphing) will find all the data clustered together -- very fast.
If you will be keeping the data past the year, we can discuss things further. Meanwhile, to purge after 5 years, this will be slow, but it is only once a day:
DELETE FROM datapoints
WHERE date < CURDATE() - INTERVAL 5 YEAR;
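The "1.8M rows (probably under 1GB)" estimate above is easy to verify with back-of-the-envelope arithmetic; a quick check (the 7-byte figure assumes the SMALLINT/DATE/SMALLINT layout from the CREATE TABLE, before InnoDB overhead):

```python
customers = 1000
days_per_year = 365
years = 5

rows = customers * days_per_year * years
print(f"{rows:,} rows")  # 1,825,000 rows

# Fixed-width payload per row: SMALLINT (2) + DATE (3) + SMALLINT (2) = 7 bytes.
# InnoDB row and index overhead inflates that several-fold, but the table
# still lands comfortably under 1 GB.
payload_mb = rows * 7 / 1024 / 1024
print(f"~{payload_mb:.0f} MB of raw payload")
```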

MySQL Partition By DATEDIFF

I have a table with the ReferenceDate field. I intend to partition it using this field as follows:
partition_0: Values more than 1 year old;
partition_1: Values older than 6 months;
partition_2: Values older than 3 months;
partition_3: Values for the last 3 months;
For this I tried the following script to change the table:
ALTER TABLE `MyTable`
PARTITION BY RANGE (DATEDIFF(NOW(), `ReferenceDate`))
(
PARTITION p0_historic_data VALUES LESS THAN (90),
PARTITION p1_intermediary_data VALUES LESS THAN (180),
PARTITION p2_intermediary_data VALUES LESS THAN (365),
PARTITION p3_current_data VALUES LESS THAN MAXVALUE
);
However, I believe that I cannot use the NOW() function in the partitioning clause. Something I was able to use was TO_DAYS, but it doesn't give me the result I need: with DATEDIFF I get the difference between the current date and ReferenceDate, whereas TO_DAYS returns the number of days from year 0 to the date.
I would like to know whether there is really no way to use DATEDIFF here, or whether there is some alternative along those lines.
A PARTITIONed table is one where some of the rows are permanently put in one 'sub-table' or another, based on the instructions in PARTITION BY ....
So, it is flatly not possible. To implement such, MySQL would have to move rows from one partition to another, even when you are not touching the table.
Even if it were possible, it might not provide any performance improvement. After all, you can have something like this:
WHERE ReferenceDate >= NOW() - INTERVAL 180 DAY
AND ReferenceDate < NOW() - INTERVAL 90 DAY
Then, if you also have
AND CustomerId = 123
then this index would be excellent for finding the desired rows:
INDEX(CustomerId, ReferenceDate)
That does not need PARTITIONing.
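Since the partition expression cannot reference NOW(), the rolling windows live in the query instead. A minimal Python sketch of computing the boundary dates client-side (the function and parameter names are my own):

```python
from datetime import date, timedelta

def window_bounds(today, older_days, newer_days):
    """Half-open date window for rows between `newer_days` and
    `older_days` old, matching the WHERE clause in the answer."""
    return (today - timedelta(days=older_days),
            today - timedelta(days=newer_days))

start, end = window_bounds(date(2023, 1, 1), older_days=180, newer_days=90)
# WHERE ReferenceDate >= %s AND ReferenceDate < %s  -> (start, end)
print(start, end)  # 2022-07-05 2022-10-03
```

Binding the boundaries as query parameters keeps the "age" logic in one place, and the INDEX(CustomerId, ReferenceDate) suggested above serves every window equally well.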

How to generate faster mysql query with 1.6M rows

I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or 10000, and narrowing the date filter to one month, but it still averages 7.5 s. I tried adding a composite index on pid and cdate, but that made the query about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
Looks like the index is missing. Create this index and see if it helps:
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as below:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
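The same half-open range can be generated client-side; a small sketch (the helper name is mine) — note that replace() would raise ValueError for a Feb 29 start, which a Jan 1 boundary avoids:

```python
from datetime import date

def year_range(start):
    """Half-open [start, start + 1 year); sidesteps leap years,
    midnight, and 23:59:59 endpoint fiddling."""
    return start, start.replace(year=start.year + 1)

lo, hi = year_range(date(2017, 1, 1))
# WHERE cdate >= %s AND cdate < %s  -> (lo, hi)
print(lo, hi)  # 2017-01-01 2018-01-01
```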

Dropping Partitions in Vertica

I have a table like below in Vertica,
Seq_No | CO_NO | DATE
1      | PQ01  | 01-Sep-15
2      | XY01  | 01-Oct-15
3      | AB01  | 01-Nov-15
4      | PQ02  | 01-Dec-15
...    | ...   | ...
14     | XYZ9  | 01-Oct-16
The table is partitioned by month and year based on the DATE column.
At any point in time there must be only 13 partitions, i.e. the latest 13 months of data.
When the current month's data (Oct-16) comes in, we need to drop last year's September partition (Sep-15), keeping only the latest 13 months of data on the table.
How can we achieve this in Vertica?
To do this, use the DROP_PARTITION function:
SELECT DROP_PARTITION('schema.table',CAST(TO_CHAR(ADD_MONTHS(SYSDATE,-13),'YYYYMM') AS INTEGER));
What you need is a cron job that runs at the beginning of every month.
Before that, manually drop all partitions older than 13 months, and then let the job do its work.
Note: your table must be partitioned like :
PARTITION BY (((date_part('year', Datecol) * 100) + date_part('month', Datecol)))
Test the DROP_PARTITION call before using it: create a dummy table and run it there first.
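The YYYYMM key that TO_CHAR(ADD_MONTHS(SYSDATE, -13), 'YYYYMM') produces can be cross-checked in the cron script itself; a sketch of the same month arithmetic in Python (the helper name is my own):

```python
from datetime import date

def partition_key_months_back(today, back=13):
    """Integer YYYYMM key of the month `back` months before `today`,
    mirroring TO_CHAR(ADD_MONTHS(SYSDATE, -13), 'YYYYMM')."""
    months = today.year * 12 + (today.month - 1) - back
    return (months // 12) * 100 + (months % 12) + 1

print(partition_key_months_back(date(2016, 10, 15)))  # 201509
```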
I'm assuming your focus is on the "at any point of time" part of your question. Two possible solutions:
Add a script to your loading job that finds any partitions older than your threshold and drops them (look at the PARTITIONS system view; if you are trying to come up with a more generic approach, you can extract the partition expression from the TABLES system view).
Instead of having to stay on top of the partition drops, you could create a view over the table that only shows the past year of data. Example:
create view myview
as
select * from mytable
where mydate >= current_timestamp - interval '1 year'
Or something similar, like trunc(current_timestamp - interval '1 year', 'MM'), etc. Then you can drop the underlying partitions at your leisure.

Doesn't partition pruning work if the range size is larger than the number of partitions?

I have 15 million rows in my table, and new data arrives every 4 seconds. So I decided to partition by day, as follows:
ALTER TABLE vehicle_gps
PARTITION BY RANGE(UNIX_TIMESTAMP(gps_time)) (
PARTITION p01 VALUES LESS THAN (UNIX_TIMESTAMP('2014-01-01 00:00:00')),
.
.
.
PARTITION p365 VALUES LESS THAN (UNIX_TIMESTAMP('2015-01-01 00:00:00')));
I had to create 365 partitions, as shown. Each day's partition contains around 100 thousand rows.
And if I want to fetch the data by giving a query
SELECT gps_time FROM vehicle_gps
WHERE gps_time BETWEEN '2014-05-01 00:00:00' AND '2014-05-06 00:00:00';
I found that partition pruning is not happening. The MySQL manual says that if the range of values is larger than the number of partitions, pruning won't happen. If so, what is the point of creating partitions on a table with as much data as mine? Since I'm new to partitioning I'm confused; please guide me if I'm wrong and help me learn.
Thank you :)
It just doesn't work with dates; here is a short extract from the MySQL documentation:
Pruning can be used only on integer columns of tables partitioned by HASH or KEY. For example, this query cannot use pruning because dob is a DATE column:
SELECT * FROM t4 WHERE dob >= '2001-04-14' AND dob <= '2005-10-15';
However, if the table stores year values in an INT column, then a query having WHERE year_col >= 2001 AND year_col <= 2005 can be pruned.
Hope it helps!
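For what it's worth, the question's table is partitioned on UNIX_TIMESTAMP(gps_time), i.e. on integers (epoch seconds). A small sketch of converting the query's date boundaries to those same integers — note that MySQL's UNIX_TIMESTAMP() uses the session time zone, while this example pins UTC:

```python
from datetime import datetime, timezone

def epoch_seconds(s):
    """Epoch seconds for a 'YYYY-MM-DD HH:MM:SS' string, taken as UTC."""
    return int(datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
               .replace(tzinfo=timezone.utc)
               .timestamp())

lo = epoch_seconds("2014-05-01 00:00:00")
hi = epoch_seconds("2014-05-06 00:00:00")
print(lo, hi)  # five days apart in epoch seconds
```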