Partitioning a MySQL table with 3 billion records per year

What is a good approach to handling a 3-billion-record table where concurrent reads and writes are very frequent on the most recent few days of data?
Linux server, running MySQL v8.0.15.
I have a table that logs device data history. The table needs to retain its data for one year, possibly two. The growth rate is very high: 8,175,000 records/day (1 month ≈ 245M records, 1 year ≈ 2.98B records). If the number of devices grows, the table is expected to handle that as well.
Reads are frequent on the last few days of data; beyond about a week old, read frequency drops significantly.
There are multiple concurrent connections reading and writing this table, and the rows they target are quite close to each other, so deadlocks / lock waits happen, but they have been taken care of (retries, small transaction sizes).
I am using daily partitioning now, since reads rarely span more than one partition. However, that means too many partitions if I retain one year of data. Partitions are created and dropped on a schedule with cron.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Inserts are performed in batches of ~1500 rows:
INSERT INTO table1(group_id, DeviceId, DataTime, first_res)
VALUES(%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=VALUES(first_res);
Selects are mostly counts by DataTime or DeviceId, targeting a specific partition:
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the questions:
According to Rick James's blog, it is not a good idea to have more than 50 partitions in a table, but with monthly partitioning there would be 245M records per partition. What is the best partition range to use here? Does that advice still apply to current MySQL versions?
Is it a good idea to leave the table unpartitioned? (The indexes are performing well at the moment.)
Note: I have read this Stack question; having multiple tables is a pain, so if it is not necessary I would rather not split the table. Also, sharding is currently not possible.

First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc.), because we need to discuss batching the input rows even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
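For example, a summary table covering the second query (per-device daily counts) might look like the following. This is only a sketch: the device_daily_counts table and its columns are illustrative names, and tmp_table refers to the staging table discussed under "Ingestion via IODKU" below.
CREATE TABLE device_daily_counts (
    group_id tinyint NOT NULL,
    DeviceId varchar(10) NOT NULL,
    DataDate date NOT NULL,
    ct int unsigned NOT NULL,
    PRIMARY KEY (group_id, DataDate, DeviceId)
) ENGINE=InnoDB;
-- Augment the summary from the staged batch before touching the Fact table.
-- (Assumes the staged rows are new; adjust if re-sent rows should not be recounted.)
INSERT INTO device_daily_counts (group_id, DeviceId, DataDate, ct)
SELECT group_id, DeviceId, DATE(DataTime), COUNT(*)
FROM tmp_table
GROUP BY group_id, DeviceId, DATE(DataTime)
ON DUPLICATE KEY UPDATE ct = ct + VALUES(ct);
With that in place, the HAVING ct<50 report can read the small summary table instead of the billions-of-rows fact table.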
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1(group_id, DeviceId, DataTime, first_res)
SELECT group_id, DeviceId, DataTime, first_res
FROM tmp_table
ON DUPLICATE KEY UPDATE
    last_log=NOW(), last_res=VALUES(first_res);
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
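A minimal sketch of the staging step, reusing the column types from the question's table (tmp_table is just an illustrative name, and the literal values are placeholders):
CREATE TEMPORARY TABLE tmp_table (
    group_id tinyint NOT NULL,
    DeviceId varchar(10) NOT NULL,
    DataTime datetime NOT NULL,
    first_res tinyint(1) NOT NULL
) ENGINE=InnoDB;
-- One multi-row INSERT for the whole ~1500-row batch:
INSERT INTO tmp_table (group_id, DeviceId, DataTime, first_res)
VALUES (1, 'DEV0000001', FROM_UNIXTIME(1574735400), 0),
       (1, 'DEV0000002', FROM_UNIXTIME(1574735400), 1);
-- ...massage the rows if needed, then run the single IODKU..SELECT shown above.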
DeviceID
If the DeviceID is always 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
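For example (a sketch only; on a table this size the ALTER is a long, full rebuild):
ALTER TABLE table1
    MODIFY DeviceId CHAR(10) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL;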

Just for your reference:
I have a large table right now with over 30B rows, growing by 11M rows daily.
The table is an InnoDB table and is not partitioned.
Data older than 7 years is archived to file and purged from the table.
So if your performance is acceptable, partitioning is not necessary.
From a management perspective, it is easier to manage the table with partitions; you might partition the data by week. That is 52-104 partitions if you keep the last one or two years of data online.
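A sketch of what weekly ranges could look like on the asker's table (partition names and boundary dates are illustrative, and only a couple of partitions are shown):
ALTER TABLE table1 PARTITION BY RANGE (TO_DAYS(DataTime)) (
    PARTITION p_2019w47 VALUES LESS THAN (TO_DAYS('2019-11-25')),
    PARTITION p_2019w48 VALUES LESS THAN (TO_DAYS('2019-12-02')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);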

Related

Better Way of Storing Old Data for Faster Access

The application we are developing writes around 4-5 million rows of data every day, and we need to keep this data for the past 90 days.
The table user_data has the following structure (simplified):
id INT PRIMARY AUTOINCREMENT
dt TIMESTAMP CURRENT_TIMESTAMP
user_id varchar(20)
data varchar(20)
About the application:
Data older than 7 days will not be written / updated.
Data is mostly accessed based on user_id (i.e. all queries will have WHERE user_id = XXX)
There are around 13000 users at the moment.
Users can still access older data, but for older data we can restrict access to whole days only, not arbitrary time ranges. (E.g. if a user requests the data for 2016-10-01, he/she will get the data for the whole day and will not be able to get only 2016-10-01 13:00 - 2016-10-01 14:00.)
At the moment, we are using MySQL InnoDB to store the latest data (i.e. 7 days and newer) and it is working fine and fits in the innodb_buffer_pool.
As for the older data, we created smaller tables in the form of user_data_YYYYMMDD. After a while, we figured that these tables cannot fit into the innodb_buffer_pool and it started to slow down.
We think that splitting / sharding based on both date and user_id would be better (i.e. using smaller data sets per user and date, such as user_data_[YYYYMMDD]_[USER_ID]). This would keep each table much smaller (only around 10K rows at most).
After researching around, we have found that there are a few options out there:
Using mysql tables to store per user per date (i.e. user_data_[YYYYMMDD]_[USER_ID]).
Using mongodb collection for each user_data_[YYYYMMDD]_[USER_ID]
Write the old data (json encoded) into [USER_ID]/[YYYYMMDD].txt
The biggest con I see in this is that we will have a huge number of tables/collections/files (i.e. 13,000 x 90 = 1,170,000). I wonder whether we are approaching this the right way in terms of future scalability, or whether there are other standard solutions for this.
Scaling a database is a problem unique to each application. Most of the time someone else's approach cannot be reused, as almost every application writes its data in its own way. So you have to figure out how you are going to manage your data.
Having said that, if your data continues to grow, the best solution is sharding, where you distribute the data across different servers. As long as you are bound to a single server (e.g. just creating different tables), you will hit resource limits such as memory, storage, and processing power, which cannot be increased without limit.
How to distribute the data is something you have to figure out based on your business use cases. As you mentioned, if you are not getting many requests for old data, the best way is to distribute the data by date: a DB for 2016 data, a DB for 2015, and so on. Later you may purge or shut down the servers holding the oldest data.
This is a big table, but not unmanageable.
If user_id + dt is UNIQUE, make it the PRIMARY KEY, and get rid of id, thereby saving space. (More in a minute...)
Normalize user_id to a SMALLINT UNSIGNED (2 bytes) or, to be safer, MEDIUMINT UNSIGNED (3 bytes). This will save a significant amount of space.
Saving space is important for speed (I/O) for big tables.
PARTITION BY RANGE(TO_DAYS(dt))
with 92 partitions -- the 90 you need, plus one waiting to be DROPped and one being filled. See details here.
ENGINE=InnoDB
to get the PRIMARY KEY clustered.
PRIMARY KEY(user_id, dt)
If this is "unique", then it allows efficient access for any time range for a single user. Note: you can remove the "just a day" restriction. However, you must formulate the query without hiding dt in a function. I recommend:
WHERE user_id = ?
AND dt >= ?
AND dt < ? + INTERVAL 1 DAY
Furthermore,
PRIMARY KEY(user_id, dt, id),
INDEX(id)
Would also be efficient even if (user_id, dt) is not unique. The addition of id to the PK is to make it unique; the addition of INDEX(id) is to keep AUTO_INCREMENT happy. (No, UNIQUE(id) is not required.)
INT --> BIGINT UNSIGNED ??
INT (which is SIGNED) will top out at about 2 billion. That will happen in a very few years. Is that OK? If not, you may need BIGINT (8 bytes vs 4).
This partitioning design does not care about your 7-day rule. You may choose to keep the rule and enforce it in your app.
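Putting the above together, a sketch of what the revised table might look like. Only a few of the ~92 daily partitions are shown; dt is written as DATETIME because partitioning a TIMESTAMP column would require UNIX_TIMESTAMP() instead of TO_DAYS().
CREATE TABLE user_data (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    dt DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    user_id MEDIUMINT UNSIGNED NOT NULL,    -- normalized; map to the real user id elsewhere
    data VARCHAR(20),
    PRIMARY KEY (user_id, dt, id),
    KEY (id)                                -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p20161001 VALUES LESS THAN (TO_DAYS('2016-10-02')),
    PARTITION p20161002 VALUES LESS THAN (TO_DAYS('2016-10-03')),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);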
BY HASH
will not work as well.
SUBPARTITION
is generally useless.
Are there other queries? If so they must be taken into consideration at the same time.
Sharding by user_id would be useful if the traffic were too much for a single server. MySQL, itself, does not (yet) have a sharding solution.
Try TokuDB engine at https://www.percona.com/software/mysql-database/percona-tokudb
Archive data is a great fit for TokuDB. You will need about six times less disk space to store AND memory to PROCESS your dataset compared to InnoDB, or about 2-3 times less than archived MyISAM.
1 million+ tables sounds like a bad idea. Sharding via dynamic table naming in the app code at runtime has also not been a favorable pattern for me. My first go-to for this type of problem would be partitioning. You probably don't want 400M+ rows in a single unpartitioned table. In MySQL 5.7 you can even subpartition (but that gets more complex). I would first range-partition on your date field, with one partition per day, and index the user_id. If you are on 5.7 and want to dabble with subpartitioning, I would suggest range partitioning by date, then hash subpartitioning by user_id. As a starting point, try 16 to 32 hash buckets. Still index the user_id field.
EDIT: Here's something to play with:
CREATE TABLE user_data (
id INT AUTO_INCREMENT
, dt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
, user_id VARCHAR(20)
, data varchar(20)
, PRIMARY KEY (id, user_id, dt)
, KEY (user_id, dt)
) PARTITION BY RANGE (UNIX_TIMESTAMP(dt))
SUBPARTITION BY KEY (user_id)
SUBPARTITIONS 16 (
PARTITION p1 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-25')),
PARTITION p2 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-26')),
PARTITION p3 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-27')),
PARTITION p4 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-28')),
PARTITION pMax VALUES LESS THAN MAXVALUE
);
-- View the metadata if you're interested
SELECT * FROM information_schema.partitions WHERE table_name='user_data';

MySQL table setup for stock information

I am collecting about 3-6 million rows of stock data per day and storing it in a MySQL database.
All of the data comes from Interactive Brokers; every piece of information arrives with these five fields: Symbol, Date, Time, Value and Type (Type describing what kind of data I am receiving, such as price, volume, etc.).
Here is my create table statement. idticks is just my unique key, but I am almost never able to use it in queries.
CREATE TABLE `ticks` (
`idticks` int(11) NOT NULL AUTO_INCREMENT,
`symbol` varchar(30) NOT NULL,
`date` int(11) NOT NULL,
`time` int(11) NOT NULL,
`value` double NOT NULL,
`type` double NOT NULL,
KEY `idticks` (`idticks`),
KEY `symbol` (`symbol`),
KEY `date` (`date`),
KEY `idx_ticks_symbol_date` (`symbol`,`date`),
KEY `idx_ticks_type` (`type`),
KEY `idx_ticks_date_type` (`date`,`type`),
KEY `idx_ticks_date_symbol_type` (`date`,`symbol`,`type`),
KEY `idx_ticks_symbol_date_time_type` (`symbol`,`date`,`time`,`type`)
) ENGINE=InnoDB AUTO_INCREMENT=13533258 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (`date`)
PARTITIONS 1 */;
As you can see, I have no idea what I am doing because I just keep on creating indexes to make my queries go faster.
Right now the data is being stored on a rather slow computer for testing purposes, so I understand that my queries are not nearly as fast as they could be (I have a 6-core, 64 GB RAM, SSD machine arriving tomorrow, which should help significantly).
That being said, I am running queries like this one
select time, value from ticks where symbol = "AAPL" AND date = 20150522 and type = 8 order by time asc
The query above, if I do not limit it, returns 12928 records for one of my test days and takes 10.2 seconds if I do it from cleared cache.
I am doing lots of graphing and eventually would like to be able to query the data just as I need it for graphing. Right now I haven't noticed much difference in speed between getting part of a day's worth of data vs. the entire day's. It would be cool to have those queries respond fast enough that there is barely any delay when I move to the next day/screen/whatever.
Another query I use, for usability of a program I am writing to interact with the data, is:
String query = "select distinct `date` from ticks where symbol = '" + symbol + "' order by `date` desc";
But most of my need is the ability to pull a certain type of data from a certain day for a certain symbol like my first query.
I've googled all over the place and I think I understand that creating tons of indexes makes the database bigger and slows down the input speed (I get about 300 pieces of information per second on a busy day). Should I just index each column individually?
I am willing to throw more harddrives at things if it means responsive interface.
Basically, my questions relate to the creation/altering of my table. Based on the above query, can you think of anything I could do to make it faster? Or an indexing scheme that would help me out? Is InnoDB even the right engine? I tried googling this vs. MyISAM and after a couple of hours I still wasn't sure.
Thanks :)
Combine date and time into a DATETIME field
Assuming Price and Volume always come in together, put them together (2 columns) and get rid of type.
Get rid of the AUTO_INCREMENT; change to PRIMARY KEY(symbol, datetime).
Get rid of any indexes that are a leftmost prefix of some other index.
Once you are using DATETIME, use date ranges to find everything in a single day (if you need that); see the sketch below this list. Do not use DATE(datetime) = '...'; performance will be terrible.
Symbol can probably be ascii, not utf8.
Use InnoDB, the clustering of the Primary Key can be beneficial.
Do you expect to collect (and use) more data than will fit in innodb_buffer_pool_size? If so, we need to discuss your SELECTs and look into PARTITIONing.
Make those changes, then come back for more advice/abuse.
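A sketch of the revised table and the single-day query (the price and volume column names are assumptions, not from the question):
CREATE TABLE ticks (
    symbol VARCHAR(30) CHARACTER SET ascii NOT NULL,
    dt DATETIME NOT NULL,        -- former date + time combined
    price DOUBLE NOT NULL,
    volume DOUBLE NOT NULL,
    PRIMARY KEY (symbol, dt)
) ENGINE=InnoDB;
-- One day's data for one symbol, without hiding dt in a function:
SELECT dt, price
FROM ticks
WHERE symbol = 'AAPL'
  AND dt >= '2015-05-22'
  AND dt <  '2015-05-22' + INTERVAL 1 DAY
ORDER BY dt;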
You're creating a historical database, so MyISAM would work as well as InnoDB. InnoDB is a transactional engine and is better suited for databases with multiple tables that must remain synchronized.
Your Stock table looks like this.
Stock
-----
Stock ID (idticks)
Symbol
Date
Time
Value
Type
It would be better if you combined the date and time into a single timestamp column and unpacked the types into columns, like this:
Stock
-----
Stock ID
Symbol
Time Stamp
Volume
Open
Close
Bid
Ask
...
This makes it easier for the database to return rows for a query on a particular type, like the close value.
As far as indexes go, you can create as many as you want. You're adding (inserting) information, so the increased time to add information is offset by the decreased time to query it.
I'd have a primary index on Stock ID, and a unique index on Symbol and Time Stamp descending. You could also have indexes on the values you query most often, like Close.
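A sketch of that layout (the column set is illustrative; the DESC in the index is only honored by MySQL 8.0, earlier versions parse and ignore it):
CREATE TABLE stock (
    stock_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    symbol VARCHAR(30) NOT NULL,
    ts DATETIME NOT NULL,
    volume BIGINT UNSIGNED,
    open_price DOUBLE,
    close_price DOUBLE,
    bid DOUBLE,
    ask DOUBLE,
    PRIMARY KEY (stock_id),
    UNIQUE KEY sym_ts (symbol, ts DESC)
) ENGINE=InnoDB;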

MySQL table partition strange behavior (slow query suddenly)

(MySQL version: 5.6.15)
I have a huge table (Table_A) with 10M rows, in entity-attribute-value model.
It has a compound unique key [Field_A + Element + DataTime].
CREATE TABLE TABLE_A
(
`Field_A` varchar(5) NOT NULL,
`Element` varchar(5) NOT NULL,
`DataTime` datetime NOT NULL,
`Value` decimal(10,2) DEFAULT NULL,
UNIQUE KEY `A_ELE_TIME` (`Field_A`,`Element`,`DataTime`),
KEY `DATATIME` (`DataTime`),
KEY `ELE` (`Element`),
KEY `ELE_TIME` (`Element`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Rows are inserted/updated into the table every minute, so the number of rows for each [DataTime] value (i.e. each minute) is fairly constant, around 3K rows.
I have a SELECT query on this table that runs after the insert/update above.
The query selects one specified element within the most recent 25 hours (around 30K rows). It usually completes within 3 seconds.
SELECT
Field_A, Element, DataTime, `Value`
FROM
Table_A
WHERE
Element="XX"
AND DataTime between [time] and [time].
The original housekeeping removed any rows older than 3 days, running every 5 minutes.
For better housekeeping, I tried partitioning the table based on [DataTime], in 6-hour ranges (00, 06, 12, 18 local time).
PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime))
(PARTITION p2014103112 VALUES LESS THAN (73590212) ENGINE = InnoDB,
...
PARTITION p2014110506 VALUES LESS THAN (73590706) ENGINE = InnoDB,
PARTITION pFuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
My housekeeping script will drop the expired partition then create a new one
ALTER TABLE TABLE_A REORGANIZE PARTITION pFuture INTO (
PARTITION [new_partition_name] VALUES LESS THAN ([bound_value]),
PARTITION pFuture VALUES LESS THAN MAXVALUE
)
The new process seems to run smoothly.
However, the SELECT query will suddenly slow down (> 100 sec).
The query stays slow even after all other processes are stopped. It is not fixed until I "analyze partitions" (which reads and stores the key distributions of the partitions).
This usually happens once a day.
It does not happen to a non-partitioned table.
Therefore, we think it is caused by corrupted indexing in a partitioned MySQL (huge) table.
Does anyone have any idea on how to solve it?
Many Thanks!!
If you PARTITION BY RANGE (TO_DAYS(DataTime)*100 + HOUR(DataTime)), then when you filter DataTime with BETWEEN [from] AND [to], MySQL will scan all partitions unless [from] equals [to].
So it is not surprising that your query suddenly slows down.
My suggestion is to partition using TO_DAYS(DataTime) without the hour; if you query the most recent 25 hours of data, it will scan at most 2 partitions.
I'm not great at MySQL and can't explain it further; I hope someone else can. But you can use EXPLAIN PARTITIONS to verify it. And here is the SQL Fiddle demo.
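A sketch of the day-only scheme, plus the pruning check (partition names and boundary dates are illustrative):
ALTER TABLE TABLE_A PARTITION BY RANGE (TO_DAYS(DataTime)) (
    PARTITION p20141104 VALUES LESS THAN (TO_DAYS('2014-11-05')),
    PARTITION p20141105 VALUES LESS THAN (TO_DAYS('2014-11-06')),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);
-- The "partitions" column of the EXPLAIN output should list at most two partitions:
EXPLAIN PARTITIONS
SELECT Field_A, Element, DataTime, `Value`
FROM TABLE_A
WHERE Element = 'XX'
  AND DataTime BETWEEN '2014-11-04 12:00:00' AND '2014-11-05 12:00:00';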

Using MySQL Partitioning to speed up concurrent deletes and selects?

I have a MySQL Innodb table that contains about 8.5 million rows. The table structure basically looks like this:
CREATE TABLE `mydatatable` (
`ext_data_id` int(10) unsigned NOT NULL,
`datetime_utc` date NOT NULL DEFAULT '0000-00-00',
`type` varchar(8) NOT NULL DEFAULT '',
`value` decimal(6,2) DEFAULT NULL,
PRIMARY KEY (`ext_data_id`,`datetime_utc`,`type`),
KEY `datetime_utc` (`datetime_utc`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Every night, I delete the expired values from this table with the following query:
delete from mydatatable where datetime_utc < '2013-09-23'
This query does not seem to use indexes, and it takes quite some time to run. However, I also get concurrent updates and selects on the same table. These then get locked, making my website unresponsive during that time.
I am looking for ways to speed up this setup. I came across MySQL partitioning and I am wondering whether it would be a good fit. I am always adding and selecting newer data in this table and deleting old data. I could create partitions based on something like MOD(DAYOFYEAR(datetime), 4). Then, when I delete, I would always be deleting values from a different partition than the one I am reading from or writing to.
Will I experience locking with this setup? Will partitioning improve the query speed and availability in my case? Or should I look for another solution, and if so, which one?
Since MySQL 5.5 you can use COLUMNS partitioning (e.g. RANGE COLUMNS), which simplifies partitioning on non-integer columns (such as datetime_utc).
As for the performance:
Dropping a partition is a constant-time operation for LIST and RANGE partitioning. The speed is equivalent to a TRUNCATE TABLE or removing a file, so it is practically independent of the size of the partition.
A SELECT on a partitioned table benefits from partition pruning, so you read only from the partitions that match your search criteria. That can also speed up range scans.
Tip:
Do not forget to add a "default" partition, such as
PARTITION the_last_one VALUES LESS THAN(MAXVALUE)
in order to avoid INSERT/UPDATE statements failing because no partition is found to insert into.
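A sketch for the table in the question, using daily ranges (partition names and boundary dates are illustrative):
ALTER TABLE mydatatable PARTITION BY RANGE COLUMNS (datetime_utc) (
    PARTITION p20130922 VALUES LESS THAN ('2013-09-23'),
    PARTITION p20130923 VALUES LESS THAN ('2013-09-24'),
    PARTITION the_last_one VALUES LESS THAN (MAXVALUE)
);
-- The nightly purge then becomes a near-instant metadata operation instead of a long DELETE:
ALTER TABLE mydatatable DROP PARTITION p20130922;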
Absolutely, you are on the right track. You should create daily partitions here and store the data in them; your queries will be revolutionised and will run like a Ferrari. Also take a look at local indexes.
Also, with partitions your older data will not interfere, so whether you keep or delete it won't make much difference. In fact, instead of deleting rows you can simply drop partitions, which is also very fast.

Partitioning a MySQL table based on a column value

I want to partition a table in MySQL while preserving the table's structure.
I have a column, 'Year', based on which I want to split up the table into different tables for each year respectively. The new tables will have names like 'table_2012', 'table_2013' and so on. The resultant tables need to have all the fields exactly as in the source table.
I have tried the following two pieces of SQL script with no success:
1.
CREATE TABLE all_data_table
( column1 int default NULL,
column2 varchar(30) default NULL,
column3 date default NULL
) ENGINE=InnoDB
PARTITION BY RANGE ((year))
(
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2011) , PARTITION p2 VALUES LESS THAN (2012) ,
PARTITION p3 VALUES LESS THAN (2013), PARTITION p4 VALUES LESS THAN MAXVALUE
);
2.
ALTER TABLE all_data_table PARTITION BY RANGE COLUMNS (`year`) (
PARTITION p0 VALUES LESS THAN (2011),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Any assistance would be appreciated!
This is old, but seeing as it comes up highly ranked in partitioning searches, I figured I'd give some additional details for people who might hit this page. What you are talking about in having a table_2012 and table_2013 is not "MySQL Partitioning" but "Manual Partitioning".
Partitioning means that you have one "logical table" with a single table name, which--behind the scenes--is divided among multiple files. When you have millions to billions of rows, over years, but typically you are only searching a single month, partitioning by Year/Month can have a great performance benefit because MySQL only has to search against the file that contains the Year/Month that you are searching for...so long as you include the partition key in your WHERE.
When you create multiple tables like table_2012 and table_2013, you are MANUALLY partitioning the tables, which you don't do with the MySQL PARTITION configuration. To manually partition the tables, during 2012, you put all data into the 2012 table. When you hit 2013, you start putting all the data into the 2013 table. You have to make sure to create the table before you hit 2013 or it won't have any place to go. Then, when you query across the years (e.g. from Nov 2012 - Jan 2013), you have to do a UNION between table_2012 and table_2013.
SELECT * FROM table_2012 WHERE #...
UNION
SELECT * FROM table_2013 WHERE #...
With partitioning, this manual work is not necessary. You do the initial setup of the partitions, then you treat it as a single table. No unions required, no checking the date before you insert, etc. This makes life much easier. MySQL handles figuring out which partitions it needs to query. However, you MUST make sure to query against the Year column or it will have to scan ALL files. E.g. SELECT * FROM all_data_table WHERE Month=12 will scan all partitions for Month=12. To ensure you are only scanning the partition files that you need to scan, include the partition column in every query that you can.
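For instance, a sketch of real MySQL partitioning on the year (this assumes the table actually has an integer Year column, that the column is part of any PRIMARY or UNIQUE key, and that Month is another column; neither appears in the simplified table above):
ALTER TABLE all_data_table PARTITION BY RANGE (`Year`) (
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION pMax VALUES LESS THAN MAXVALUE
);
-- Prunes to one partition because Year is in the WHERE clause:
SELECT * FROM all_data_table WHERE `Year` = 2013 AND `Month` = 12;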
Possible negatives to partitioning...if you have billions of rows and you do an ALTER TABLE on the table to--say--add a column...it's going to have to update every row taking a VERY long time. At the company I currently work for, the boss doesn't think it's worth the time it takes to update the billion rows historically when we are adding a new column for going forward...so this is one of the reasons we do manual partitioning instead of letting MySQL do it.
DISCLAIMER: I am not an expert at partitioning...so if I'm wrong in any of this, please let me know and I'll fix the incorrect parts.
From what I see, you want to create many tables from one big table.
I think you should try creating views instead.
From what I have read about partitioning, it partitions the physical storage of the table and stores the pieces separately, but seen from the top-level perspective it still appears as a single table.
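A sketch of per-year views over the one physical table (this assumes a Year column exists; the view names mirror the ones requested in the question):
CREATE VIEW table_2012 AS SELECT * FROM all_data_table WHERE `Year` = 2012;
CREATE VIEW table_2013 AS SELECT * FROM all_data_table WHERE `Year` = 2013;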