Maintaining large quantities of historical data efficiently

I've been thinking about keeping a history in the following table structure:
CREATE TABLE `history` (
`id` bigint unsigned not null auto_increment,
`userid` bigint unsigned not null,
`date` date not null,
`points_earned` int unsigned not null,
primary key (`id`),
key `userid` (`userid`),
key `date` (`date`)
);
This will allow me to do something like SO does with its Reputation Graph (where I can see my rep gain since I joined the site).
Here's the problem, though: I just ran a simple calculation:
SELECT SUM(DATEDIFF(`lastclick`,`registered`)) FROM `users`
The result was as near as makes no difference 25,000,000 man-days. If I intend to keep one row per user per day, that's a [expletive]ing large table, and I'm expecting further growth. Even if I exclude days where a user doesn't come online, that's still huge.
Can anyone offer any advice on maintaining such a large amount of data? The only queries that will be run on this table are:
SELECT * FROM `history` WHERE `userid`=?
SELECT SUM(`points_earned`) FROM `history` WHERE `userid`=? AND `date`>?
INSERT INTO `history` VALUES (null,?,?,?)
Would the ARCHIVE engine be of any use here, for instance? Or do I just not need to worry because of the indexes?

Assuming it's MySQL:
For history tables you should consider partitioning. You can choose whichever partitioning rule suits you best; looking at your queries, there are two choices:
a. partition by date (for example, 1 partition = 1 month)
b. partition by user (say 300 partitions, with 1 partition = 100,000 users)
This will help you a lot, provided your queries can take advantage of partition pruning.
You could use a composite index on (userid, date); it will be used by the first two queries.
Avoid individual INSERT statements; when you have a huge amount of data, use LOAD DATA (this will not work if the table is partitioned).
And most important: the best engine for huge volumes of data is MyISAM.
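As a rough sketch of option (a) combined with the composite index, assuming the `history` table from the question: MySQL requires every unique key, including the primary key, to contain the partitioning column, so the primary key is widened here to (userid, date, id), which then doubles as the composite index. The partition names and boundary dates are hypothetical, and no storage engine is specified.

```sql
-- Sketch only: partition names and boundary dates are made up for illustration.
CREATE TABLE `history` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `userid` BIGINT UNSIGNED NOT NULL,
  `date` DATE NOT NULL,
  `points_earned` INT UNSIGNED NOT NULL,
  -- Composite key for "WHERE userid = ? AND date > ?"; it includes `date`
  -- because the partitioning column must be part of every unique key.
  PRIMARY KEY (`userid`, `date`, `id`),
  -- Keeps `id` as the leading column of an index so that AUTO_INCREMENT
  -- continues to behave as a plain counter.
  KEY `id` (`id`)
)
PARTITION BY RANGE (TO_DAYS(`date`)) (
  PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
  PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
  PARTITION p_max VALUES LESS THAN MAXVALUE
);

-- The `partitions` column of the plan lists the partitions that will be read
-- (older MySQL versions need EXPLAIN PARTITIONS instead of plain EXPLAIN).
EXPLAIN
SELECT SUM(`points_earned`)
FROM `history`
WHERE `userid` = 42 AND `date` > '2012-02-15';
```

The column order is unchanged, so the INSERT from the question keeps working as written.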

Related

MySQL Large Table Sharding to Smaller Table based on Unique ID

We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) filter on serial_number and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application will always access the data by device first.
We have checked that no query accesses the data without specifying the device serial number first.
The table for each device will be relatively small (9,000 rows per day).
A few challenges that we think we will face are:
We have a lot of devices, which means there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not limit the number of tables in a database. Will this hurt performance compared with keeping everything in one table?
How will this affect us later when we want to scale MySQL (e.g. using master/slave replication)?
Are there any alternative solutions to this problem?
Update. Below is the show create table result from our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;

The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
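A minimal sketch of that primary-key change as a single ALTER, assuming the table and column names given in the question (device_data, serial_number, dt, ID). Changing the primary key rebuilds the whole table, so on a table of this size it is worth trying on a copy first.

```sql
-- Both changes go in one statement: the AUTO_INCREMENT column must still be
-- the leading column of some index once the old primary key is gone.
ALTER TABLE device_data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (serial_number, dt, ID),
  ADD INDEX (ID);
```

After this, the rows for one serial_number are stored next to each other in dt order, which is exactly the range the frequent query reads.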
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare and, again, not a performance benefit. There are probably a hundred Q&As on Stack Overflow warning against it.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
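The purge-by-partition idea from the comments above could look roughly like the sketch below. It assumes the column list from the question (the other columns are omitted) and hypothetical monthly partition names. Because dt is a TIMESTAMP in the question, the sketch partitions on UNIX_TIMESTAMP(dt): MySQL accepts only that function for a TIMESTAMP column, while TO_DAYS() fits a DATE or DATETIME column.

```sql
-- Sketch only: monthly boundaries and partition names are illustrative.
CREATE TABLE device_data (
  ID INT UNSIGNED NOT NULL AUTO_INCREMENT,
  dt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  serial_number VARCHAR(20) NOT NULL,
  data1 DOUBLE,
  data2 DOUBLE,
  PRIMARY KEY (serial_number, dt, ID),  -- contains dt, as partitioning requires
  INDEX (ID)                            -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(dt)) (
  PARTITION p201801 VALUES LESS THAN (UNIX_TIMESTAMP('2018-02-01 00:00:00')),
  PARTITION p201802 VALUES LESS THAN (UNIX_TIMESTAMP('2018-03-01 00:00:00')),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);

-- Purging a month is a quick metadata operation instead of a huge DELETE.
ALTER TABLE device_data DROP PARTITION p201801;

-- Carve the next month out of the catch-all partition before it fills up.
ALTER TABLE device_data REORGANIZE PARTITION pmax INTO (
  PARTITION p201803 VALUES LESS THAN (UNIX_TIMESTAMP('2018-04-01 00:00:00')),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
```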

Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`,`id`),
INDEX(`id`),
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE
Notes:
By starting the PK with device_sn, dt, you get the clustering benefit that makes the query with WHERE device_sn = .. AND dt BETWEEN ... efficient: the matching rows are stored next to each other in the table.
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lng decimal(10,5) -- You don't need 5 digits to the left of the decimal point; 2 are enough for lat and 3 for lng. So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
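The 3-byte figure follows from how MySQL packs DECIMAL values: the integer and fractional parts are stored separately, and a leftover group of 1 to 2 digits takes 1 byte, 3 to 4 digits take 2 bytes, and 5 to 6 digits take 3 bytes. So decimal(10,5) needs 3 + 3 = 6 bytes, decimal(7,5) needs 1 + 3 = 4 bytes and decimal(8,5) needs 2 + 3 = 5 bytes, saving 2 + 1 = 3 bytes per row. A sketch of the column and redundant-index part of the change, assuming no stored lat/lng value falls outside the narrower ranges:

```sql
-- Check first that existing values fit: |lat| < 100 and |lng| < 1000.
ALTER TABLE test_udp_new
  DROP INDEX `dt`,                        -- redundant: (dt, device_sn) starts with dt
  MODIFY `lat` DECIMAL(7,5) DEFAULT NULL, -- 6 bytes -> 4 bytes
  MODIFY `lng` DECIMAL(8,5) DEFAULT NULL; -- 6 bytes -> 5 bytes
```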

MySQL Partitioning a VARCHAR(60)

I have a very large 500 million rows table with the following columns:
id - Bigint - Autoincrementing primary index.
date - Datetime - Approximately 1.5 million rows per date; data older than 1 year is deleted.
uid - VARCHAR(60) - A user ID
sessionNumber - INT
start - INT - epoch of start time.
end - INT - epoch of end time.
More columns not relevant for this query.
The combination of uid and sessionNumber forms a unique index. I also have an index on date.
Due to the sheer size, I'd like to partition the table.
Most of my accesses would be by date, so partitioning by date ranges seems intuitive, but as the date is not part of the unique index, this is not an option.
Option 1: RANGE PARTITION on Date and BEFORE INSERT TRIGGER
I don't really have a regular issue with the uid and sessionNumber uniqueness being violated. The source data is consistent, but sessions that span two days may be inserted on two consecutive days with midnight being the end time of the first and start time of the second.
I'm trying to understand if I could remove the unique key and instead use a trigger that would:
check if there is a session with the same identifiers on the previous day and, if so,
update its end date, and
cancel the actual insert.
However, I am not sure if I can 1) trigger an update on the same table, or 2) prevent the actual insert.
Option 2: LINEAR HASH PARTITION on UID
My second option is to use a linear hash partition on the UID. However, I cannot find any example that takes a VARCHAR and converts it to an INTEGER for use in HASH partitioning, nor a permitted way to do the conversion. For example:
ALTER TABLE mytable
PARTITION BY HASH (CAST(md5(uid) AS UNSIGNED integer))
PARTITIONS 20
returns that the partition function is not allowed.

HASH partitioning must work with a 32-bit integer. But you can't convert an MD5 string to an integer simply with CAST().
Instead of MD5, CRC32() can take an arbitrary string and convert it to a 32-bit integer. But this is also not a valid function for partitioning.
mysql> alter table v partition by hash(crc32(uid));
ERROR 1564 (HY000): This partition function is not allowed
You could partition by the string using KEY partitioning instead of HASH partitioning. KEY partitioning accepts strings. It passes the input string through MySQL's built-in PASSWORD() function, which is basically related to SHA1.
However, this leads to another problem with your partitioning strategy:
mysql> alter table v partition by key(uid);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
Your table's primary key id does not include the column uid that you want to partition by. This is a restriction of MySQL's partitioning:
every unique key on the table must use every column in the table's partitioning expression.
Here's the table I'm testing with (it would have been a good idea for you to include this in your question):
CREATE TABLE `v` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`uid` varchar(60) NOT NULL,
`sessionNumber` int(11) NOT NULL,
`start` int(11) NOT NULL,
`end` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`,`sessionNumber`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
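For illustration only, here is one way the restriction could be satisfied on this test table: widen the primary key so that it contains uid. This is a sketch under the assumption that giving up key-enforced uniqueness of id alone is acceptable. The partition count of 20 is taken from the question.

```sql
-- Widen the primary key so that it contains the partitioning column.
-- Trade-off: `id` on its own is no longer enforced as unique by a key.
ALTER TABLE v
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, uid);

-- Every unique key (PRIMARY and `uid`) now contains uid, so KEY partitioning
-- on uid should no longer raise ERROR 1503.
ALTER TABLE v
  PARTITION BY KEY (uid)
  PARTITIONS 20;
```

Whether this is worth doing at all depends on which queries it is meant to help.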
Before going any further, I have to wonder why you want to use partitioning anyway? "Sheer size" is not a reason to partition a table.
Partitioning, like any optimization, is done for the sake of specific queries you want to optimize for. Any optimization improves one query at the expense of other queries. Optimization has nothing to do with the table. The table is happy to sit there with 5 billion rows, and it doesn't care. Optimization is for the queries.
So you need to know which queries you want to optimize for. Then decide on a strategy. Partitioning might not be the best strategy for the set of queries you need to optimize!

I'll assume your 'uid' is a 128-bit UUID kind of value, which can be stored as a BINARY(16), because that is generally worth the trouble.
Next, stay away from the 'datetime' type, as it is stored like a packed string, and doesn't hold any timezone information. Store date-time-values either as pure numerical values (the number of seconds since the UNIX-epoch), or let MySQL do that for you and use the timestamp(N) type.
Also don't call a column 'date', not just because that is a reserved word, but also because the value contains time details too.
Next, stay away from using anything else than latin1 as the CHARSET of (all) your tables. Only ever do UTF-8-ness at the column level. This to prevent unnecessarily byte-wide columns and indexes creeping in over time. Adopt this habit and you'll happily look back on it after some years, promised.
This makes the table look like:
CREATE TABLE `v` (
`uuid` binary(16) NOT NULL,
`mysql_created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`visitor_uuid` BINARY(16) NOT NULL,
`sessionNumber` int NOT NULL,
`start` int NOT NULL,
`end` int NOT NULL,
PRIMARY KEY (`uuid`),
UNIQUE KEY (`visitor_uuid`,`sessionNumber`),
KEY (`mysql_created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS (`uuid`)
( PARTITION `p_0` VALUES LESS THAN (X'10')
, PARTITION `p_1` VALUES LESS THAN (X'20')
...
, PARTITION `p_9` VALUES LESS THAN (X'A0')
, PARTITION `p_A` VALUES LESS THAN (X'B0')
...
, PARTITION `p_F` VALUES LESS THAN (MAXVALUE)
);
To have the KEY on mysql_created_at cover only the date part, you need a calculated column, which can be added in place; an index on it is then also cheap to add, so I'll leave that as homework.

MySQL Database Operations

If I want to read the table id from a MySQL database in order to add it to a write operation, would I need to perform two queries? Or is there a way to perform just one when using MySQL? As a rule of thumb, you rarely need two queries but should never exceed two queries for a single operation, correct?

Can I use one query?
When you say "table id" I suppose you mean the id column of a table... No need to use two queries. You can use one query and you can insert multiple records at once if you wish (recommended).
An example: insert two products from a products list as new entries of an order (with 37 as order_id) into an orders table. Each product_id (2 and 3, respectively) will be read from the products table based on the specified product_code value (6587 and 9678, respectively).
INSERT INTO orders (
order_id,
product_id
) VALUES (
37,
(SELECT id FROM products WHERE product_code = '6587')
), (
37,
(SELECT id FROM products WHERE product_code = '9678')
);
Where the tables have the following structures:
CREATE TABLE `products` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`product_code` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8;
CREATE TABLE `orders` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`order_id` int(11) DEFAULT NULL,
`product_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and the products table has the following values:
INSERT INTO `products` (`id`, `product_code`)
VALUES
(1,'1234'),
(2,'6587'),
(3,'9678'),
(4,'5676');
The result in the orders table will look like this:
id  order_id  product_id
------------------------
1   37        2
2   37        3
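An alternative single-statement form, shown as a sketch against the same products and orders tables, is INSERT ... SELECT. It avoids one subquery per row and compares product_code with string literals, since the column is a VARCHAR.

```sql
-- One row is inserted into orders for every matching product.
INSERT INTO orders (order_id, product_id)
SELECT 37, id
FROM products
WHERE product_code IN ('6587', '9678')
ORDER BY id;
```

With the sample data above this inserts the same two rows, with product_id 2 and 3.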
Rules of thumb
I have also never heard of such a rule regarding the number of queries needed for an operation. Anyway, these are some main rules that I strictly follow:
If you have the chance to achieve a specific data access operation using only one query, even if it becomes very complex, then don't hesitate to do it. Put the database engine to work as much as possible, even if it seems easier to "chunk" the db operations and use the features of a programming language to run them.
Make use of well-designed indexes. They are very powerful for speed optimization. Use EXPLAIN to check them.
Design your tables so that they contain no redundant data. For example, following my example above, the product_code should be saved only in the products table, even if it could make some sense to save it in the orders table as well.
"Standardize" your own naming rules across the tables in a/all database(s).
Good luck!

I do not think that there are such rules of thumb.
You should use as many queries as you need to achieve your goal. What you actually end up with will depend on the actions to be done, on the performance needed, on the maintainability needed (think of co-workers who might have to change it) and on other requirements or preferences.
If it really were possible to do every "single operation" with one query, we probably would not need transactions.
My advice would be to solve your problem with as many queries as you need so that you and your co-workers some years in the future still understand what has been done, and to look into transactions (the MySQL manual and a myriad of tutorials on the net explain them quite well).
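As a small sketch of what that can look like, assuming the products and orders tables from the answer above (the product code '4321' and order number 38 are made-up values): two plain statements inside one transaction, with LAST_INSERT_ID() carrying the generated id from the first to the second.

```sql
START TRANSACTION;

-- Hypothetical new product; its id is generated by AUTO_INCREMENT.
INSERT INTO products (product_code) VALUES ('4321');

-- LAST_INSERT_ID() returns the id generated by the previous INSERT on this
-- connection, so no separate SELECT is needed to read it back.
INSERT INTO orders (order_id, product_id) VALUES (38, LAST_INSERT_ID());

-- Both rows become visible together; ROLLBACK instead would discard both.
COMMIT;
```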

Performance of simple SELECT operation on big (2 GB) table

I have a really simple query to get MIN and MAX values. It looks like:
SELECT MAX(value_avg)
, MIN(value_avg)
FROM value_data
WHERE value_id = 769
AND time_id BETWEEN 214000 AND 219760;
And here is the schema of the value_data table:
CREATE TABLE `value_data` (
`value_id` int(11) NOT NULL,
`time_id` bigint(20) NOT NULL,
`value_min` float DEFAULT NULL,
`value_avg` float DEFAULT NULL,
`value_max` float DEFAULT NULL,
KEY `idx_vdata_vid` (`value_id`),
KEY `idx_vdata_tid` (`time_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
As you see, the query and the table are simple and I don't see anything wrong here, but when I execute this query, it takes about ~9 seconds to get data. I also made profile of this query, and 99% of time is "Sending data".
The table is really big and it weighs about 2 GB, but is it a problem? I don't think this table is too big, it must be something else...

MySQL can easily handle a database of that size. However, you should be able to improve the performance of this query and probably the table in general. By changing the time_id column to INT UNSIGNED NOT NULL, you can significantly decrease the size of the data and indexes on that column. Also, the query you mention could benefit from a composite index on (value_id, time_id). With that index, MySQL would be able to use the index for both conditions in the WHERE clause instead of just one, as it does now.
Also, please edit your question with an EXPLAIN of the query. It should confirm what I expect about the indexes, but it's always helpful information to have.
Edit:
You don't have a PRIMARY index defined for the table, which definitely isn't helping your situation. If the values of (value_id, time_id) are unique, you should probably make the new composite index I mention above the PRIMARY index for the table.
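A sketch of the suggested index, using the table and query from the question (the index name idx_vdata_vid_tid is made up):

```sql
-- Composite index: the equality column first, then the range column.
ALTER TABLE value_data
  ADD INDEX idx_vdata_vid_tid (value_id, time_id);

-- Only if (value_id, time_id) is known to be unique, make it the primary key
-- instead of adding a secondary index:
-- ALTER TABLE value_data ADD PRIMARY KEY (value_id, time_id);

-- The `key` column of the plan should now show the composite index.
EXPLAIN
SELECT MAX(value_avg), MIN(value_avg)
FROM value_data
WHERE value_id = 769
  AND time_id BETWEEN 214000 AND 219760;
```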

Multi-column index or multiple indexes for timeseries MySQL table?

I have a MySQL MyISAM table with about 400 million rows of price data (7GB data + 9GB index) with 3 columns:
CREATE TABLE `prices` (
`ts` datetime NOT NULL,
`id` int(10) unsigned NOT NULL,
`price` double NOT NULL,
PRIMARY KEY (`ts`,`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
The number of distinct ids (I think cardinality is the word) is ~500 and for most time ranges of interest, inside those time ranges, the cardinality of id is a lower ~20 (so there are only 20 or so different ids between March 1st and 2nd).
The queries are almost exclusively of the form:
select ts, price from prices where ts between {t1} and {t2} and id = {id}.
It seems like some index(s) should speed things up.
Would a combined index on ts and id or separate indexes on ts and id be better? Some 3rd alternative? I would also appreciate recommendations to where I could learn how to answer this question for myself.
Would another table type (InnoDB?) be more appropriate for my purposes?

I'd go for a single combined index on ts, price and id. Normally MySQL does two operations: first it finds the row using the index, then it retrieves the row from the table. However, if all of the data the query needs is in the index, MySQL reads it straight from the index without fetching the row. This is called a "covering index".
On the choice of engine, most people seem to recommend InnoDB for serious use.
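A sketch of such a covering index on the prices table from the question. The answer does not specify a column order; the order below (the equality column id, then the range column ts, then price) is an assumption, as are the index name and the literal values in the query.

```sql
-- All three columns the query touches are in the index.
ALTER TABLE prices
  ADD INDEX idx_id_ts_price (id, ts, price);

-- "Using index" in the Extra column of the plan means the query is answered
-- from the index alone, without reading the table rows.
EXPLAIN
SELECT ts, price
FROM prices
WHERE id = 42
  AND ts BETWEEN '2012-03-01 00:00:00' AND '2012-03-02 00:00:00';
```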
