MySQL: Splitting a large table into partitions or separate tables?

I have a MySQL database with over 20 tables, but one of them is significantly large because it collects measurement data from different sensors. Its size is around 145 GB on disk and it contains over 1 billion records. All this data is also being replicated to another MySQL server.
I'd like to separate the data into smaller "shards", so my question is which of the solutions below would be better. I'd use the record's "timestamp" for dividing the data by years. Almost all SELECT queries that are executed on this table have the "timestamp" field in the WHERE clause.
So below are the solutions that I cannot decide on:
Using the MySQL partitioning and dividing the data by years (e.g. partition1 - 2010, partition2 - 2011 etc.)
Creating separate tables and dividing the data by years (e.g. measuring_2010, measuring_2011 etc. tables)
Are there any other (newer) possible options that I'm not aware of?
I know that in the first case MySQL itself would get the data from the 'shards' and in the second case I'd have to write a kind of wrapper for it and do it by myself. Is there any other way for the second case that would make all separate tables to be seen as 'one big table' to fetch data from?
I know this question has already been asked in the past, but maybe somebody came up with some new solution (that I'm not aware of) or that the best practice solution changed by now. :)
Thanks a lot for your help.
Edit:
The schema is something similar to this:
device_id (INT)
timestamp (DATETIME)
sensor_1_temp (FLOAT)
sensor_2_temp (FLOAT)
etc. (30 more for instance)
All sensor temperatures are written at the same moment, once a minute. Note that around 30 different sensor measurements are written in each row. This data is mostly used for displaying graphs and for some other statistical purposes.

Well, if you are hoping for a new answer, that means you have probably read my answers, and I sound like a broken record. See Partitioning blog for the few use cases where partitioning can help performance. Yours does not sound like any of the 4 cases.
Shrink device_id. INT is 4 bytes; do you really have millions of devices? TINYINT UNSIGNED is 1 byte and a range of 0..255. SMALLINT UNSIGNED is 2 bytes and a range of 0..64K. That will shrink the table a little.
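As a minimal sketch of that change -- assuming the fact table is named measuring (a made-up name; the question never names it) and that device ids really fit in 0..65535:

ALTER TABLE measuring
    MODIFY device_id SMALLINT UNSIGNED NOT NULL;

Note that on a 145GB table this ALTER rebuilds the whole table, so schedule it accordingly.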
If your real question is about how to manage so much data, then let's "think outside the box". Read on.
Graphing... What date ranges are you graphing?
The 'last' hour/day/week/month/year?
An arbitrary hour/day/week/month/year?
An arbitrary range, not tied to day/week/month/year boundaries?
What are you graphing?
Average value over a day?
Max/min over a day?
Candlesticks (etc) for day or week or whatever?
Regardless of the case, you should build (and incrementally maintain) a Summary Table of hourly data. A row would contain summary info for one hour. I would suggest:
CREATE TABLE Summary (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    hr TIMESTAMP NOT NULL,
    avg_val FLOAT NOT NULL,
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, hr)
) ENGINE=InnoDB;
The one Summary table might be 9GB (for current amount of data).
SELECT  hr,
        avg_val,
        min_val,
        max_val
    FROM Summary
    WHERE device_id = ?
      AND sensor_id = ?
      AND hr >= ?
      AND hr < ? + INTERVAL 20 DAY;
Would give you the hi/lo/avg values for 480 hours; enough to graph? Grabbing 480 rows from the summary table is a lot faster than grabbing 60*480 rows from the raw data table.
Getting similar data for a year would probably choke a graphing package, so it may be worth building a summary of the summary -- with resolution of a day. It would be about 0.4GB.
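A possible shape for that second-level table -- a sketch only; the name Summary_Daily is made up. Averaging the hourly averages is exact here because every hour holds the same number of one-minute readings:

CREATE TABLE Summary_Daily (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    dy DATE NOT NULL,
    avg_val FLOAT NOT NULL,
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, dy)
) ENGINE=InnoDB;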
There are a few different ways to build the Summary table(s); we can discuss that after you have pondered its beauty and read Summary tables blog. It may be that gathering one hour's worth of data, then augmenting the Summary table, is the best way. That would be somewhat like the flip-flop discussed in my Staging table blog.
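One hedged sketch of such an hourly roll-up, assuming the raw table is named measuring and keeps one column per sensor (sensor_1_temp shown; repeat, or UNION ALL, for the other sensor columns):

INSERT INTO Summary (device_id, sensor_id, hr, avg_val, min_val, max_val)
    SELECT  device_id,
            1 AS sensor_id,                       -- the sensor column being summarized
            '2017-10-01 12:00:00' AS hr,          -- the hour just completed
            AVG(sensor_1_temp), MIN(sensor_1_temp), MAX(sensor_1_temp)
        FROM measuring
        WHERE `timestamp` >= '2017-10-01 12:00:00'
          AND `timestamp` <  '2017-10-01 12:00:00' + INTERVAL 1 HOUR
        GROUP BY device_id;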
And, if you have the hourly summaries, do you really need the minute-by-minute data? Consider throwing it away. Or, maybe, throw away data older than, say, one month. That leads to using partitioning, but only for its benefit in deleting old data, as discussed in "Case 1" of Partitioning blog. That is, you would have daily partitions, using DROP PARTITION and REORGANIZE PARTITION every night to shift the time window of the "Fact" table. This would shrink your 145GB footprint without losing much data. New footprint: about 12GB (hourly summary + last 30 days' minute-by-minute detail).
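Here is a hedged sketch of that layout, with made-up partition names and dates; note that MySQL requires the partitioning column (`timestamp`) to be part of every unique key on the table:

ALTER TABLE measuring
    PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
        PARTITION p20171001 VALUES LESS THAN (TO_DAYS('2017-10-02')),
        PARTITION p20171002 VALUES LESS THAN (TO_DAYS('2017-10-03')),
        PARTITION pFuture   VALUES LESS THAN MAXVALUE
    );

-- Nightly maintenance: drop the oldest day, then carve tomorrow out of pFuture.
ALTER TABLE measuring DROP PARTITION p20171001;
ALTER TABLE measuring REORGANIZE PARTITION pFuture INTO (
    PARTITION p20171003 VALUES LESS THAN (TO_DAYS('2017-10-04')),
    PARTITION pFuture   VALUES LESS THAN MAXVALUE
);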
PS: The Summary Table blog shows how to get standard deviation.

You haven't said much about how you use/query the data or what the schema looks like, but I'll try to make something up.
One way you can split your table is based on entities (different sensors are different entities). That's useful if different sensors require different columns, so you don't have to force them all into one schema that fits every sensor (a least common multiple of their columns). It's not a good fit, though, if sensors are added or removed dynamically, since you would have to add tables at runtime.
Another approach is to split the table based on time. This is the case if, after some time, data can be "historized": it is no longer used for the actual business logic, only for statistical purposes.
Both approaches can also be combined. Furthermore, be sure that the table is properly indexed according to your query needs.
I strongly discourage any approach that often requires adding a table after some time or anything similar. As always I wouldn't split anything before there's a performance issue.
Edit:
I would clearly restructure the table to the following and not split it at all:
device_id (INT)
timestamp (DATETIME)
sensor_id (INT)     -- could be unique or not; if sensor_id is not unique, make a
                    -- composite key from device_id and sensor_id, given that you
                    -- need it for queries
sensor_temp (FLOAT)
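A hedged sketch of that narrow layout (the table name measurement and the composite primary key, with `timestamp` added to keep rows unique over time, are my assumptions, not part of the answer above):

CREATE TABLE measurement (
    device_id   INT      NOT NULL,
    `timestamp` DATETIME NOT NULL,
    sensor_id   INT      NOT NULL,
    sensor_temp FLOAT    NOT NULL,
    PRIMARY KEY (device_id, sensor_id, `timestamp`)
) ENGINE=InnoDB;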
If data grows fast and you're expecting to generate terabytes of data soon you are better off with a NoSQL approach. But that's a different story.

Related

How to create sub tables in MySQL? [duplicate]


How to design a database/table which adds a lot of rows every minute

I am in a situation where I need to store data for 1900+ cryptocurrencies every minute; I use MySQL InnoDB.
Currently, the table looks like this
coins_minute_id | coins_minute_coin_fk | coins_minute_usd | coins_minute_btc | coins_minute_datetime | coins_minute_timestamp
coins_minute_id = AUTO_INCREMENT id
coins_minute_coin_fk = MEDIUMINT UNSIGNED
coins_minute_usd = DECIMAL(20,6)
coins_minute_btc = DECIMAL(20,8)
coins_minute_datetime = DATETIME
coins_minute_timestamp = TIMESTAMP
The table grows incredibly fast; every minute 1900+ rows are added to it.
The data will be used for historical price display as a D3.js line graph for each cryptocurrency.
My question is: how do I best optimize this database? I have thought of only collecting the data every 5 minutes instead of every minute, but that will still add up to a lot of data in no time. I have also wondered whether it would be better to create a separate table for each cryptocurrency. Does anyone who loves designing databases know some other very smart and clever way to do this?
Kind regards
(From Comment)
SELECT coins_minute_coin_fk, coins_minute_usd
FROM coins_minutes
WHERE coins_minute_datetime >= DATE_ADD(NOW(),INTERVAL -1 DAY)
AND coins_minute_coin_fk <= 1000
ORDER BY coins_minute_coin_fk ASC
Get rid of coins_minute_ prefix; it clutters the SQL without providing any useful info.
Don't specify the time twice -- there are simple functions to convert between DATETIME and TIMESTAMP. Why do you have both a 'created' and an 'updated' timestamp? Are you doing UPDATE statements? If so, the code is more complicated than simply "inserting", and you need a unique key to know which row to update.
Provide SHOW CREATE TABLE; it is more descriptive than what you provided.
30 inserts/second is easily handled. 300/sec may have issues.
Do not PARTITION the table without some real reason to do so. The common valid reason is that you want to delete 'old' data periodically. If you are deleting after 3 months, I would build the table with PARTITION BY RANGE(TO_DAYS(...)) and use weekly partitions. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
Show us the queries. A schema cannot be optimized without knowing how it will be accessed.
"Batch" inserts are much faster than single-row INSERT statements. This can be in the form of INSERT INTO x (a,b) VALUES (1,2), (11,22), ... or LOAD DATA INFILE. The latter is very good if you already have a CSV file.
Does your data come from a single source? Or 1900 different sources?
MySQL and MariaDB are probably identical for your task. (Again, need to see queries.) PDO is fine for either; no recoding needed.
After seeing the queries, we can discuss what PRIMARY KEY to have and what secondary INDEX(es) to have.
1 minute vs 5 minutes? Do you mean that you will gather only one-fifth as many rows in the latter case? We can discuss this after the rest of the details are brought out.
That query does not make sense in multiple ways. Why stop at "1000"? The output is quite large; what client cares about that much data? The ordering is indefinite -- the datetime is not guaranteed to be in order. Why select the usd without also selecting the datetime? Please provide a realistic query; then I can help you with INDEX(es).

In MySql, is it worthwhile creating more than one multi-column indexes on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date | time |instrument|price |quantity
----------|-----------------------|----------|-------|--------
2017-09-08|2017-09-08 13:16:30.919|12899586 |54.15 |8000
2017-09-08|2017-09-08 13:16:30.919|13793026 |1177.75|750
2017-09-08|2017-09-08 13:16:30.919|1346049 |1690.8 |1
2017-09-08|2017-09-08 13:16:30.919|261889 |110.85 |50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date
or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry too much about the additional database write time due to index updates?
I get data every second, and it takes about 100 ms to process it and store it in the database. As long as I continue to take less than 1 second, I am fine.
To get the most efficient query you need to query on a clustered index. According to the documentation this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument, as sketched below.
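A sketch of that suggestion -- the column types are guesses from the sample rows (DATETIME(3) needs MySQL 5.6+), and note that a later answer points out millisecond timestamps can still collide:

CREATE TABLE trade (
    `time`     DATETIME(3)   NOT NULL,
    instrument INT UNSIGNED  NOT NULL,
    price      DECIMAL(10,5) NOT NULL,
    quantity   INT UNSIGNED  NOT NULL,
    PRIMARY KEY (`time`, instrument)
) ENGINE=InnoDB;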
A couple of recommendations:
There is no need to store date and time separately if the time column already includes the date. You can instead have one datetime column and store timestamps in it
You can then have one index on the datetime and instrument columns; that will make the queries run faster
With so many inserts and fixed format of SELECT query (i.e. always by date first, followed by instrument), I would suggest looking into other columnar databases (like Cassandra). You will get faster writes and reads for such structure
First, your use case sounds like two indexes would be useful (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
^^^
Optimal index for such is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: This pattern works for DATE, DATETIME, DATETIME(3), etc, without messing up with the midnight at the end of the range.
Data volume?
150M rows? 10 new rows per second? That means you have about 5 years' data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned?
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes take time when creating/inserting (once), but shave time when retrieving (many, many times).
My experience is to create indexes with all the relevant fields, in all the orders queries are likely to use. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.

Best way to store huge log data

I need an advice on optimal approach to store statistical data. There is a project on Django, which has a database (mysql) of 30 000 online games.
Each game has three statistical parameters:
number of views,
number of plays,
number of likes
Now I need to store historical data for these three parameters on a daily basis, so I was thinking of creating a single table with five columns:
gameid, number of views, plays, likes, date (day-month-year data).
So in the end, every game will be logged in one row every day; in one day this table will gain 30 000 rows, in 10 days it will have 300 000, and in a year it will have 10 950 000 rows. I'm not a big specialist in DBA stuff, but this tells me that it will quickly become a performance problem. I'm not talking about what will happen in 5 years' time.
The data collected in this table is needed for simple graphs
(daily, weekly, monthly, custom range).
Maybe you have better ideas on how to store this data? Maybe NoSQL would be more suitable in this case? I really need your advice on this.
Partitioning in PostgreSQL works great for big logs. First create the parent table:
create table game_history_log (
    gameid integer,
    views integer,
    plays integer,
    likes integer,
    log_date date
);
Now create the partitions. In this case one for each month, 900 k rows, would be good:
create table game_history_log_201210 (
    check (log_date between '2012-10-01' and '2012-10-31')
) inherits (game_history_log);
create table game_history_log_201211 (
    check (log_date between '2012-11-01' and '2012-11-30')
) inherits (game_history_log);
Notice the check constraints in each partition. If you try to insert in the wrong partition:
insert into game_history_log_201210 (
gameid, views, plays, likes, log_date
) values (1, 2, 3, 4, '2012-09-30');
ERROR: new row for relation "game_history_log_201210" violates check constraint "game_history_log_201210_log_date_check"
DETAIL: Failing row contains (1, 2, 3, 4, 2012-09-30).
One of the advantages of partitioning is that it will only search in the correct partition reducing drastically and consistently the search size regardless of how many years of data there is. Here the explain for the search for a certain date:
explain
select *
from game_history_log
where log_date = date '2012-10-02';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..30.38 rows=9 width=20)
   ->  Append  (cost=0.00..30.38 rows=9 width=20)
         ->  Seq Scan on game_history_log  (cost=0.00..0.00 rows=1 width=20)
               Filter: (log_date = '2012-10-02'::date)
         ->  Seq Scan on game_history_log_201210 game_history_log  (cost=0.00..30.38 rows=8 width=20)
               Filter: (log_date = '2012-10-02'::date)
Notice that apart from the parent table it only scanned the correct partition. Obviously you can have indexes on the partitions to avoid a sequential scan.
See the PostgreSQL documentation on Inheritance and Partitioning.
11M rows isn't excessive, but indexing in general and the clustering of your primary key will matter more (on InnoDB). I suggest (game_id, date) for a primary key so that queries for all data about a certain game read sequential rows; a sketch follows. Also, you may want to keep a separate table of just the current values for ranking games, etc., when only the latest figures are necessary.
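A hedged sketch of that layout, assuming MySQL/InnoDB and unsigned integer counters (the exact types are guesses):

CREATE TABLE game_history_log (
    game_id  INT UNSIGNED NOT NULL,
    log_date DATE         NOT NULL,
    views    INT UNSIGNED NOT NULL,
    plays    INT UNSIGNED NOT NULL,
    likes    INT UNSIGNED NOT NULL,
    PRIMARY KEY (game_id, log_date)
) ENGINE=InnoDB;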
There are no performance problems in MySQL with 10kk rows. You can just apply partitioning by game id (requires at least version 5.5).
I have a MySQL DB with data like this and currently there are no problems with 980kk rows.
I would advise not using a relational database at all.
Statistics are the sort of thing that changes pretty rapidly, because new data is constantly arriving.
I believe something like HBase would fit better, because adding new records works faster there.
Instead of keeping every row, keep recent data at high precision, middle data at medium precision, and long-term data at low precision. This is the approach taken by rrdtool, which might be better for this than MySQL.

...still not getting results trying to optimize mysql innodb table for fast count

I posted this question here a while ago. I tried out the suggestions and came to the conclusion that I must be doing something fundamentally wrong.
What I basically want to do is this:
I have a table containing 83 million time/price pairs. As the index I'm using a millisecond-accurate Unix timestamp, and the price ranges between 1.18775 and 1.60400 (decimal with precision 5).
I have a client that needs to get the price densities for a given time interval, meaning I want to take a specified interval of time and count how many times each of the different prices appears in that interval.
How would you do this? How would you design/index the table? Right now I'm building a temporary subtable containing only the data for the given interval and then doing the counts on the prices. Is there a better way to do this? My general DB settings are already tuned and pretty performant. Thanks for any hints! I will provide any additional information needed as fast as I can!
Given that you have a large amount of data and it's growing very rapidly, I'd be inclined to add a second table of:
price (primary key)
time (some block -- also part of the PK)
count
Do an 'insert on duplicate key update count++' sort of thing (a minimal sketch follows at the end of this answer). Group the time field by some predetermined interval (depends on the sorts of queries you get: ms/sec/hour/whatever). This way you:
don't have to mess with temp tables - with a table of this size it will write to disk - slow even with SSD
don't have to touch the initial table every time you want to do your query - might run into locking issues
You will have to average out your data a bit, but the granularity can be predetermined to cause as few issues as possible.
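A minimal sketch of that upsert approach, assuming a made-up table name price_density and one-minute time blocks:

CREATE TABLE price_density (
    price      DECIMAL(10,5) NOT NULL,
    time_block DATETIME      NOT NULL,   -- the pair's timestamp truncated to the block size
    cnt        INT UNSIGNED  NOT NULL DEFAULT 1,
    PRIMARY KEY (price, time_block)
) ENGINE=InnoDB;

INSERT INTO price_density (price, time_block)
    VALUES (1.23456, '2012-10-02 14:07:00')
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;

Getting the densities for an interval is then a SUM(cnt) grouped by price over the matching time blocks, rather than a scan of the raw table.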