MySQL: Partition-like function for a single set of data?

I have a table that has millions of records, and they utilize EFF_FROM and EFF_TO date fields to version the records.
99% of the time, when this table is queried by an application, it is only concerned with records that have an EFF_TO of 2099-12-31, or records that are active and not historical.
I copied just the active records to a test version of the table and the application's SELECT query went from 60 seconds to 3 seconds.
I don't necessarily want to partition every EFF_TO date. I don't want to add that overhead especially to processes that populate the table. I only want the optimization for querying records with 2099-12-31, and I want the performance to be instant.
Is there a straightforward way to do this? Or do I have to resort to creating an active table and a historical table?

"Partition like function for a single set of data?"
This is something of an oxymoron; however, you are asking about partitioning into two sets of data: one where EFF_TO is in the future and one where it is in the past.
"have an EFF_TO of 2099-12-31"
Design fault - these should be null.
If they were null, then the partitioning would be simple. As it stands you will have to drop and recreate the partitions, which is a rather expensive operation (have a look at tools for doing online schema updates).
You could minimize the impact by creating multiple partitions defining the period around NOW, then adding an extra one onto the end and removing one from the beginning at regular intervals.
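As an alternative to a rolling window around NOW, a two-partition layout keyed on the existing 2099-12-31 sentinel can be sketched as follows. This is only a sketch under assumptions: the table name versioned_records is hypothetical, EFF_TO is assumed to be a DATE column, and MySQL requires the partitioning column to be part of every unique key (including the primary key) on the table.

```sql
-- Hypothetical table name; EFF_TO is assumed to be a DATE column that is
-- included in every unique key on the table (a MySQL partitioning rule).
ALTER TABLE versioned_records
  PARTITION BY RANGE COLUMNS (EFF_TO) (
    PARTITION p_history VALUES LESS THAN ('2099-12-31'),
    PARTITION p_active  VALUES LESS THAN (MAXVALUE)
  );

-- In MySQL 5.7 and later, the "partitions" column of EXPLAIN shows
-- whether only p_active is read for the common query.
EXPLAIN SELECT * FROM versioned_records WHERE EFF_TO = '2099-12-31';
```

With this layout, closing a record by updating its EFF_TO moves the row from p_active to p_history, so the process that populates the table still pays for that move.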
"application's SELECT query went from 60 seconds to 3 seconds."
There are many possible reasons for the improvement other than the size of the table:
If it is doing a full table scan, that is a design fault in the application.
Your indexes may not be as up to date as they should be.
The logical structure of the indexes may be unbalanced and need optimizing.
The physical structure of the table and indexes may be fragmented and need optimizing.
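These possibilities can be checked with standard MySQL statements before changing the schema. The sketch below assumes a hypothetical table name, versioned_records, and uses the query pattern described in the question.

```sql
-- Hypothetical table name. A "type" of ALL in the EXPLAIN output
-- indicates a full table scan.
EXPLAIN SELECT * FROM versioned_records WHERE EFF_TO = '2099-12-31';

-- Refresh the index statistics used by the optimizer.
ANALYZE TABLE versioned_records;

-- Rebuild the table and its indexes to remove fragmentation.
-- On a table with millions of rows this can take a long time.
OPTIMIZE TABLE versioned_records;
```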

Related

Choosing the right MySQL structure for a very large time-based dataset

I have been using MySQL for the past few months and I have a good handle on smaller database structures. Now, however, I need to decide on how to create a database that can store a large set of time oriented data in either multiple tables, or a single table.
Using a single table, I have tried partitioning it into yearly segments, however, the load times and insert times are still quite long. Especially for searching. The data consists of roughly 8000 reporting stations with about 300-500 reports per day (several per hour). The reports go back all the way to 1980, so easily over 120 million data points and growing.
I am not sure what may provide the best results for searching such a vast amount of data, or if it would be better to separate the data into several tables. Each report has only a couple columns of information (time, temperature and wind).
I am sure this question has been asked many times, but any help would be appreciated.
Thank you!
120M rows is big enough to consider PARTITIONing. And that is good for time-based data if you need to delete "old" data. This is because DROP PARTITION is a lot faster and less invasive than DELETE.
I discuss this at length here.
Loading into a partitioned table should be only slightly slower (or faster in rare cases) than for a non-partitioned table.
Searching problems -- sounds like you did not index the table properly. Some tips:
(Usually) Put the "partition key" last in any index, if it is needed at all.
Use PARTITION BY RANGE(TO_DAYS(...)) only.
40 years? 40 partitions is reasonable.
Do not partition by station, but probably use that column at the start of some indexes.
Please show me the CREATE TABLE so I can be more specific in my tips.
If you won't be deleting 'old' rows, then partitioning is probably a waste. Let's see some of the queries.
On the other hand, if you often use a date range and several stations, then you have the "2D index problem". Partition by year; start the PRIMARY KEY with station.
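A minimal sketch of that layout, pending the real CREATE TABLE, might look like this. The table name, column names, and datatypes are assumptions made for illustration; only two yearly partitions are shown.

```sql
-- Hypothetical schema: names and datatypes are assumptions.
CREATE TABLE reports (
  station_id  SMALLINT UNSIGNED NOT NULL,
  report_time DATETIME NOT NULL,
  temperature DECIMAL(4,1) NULL,
  wind_speed  DECIMAL(4,1) NULL,
  -- station first, partition key (report_time) last
  PRIMARY KEY (station_id, report_time)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(report_time)) (
  PARTITION p2019    VALUES LESS THAN (TO_DAYS('2020-01-01')),
  PARTITION p2020    VALUES LESS THAN (TO_DAYS('2021-01-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- A date range for a few stations: partition pruning narrows the years,
-- then the PRIMARY KEY narrows to each station's rows.
SELECT report_time, temperature, wind_speed
FROM reports
WHERE station_id IN (17, 42)
  AND report_time >= '2020-03-01'
  AND report_time <  '2020-04-01';
```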
Do not use multiple tables. This is a common Question on this forum, and the answer is always the same.
Quite possibly you need some sort of "summary table". It might include high, low, average temp, etc for each week. For, say, a multi-year temperature graph, this is clearly 7 times as fast. More here.
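A weekly summary table along those lines could be sketched as below. It assumes a hypothetical raw table reports(station_id, report_time, temperature); all names and datatypes are illustrative. Storing the sum and count rather than the average allows weeks to be combined into longer periods later.

```sql
-- Hypothetical summary table; one row per station per week.
CREATE TABLE weekly_summary (
  station_id SMALLINT UNSIGNED NOT NULL,
  week_start DATE NOT NULL,            -- Monday of the week
  temp_count INT UNSIGNED NOT NULL,    -- number of non-NULL readings
  temp_sum   DECIMAL(12,1) NULL,
  temp_min   DECIMAL(4,1) NULL,
  temp_max   DECIMAL(4,1) NULL,
  PRIMARY KEY (station_id, week_start)
) ENGINE=InnoDB;

-- Summarize one week of raw data (2020-01-06 is a Monday).
INSERT INTO weekly_summary
  (station_id, week_start, temp_count, temp_sum, temp_min, temp_max)
SELECT station_id,
       DATE(report_time) - INTERVAL WEEKDAY(report_time) DAY AS week_start,
       COUNT(temperature), SUM(temperature),
       MIN(temperature), MAX(temperature)
FROM reports
WHERE report_time >= '2020-01-06'
  AND report_time <  '2020-01-13'
GROUP BY station_id, week_start;

-- Multi-year graph for one station, read from the summary only.
SELECT week_start, temp_min, temp_max, temp_sum / temp_count AS temp_avg
FROM weekly_summary
WHERE station_id = 42
ORDER BY week_start;
```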
Inserting only 37 rows/second should not be a problem, even on a slow HDD. If they come in batches, then batch the INSERTs via multiple rows per INSERT statement or via LOAD DATA.
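Both batching options can be sketched as follows, assuming a hypothetical table reports(station_id, report_time, temperature, wind_speed) and a hypothetical CSV file path. LOAD DATA LOCAL also requires local_infile to be enabled on both client and server.

```sql
-- Several rows in one INSERT statement (values are illustrative).
INSERT INTO reports (station_id, report_time, temperature, wind_speed)
VALUES
  (17, '2020-03-01 00:10:00', 4.5, 12.0),
  (17, '2020-03-01 00:20:00', 4.4, 11.5),
  (42, '2020-03-01 00:10:00', 9.1,  3.0);

-- Bulk load from a CSV file (hypothetical path and file layout).
LOAD DATA LOCAL INFILE '/tmp/reports.csv'
INTO TABLE reports
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(station_id, report_time, temperature, wind_speed);
```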

Mysql what if too much data in a table

Data is increasing in one table every day, which might lower performance. I was thinking of creating a trigger that moves table A into A1 and creates a new table A at regular intervals, so that inserts and updates on table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, when inserting or updating 1000 rows per second in table A, how is the performance after 3 years?)
We are designing software for a factory. There are production lines on which PCBs are made. We need to insert almost 60 PCB records per second for years. (1000 rows seems to be an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
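The monthly maintenance could look roughly like this. It assumes a hypothetical table pcb_records that is already partitioned by RANGE(TO_DAYS(...)) into monthly partitions named pYYYY_MM, plus a catch-all partition p_future defined with VALUES LESS THAN MAXVALUE; the names and dates are illustrative.

```sql
-- Purge the oldest month; this drops the partition's data
-- without a row-by-row DELETE.
ALTER TABLE pcb_records DROP PARTITION p2021_03;

-- Carve the next month out of the catch-all partition.
ALTER TABLE pcb_records REORGANIZE PARTITION p_future INTO (
  PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);
```

Running the REORGANIZE before any rows for the new month arrive keeps p_future empty, so the split has almost nothing to copy.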
Very large tables may decrease the performance of the server, so here is a way to handle this:
1) Create another table to store archive (old) data, using the ARCHIVE storage engine ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html ).
2) Create a MySQL scheduled job (event) to move older records to the archive table. Schedule it for a time slot when the server is most idle.
3) After moving older records to the archive table, re-index the original table.
This should serve the purpose of performance.
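Step 2 could be sketched with the MySQL event scheduler as below. The table names a and a_archive, the created_at column, and the 90-day cutoff are all assumptions; a_archive is assumed to already exist with the same column layout, and the event scheduler must be enabled (event_scheduler = ON). Note that the ARCHIVE engine is not transactional, so the copy and the delete are not atomic.

```sql
DELIMITER //

-- Hypothetical names: a, a_archive, created_at, move_old_rows.
CREATE EVENT move_old_rows
ON SCHEDULE EVERY 1 DAY
  STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 3 HOUR  -- 03:00 tomorrow
DO
BEGIN
  -- Fix the cutoff once so both statements see the same set of rows.
  DECLARE cutoff DATETIME DEFAULT NOW() - INTERVAL 90 DAY;

  INSERT INTO a_archive SELECT * FROM a WHERE created_at < cutoff;
  DELETE FROM a WHERE created_at < cutoff;
END//

DELIMITER ;
```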
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index does add overhead, especially at row insert time. Assuming there are more reads than inserts, the trade-off is worthwhile because most queries complete quickly thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

In MySql, is it worthwhile creating more than one multi-column indexes on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date      |time                   |instrument|price  |quantity
----------|-----------------------|----------|-------|--------
2017-09-08|2017-09-08 13:16:30.919|12899586  |54.15  |8000
2017-09-08|2017-09-08 13:16:30.919|13793026  |1177.75|750
2017-09-08|2017-09-08 13:16:30.919|1346049   |1690.8 |1
2017-09-08|2017-09-08 13:16:30.919|261889    |110.85 |50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry much about the additional database write time due to index updates?
I get data every second, and take about 100 ms to process it and store in a database. As long as I continue to take less than 1 sec I am fine.
To get the most efficient query you need to query on a clustered index. According to the documentation, this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument
A couple of recommendations:
There is no need to store date and time separately if time corresponds to time of the same date. You can instead have one datetime column and store timestamps in it
You can then have one index on datetime and instrument columns, that will make the queries run faster
With so many inserts and a fixed format of SELECT query (i.e. always by date first, followed by instrument), I would suggest looking into other databases, such as wide-column stores (like Cassandra). You will get faster writes and reads for such a structure.
First, your use case sounds like two indexes would be useful (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
                                            ^^^
Optimal index for such is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: This pattern works for DATE, DATETIME, DATETIME(3), etc, without messing up with the midnight at the end of the range.
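Putting the two suggestions together (an index led by instrument, and a single time column queried with a half-open range), a sketch for the trade table might be the following. The index name inst_time is hypothetical, and the sketch assumes the separate date column has been dropped.

```sql
-- Hypothetical index name; instrument first, time last.
ALTER TABLE trade ADD INDEX inst_time (instrument, `time`);

-- One instrument, one day. The half-open range includes everything
-- on 2017-09-08 and nothing from midnight of the next day onward.
SELECT *
FROM trade
WHERE instrument = 261889
  AND `time` >= '2017-09-08'
  AND `time` <  '2017-09-08' + INTERVAL 1 DAY;

-- Widening to a week only changes the upper bound.
EXPLAIN SELECT *
FROM trade
WHERE instrument = 261889
  AND `time` >= '2017-09-08'
  AND `time` <  '2017-09-08' + INTERVAL 7 DAY;
```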
Data volume?
150M rows? 10 new rows per second? That means you have about 5 years' data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned?
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes take time creating/inserting (once), but shave time retrieving (Many many times)
In my experience, it helps to create indexes covering the relevant fields in the various column orders. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.

Is adding and dropping indexes everyday on huge tables a good practice?

I'm building a Web Application that is connected to a MySQL database.
I've got two huge tables, each containing about 40 million rows at the moment, and they receive new rows every day (roughly 500,000 to 1,000,000 rows per day).
The process to add new rows runs during the night, while no one can use the application, and the new rows' content depends on the result of some basic SELECT queries on the current database.
In order to get the results of those SELECT statements fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.
The thing is, during the day, some totally different queries are run against those tables, including some "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)).
I found on stack this very helpful mini-cookbook that advises you on which INDEXes you should use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
It advises using a compound index: for my example query above, that would be INDEX(a, b, date).
It indeed increased the speed of the queries run during the day (from 1 minute to 8 seconds, so I was truly happy).
However, with those compound indexes, the time required to add new rows during the night totally explodes (it would take more than one day to add the daily content).
Here is my question: would that be ok to drop all the indexes every night, add the new content, and set back up the daily indexes?
Or would that be dangerous since indexes are not meant to be rebuilt every day, especially on such big tables?
I know such an operation would take approximately two hours in total (drop and recreate INDEXes).
I am aware of the existence of ALTER TABLE table_name DISABLE KEYS; but I'm using InnoDB and I believe it does not work on InnoDB tables.
I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.
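The nightly sequence could be sketched as below, using the table and column names from the question (t1, a, b, date). The index name idx_a_b_date is hypothetical.

```sql
-- Before the nightly load: drop the compound index used by daytime queries.
ALTER TABLE t1 DROP INDEX idx_a_b_date;

-- The nightly bulk INSERTs run here, with only the indexes they need.

-- After the load: rebuild the compound index in one pass.
ALTER TABLE t1 ADD INDEX idx_a_b_date (a, b, `date`);
```

If several indexes are dropped and recreated, listing them in a single ALTER TABLE statement avoids processing the table once per index.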
I would ask about how you are inserting new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no index) and do a bulk insert:
insert into bigtable( . . .)
select . . .
from smalltable;
These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.
A digression... PARTITIONing by date should be very useful for you since you are deleting things older than a year. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking it into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.
More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.
But back to your clever idea of dropping and rebuilding indexes. To others, I point out the caveat that you have the luxury of not otherwise touching the table for long enough to do the rebuild.
With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? This partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM, thereby be 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and RAM size, I can help you do the arithmetic to see if your 'last' partition will fit in RAM.
A note about index updates in InnoDB -- they are 'delayed' by sitting in the "Change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_max_size, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)
The default for that setting is 25, meaning that up to 25% of the buffer_pool is set aside for pending updates to indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, hence finish faster.
Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:
Just before the nightly INSERTs:
SET GLOBAL innodb_change_buffer_max_size = 50;  -- 50 is the maximum allowed value
SET GLOBAL innodb_old_blocks_pct = 10;
Soon after the nightly INSERTs:
SET GLOBAL innodb_change_buffer_max_size = 25;
SET GLOBAL innodb_old_blocks_pct = 37;
(I am not sure about that other setting, but it seems reasonable to push it out of the way.)
Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.
In a similar application, I had big, hourly, dumps to load into a table, and a 90-day retention. I stretched my Partition rules by having 90 daily partitions and 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions is a performance killer until 8.0; so don't even consider daily partitions for your 1-year retention.)
The Summary tables made it so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That will let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: Don't have any indexes (other than the PRIMARY KEY) on the Fact table; use Summary tables for things that need messier indexes.

Move inactive rows to another table?

I have a table where when a row is created, it will be active for 24 hours with some writes and lots of reads. Then it becomes inactive after 24 hours and will have no more writes and only some reads, if any.
Is it better to keep these rows in the table or move them when they become inactive (or via batch jobs) to a separate table? Thinking in terms of performance.
This depends largely on how big your table will get, but if it grows forever, and has a significant number of rows per day, then there is a good chance that moving old data to another table would be a good idea. There are a few different ways you could accomplish this, and which is best depends on your application and data access patterns.
1. Essentially as you said, when a row becomes "old", INSERT to the archive table, and DELETE from the current table.
2. Create a new table every day (or perhaps every week, or every month, depending on how big your dataset is), and never worry about moving old rows. You'll just have to query old tables when accessing old data, but for the current day, you only ever access the current table.
3. Have a "today" table and an "all time" table. Duplicate the "today" rows in both tables, keeping them in sync with triggers or other mechanisms. When a row becomes old, simply delete from the "today" table, leaving the "all time" row intact.
One advantage to #2, that may not be immediately obvious, is that I believe MySQL indexes can be optimized for read-only tables. So by having old tables that are never written to, you can take advantage of this extra optimization.
Generally moving rows between tables in proper RDBMS should not be necessary.
I'm not familiar with mysql specifics, but you should do fine with the following:
Make sure your timestamp column is indexed
In addition, you can use active BOOLEAN default true column
Make a batch run every day to mark >24h old rows inactive
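Use a partial index for timestamp column so only rows marked active are indexed (MySQL does not support partial indexes; a composite index on (active, timestamp) is the closest equivalent there)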
Remember to have timestamp and active = TRUE in your where conditions to hit indexes. Use EXPLAIN a lot.
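Translated to MySQL, which has no partial indexes, the steps above might be sketched as follows. The table name items and the column created_at are hypothetical, and a composite index stands in for the partial index.

```sql
-- Hypothetical table and column names.
ALTER TABLE items
  ADD COLUMN active BOOLEAN NOT NULL DEFAULT TRUE,
  ADD INDEX idx_active_created (active, created_at);

-- Daily batch run: mark rows older than 24 hours as inactive.
UPDATE items
SET active = FALSE
WHERE active = TRUE
  AND created_at < NOW() - INTERVAL 24 HOUR;

-- Reads filter on both columns so the composite index can be used;
-- EXPLAIN shows whether idx_active_created is chosen.
EXPLAIN SELECT *
FROM items
WHERE active = TRUE
  AND created_at >= NOW() - INTERVAL 24 HOUR;
```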
That all depends on the balance between ease of programming, and performance. Performance wise, yes it will definitely be faster. But whether the speed increase is worth the effort is hard to say.
I've worked on systems that run perfectly fine with millions of rows. However, if the data is ever growing it does eventually become a problem.
I've worked on a database storing transaction logging for automated equipment. It generates hundreds of thousands of events per day. After a year, the queries just wouldn't run at acceptable speeds any more. We now keep the last month's worth of logs in the main table (millions of rows still), and move older data to archive tables.
None of the application's functionality ever looks in the archive table (if you do a query of the transaction log, it will return no results). It is only really kept for emergency use, and is just queried with any standalone database query tool. Because the archive has well over a hundred million rows, and the nature of this emergency use is generally unplannable (and therefore mostly un-indexed) queries, they can take a long time to run.
There is another solution: keep a second table containing only the active records (tblactiverecords). When the number of active records is small, you can do an inner join to get the active records. This should take very little time, because primary keys are indexed by default in MySQL. As rows become inactive, delete them from the tblactiverecords table.
create table tblrecords (id int primary key, data text);
Then,
create table tblactiverecords (tblrecords_id int primary key);
you can do
select data from tblrecords join tblactiverecords on tblrecords.id = tblactiverecords.tblrecords_id;
to get all data that are active.