Best practices with historical data in MySQL database

Recently I've been thinking about best practices for storing historical data in a MySQL database. For now, each versionable table has two columns, valid_from and valid_to, both of DATETIME type. A record with current data has valid_from set to its creation date. When I update the row, I set valid_to to the update date and add a new record whose valid_from equals the previous row's valid_to - easy stuff. But I know the table will become enormous very quickly, so fetching data could become very slow.
Do you have any recommended practices for storing historical data?

It's a common mistake to worry about "large" tables and performance. If you can use indexes to access your data, it doesn't really matter whether you have 1000 or 1000000 records - at least not so much that you'd be able to measure it. The design you mention is commonly used; it's a great design where time is a key part of the business logic.
For instance, if you want to know what the price of an item was at the point when the client placed the order, being able to search product records where valid_from <= order_date and valid_to is either null or > order_date is by far the easiest solution.
This isn't always the case - if you're keeping the data around just for archive purposes, it may make more sense to create archive tables. However, you have to be sure that time is really not part of the business logic, otherwise the pain of searching multiple tables will be significant - imagine having to search either the product table OR the product_archive table every time you want to find out about the price of a product at the point the order was placed.
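A minimal sketch of the point-in-time lookup described above. It uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere; the product table, the sku and price columns, and the sample values are assumptions for illustration, and only valid_from/valid_to come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product (
        sku        TEXT NOT NULL,
        price      REAL NOT NULL,
        valid_from TEXT NOT NULL,   -- DATETIME in MySQL
        valid_to   TEXT             -- NULL marks the current version
    )
    """
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [
        ("widget", 9.99, "2023-01-01 00:00:00", "2023-06-01 00:00:00"),
        ("widget", 12.49, "2023-06-01 00:00:00", None),
    ],
)

def price_at(sku, moment):
    """Return the price that was valid for `sku` at `moment`, or None."""
    row = conn.execute(
        """
        SELECT price FROM product
        WHERE sku = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        """,
        (sku, moment, moment),
    ).fetchone()
    return row[0] if row else None

old_price = price_at("widget", "2023-03-15 10:00:00")
current_price = price_at("widget", "2023-08-01 10:00:00")
```

Treating each version as a half-open interval (valid_from inclusive, valid_to exclusive) means a moment that falls exactly on an update matches only the newer row, even though the old row's valid_to and the new row's valid_from hold the same value.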

This is not a complete answer, just a few suggestions.
You can add an indexed boolean field like is_valid. This should improve performance on a big table that holds both historical and current records.
In general, storing historical data in a separate table may complicate your application (just imagine the complexity of a query that is supposed to get data from mixed current and historical records...).
Computers today are really fast. I think you should compare and test performance with a single table versus a separate table for historical records.
In addition, test on your hardware to see how fast MySQL is with big tables before deciding how to design the database. If it's too slow for you, you can tune the MySQL configuration (start with increasing cache/RAM).

I'm nearing completion of an application which does exactly this. Most of my indexes index by key fields first and then the valid_to field which is set to NULL for current records thereby allowing current records to be found easily and instantly. Since most of my application deals with real time operations, the indexes provide fast performance. Once in a while someone needs to see historical records, and in that instance there's a performance hit, but from testing it's not too bad since most records don't have very many changes over their lifetime.
In cases where you may have a lot more expired records of various keys than current records it may pay to index over valid_to before any key fields.
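The index layout described above (key fields first, then valid_to, with NULL for current rows) can be sketched as follows. This is an illustration using Python's sqlite3 module rather than MySQL; the table, column, and index names are hypothetical, and in MySQL you would inspect the plan with EXPLAIN instead of EXPLAIN QUERY PLAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (sku TEXT, price REAL, valid_from TEXT, valid_to TEXT)"
)
# Key field first, then valid_to (NULL for the current version).
conn.execute("CREATE INDEX idx_sku_valid_to ON product (sku, valid_to)")

rows = []
for i in range(1000):
    sku = f"sku{i}"
    rows.append((sku, 1.0, "2023-01-01", "2023-06-01"))  # expired version
    rows.append((sku, 2.0, "2023-06-01", None))          # current version
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", rows)

query = "SELECT price FROM product WHERE sku = ? AND valid_to IS NULL"
current = conn.execute(query, ("sku42",)).fetchall()

# Ask the planner how it would run the query; the last column of each
# row is a human-readable description of that step.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query, ("sku42",))]
```

If the plan names the index, the lookup for the current version does not have to scan the expired rows of other keys.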

Related

Mysql what if too much data in a table

Data in one table is growing every day, and that might degrade performance. I was thinking of creating a trigger that periodically moves table A into A1 and creates a new table A, so that inserts and updates on table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, if I insert or update 1000 rows per second in table A, how will performance be after 3 years?)
We are designing software for a factory. There are production lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (1000 rows per second was an exaggeration.)

First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it all together (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
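To make the monthly-partition suggestion concrete, here is a small sketch that only generates the MySQL PARTITION BY RANGE(TO_DAYS(..)) clause as text; it does not talk to a server. The column name created_at, the table name pcb_record, the partition naming scheme, and the catch-all pmax partition are all assumptions for illustration.

```python
from datetime import date

def month_starts(first, count):
    """Yield `count` consecutive first-of-month dates starting at `first`."""
    year, month = first.year, first.month
    for _ in range(count):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def partition_clause(column, first_month, months):
    """Build a PARTITION BY RANGE clause with one partition per month."""
    starts = list(month_starts(first_month, months + 1))
    parts = [
        # Each partition holds rows strictly before the next month's first day.
        f"  PARTITION p{lo:%Y%m} VALUES LESS THAN (TO_DAYS('{hi:%Y-%m-%d}'))"
        for lo, hi in zip(starts, starts[1:])
    ]
    parts.append("  PARTITION pmax VALUES LESS THAN MAXVALUE")
    return f"PARTITION BY RANGE (TO_DAYS({column})) (\n" + ",\n".join(parts) + "\n)"

ddl = partition_clause("created_at", date(2024, 11, 1), 3)
print(ddl)

# The monthly purge is then one statement per expired partition, e.g.:
purge = "ALTER TABLE pcb_record DROP PARTITION p202411"
```

The generated clause would be appended to a CREATE TABLE statement. Keep in mind that MySQL requires the partitioning column to be part of every unique key on the table, including the primary key.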

Very large amounts of data may degrade server performance, so here is a way to handle it:
1) Create another table to store archive (old) data using the ARCHIVE storage engine ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html ).
2) Create a MySQL scheduled job (event) to move older records to the archive table. Schedule it for a time slot when the server is most idle.
3) After moving older records to the archive table, rebuild the indexes on the original table.
This should serve the purpose of performance.

It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index you define does add overhead, especially at row insert time. Assuming there are more reads than inserts, this is a worthwhile trade because most queries complete quickly thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.

It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

In MySQL, is it worthwhile creating more than one multi-column index on the same set of columns?

I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date       | time                    | instrument | price   | quantity
-----------|-------------------------|------------|---------|---------
2017-09-08 | 2017-09-08 13:16:30.919 | 12899586   | 54.15   | 8000
2017-09-08 | 2017-09-08 13:16:30.919 | 13793026   | 1177.75 | 750
2017-09-08 | 2017-09-08 13:16:30.919 | 1346049    | 1690.8  | 1
2017-09-08 | 2017-09-08 13:16:30.919 | 261889     | 110.85  | 50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry much about the additional database write time caused by updating another index?
I get data every second, and take about 100 ms to process it and store in a database. As long as I continue to take less than 1 sec I am fine.

To get the most efficient query you need to query on a clustered index. According to the documentation, this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument

A couple of recommendations:
There is no need to store date and time separately if time corresponds to time of the same date. You can instead have one datetime column and store timestamps in it
You can then have one index on datetime and instrument columns, that will make the queries run faster
With so many inserts and a fixed format of SELECT query (i.e. always by date first, followed by instrument), I would suggest looking into wide-column databases (like Cassandra). You may get faster writes and reads for such a structure

First, your use case sounds like two indexes would be useful (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.

Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
                                            ^^^
Optimal index for such is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: This pattern works for DATE, DATETIME, DATETIME(3), etc, without messing up with the midnight at the end of the range.
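The half-open range pattern above, combined with an index that puts instrument first and time last, can be sketched like this. It runs against SQLite via Python's sqlite3 module as a stand-in for MySQL; the ISO-8601 text timestamps stand in for DATETIME(3) (they sort lexicographically in chronological order), and the index name and sample rows are assumptions.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade (time TEXT, instrument INTEGER, price REAL, quantity INTEGER)"
)
# Instrument first, time last, as the answer recommends.
conn.execute("CREATE INDEX inst_time ON trade (instrument, time)")
conn.executemany(
    "INSERT INTO trade VALUES (?, ?, ?, ?)",
    [
        ("2017-09-07 23:59:59.999", 261889, 110.80, 10),
        ("2017-09-08 00:00:00.000", 261889, 110.85, 50),
        ("2017-09-08 13:16:30.919", 261889, 110.90, 25),
        ("2017-09-08 13:16:30.919", 1346049, 1690.80, 1),
        ("2017-09-09 00:00:00.000", 261889, 111.00, 5),
    ],
)

def trades_for_day(instrument, day):
    """Rows for one instrument with day <= time < day + 1 (half-open range)."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    return conn.execute(
        "SELECT time, price FROM trade "
        "WHERE instrument = ? AND time >= ? AND time < ? ORDER BY time",
        (instrument, start, end),
    ).fetchall()

rows = trades_for_day(261889, date(2017, 9, 8))
```

The row stamped exactly at midnight on 2017-09-08 is included and the one at midnight on 2017-09-09 is not, which is the boundary behaviour the answer is pointing at.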
Data volume?
150M rows? 10 new rows per second? That means you have about 5 years' data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned.
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?

SPACE IS CHEAP. Indexes take time creating/inserting (once), but shave time retrieving (Many many times)
My experience is to create as many indexes with all the relevant fields in all orders. This way, Mysql can choose the best index for your query.
So if you have 3 relevant fields
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.

Creating a table with the name being a variable date?

I wanted to create a table with the name of the table being a date. When I gather stock data for that day, I wanted to store it like this:
$date = date('Y-m-d');
$mysqli->query(
"CREATE TABLE IF NOT EXISTS `$date`(ID INT Primary Key)"
);
That way I will have a database like:
2013-5-1: AAPL | 400 | 400K
          MSFT | 30  | 1M
          GOOG | 700 | 2M
2013-5-2: ...
I think it would be easier to store information like this, but I see a similar question to this was closed.
How to add date to MySQL table name?
"Generating more and more tables is exactly the opposite of "keeping
the database clean". A clean database is one with a sensible,
normalized, fixed schema which you can run queries against."
If this is not the right way to do it, could someone suggest what would be? Many people commenting on that question stated that this was not a "clean" solution.

Do not split your data into several tables. This will become a maintenance nightmare, even though it might seem sensible to do so at first.
I suggest you create a date column that holds the information you currently want to put into the table name. Databases are pretty clever in storing dates efficiently. Just make sure to use the right datatype, not a string. By adding an index to that column you will also not get a performance penalty when querying.
What you gain is full flexibility in querying. There will be virtually no limits to the data you can extract from a table like this. You can join with other tables based on date ranges etc. This will not be possible (or at least much more complicated and cumbersome) when you split the data by date into tables. For example, it will not even be easy to just get the average of some value over a week, month or year.
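As a rough illustration of the single-table design with a date column, the sketch below stores all days in one table and computes an average across several of them in one query. It uses Python's sqlite3 module in place of MySQL; the table name quote, its columns, and the prices after the first day are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE quote (
        trade_date TEXT NOT NULL,   -- DATE in MySQL
        symbol     TEXT NOT NULL,
        price      REAL NOT NULL,
        volume     INTEGER NOT NULL,
        PRIMARY KEY (trade_date, symbol)
    )
    """
)
conn.executemany(
    "INSERT INTO quote VALUES (?, ?, ?, ?)",
    [
        ("2013-05-01", "AAPL", 400.0, 400_000),
        ("2013-05-01", "GOOG", 700.0, 2_000_000),
        ("2013-05-02", "AAPL", 410.0, 350_000),
        ("2013-05-03", "AAPL", 420.0, 500_000),
    ],
)

# One query spans several days; with a table per day this would need a
# UNION over dynamically named tables.
avg_price = conn.execute(
    "SELECT AVG(price) FROM quote "
    "WHERE symbol = ? AND trade_date BETWEEN ? AND ?",
    ("AAPL", "2013-05-01", "2013-05-07"),
).fetchone()[0]
```

Because the primary key starts with trade_date, lookups for a single day are already indexed; a separate index starting with symbol would be the natural addition if most queries filter by symbol first.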
If, depending on the real amount of data you collect, the data grows dramatically at some point in the future (to more than several million rows, I would estimate), you can have a look at the partitioning features MySQL offers out of the box. However, I would not advise using them immediately unless you already have a clear-cut growth model for the data.
In my experience there is very seldom a real need for this technique. I have worked with tables in the hundreds of gigabytes range, with millions of rows. It is all a matter of good indexing and carefully crafted queries when the data gets huge.

...still not getting results trying to optimize mysql innodb table for fast count

I posted this question here a while ago. I tried out the suggestions and came to the conclusion that I must be doing something fundamentally wrong.
What I basically want to do is this:
I have a table containing 83 million time/price pairs. As the index I'm using a millisecond-accurate Unix timestamp; the price ranges between 1.18775 and 1.60400 (decimal with 5 digits after the point).
I have a client that needs the price densities for a given time interval, meaning I want to take a specified interval of time and count how many times each distinct price appears in that interval.
How would you do this? How would you design/index the table? Right now I'm building a temporary subtable containing only the data for the given interval and then doing the counts on the prices. Is there a better way? My general DB settings are already tuned and perform well. Thanks for any hints! I will provide any additional information needed as fast as I can!

Given that you have a large amount of data and it is growing very rapidly, I'd be inclined to add a second table with these columns:
price (part of the primary key)
time (some block; also part of the primary key)
count
Do an "INSERT ... ON DUPLICATE KEY UPDATE count = count + 1" sort of thing. Group the time field by some predetermined interval (which depends on the sorts of queries you get: ms/sec/hour/whatever). This way you:
don't have to mess with temp tables (with a table of this size they will be written to disk, which is slow even on an SSD)
don't have to touch the initial table every time you want to run your query (you might run into locking issues)
You will have to average out your data a bit, but the granularity can be chosen in advance to cause as few issues as possible.
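The counting table described above might look something like this sketch. It uses Python's sqlite3 module, where the upsert is spelled ON CONFLICT ... DO UPDATE (SQLite 3.24 or newer) instead of MySQL's ON DUPLICATE KEY UPDATE. The table name price_density, the one-minute block size, and storing the price as text are assumptions made for the example.

```python
import sqlite3

BUCKET_MS = 60_000  # assumed granularity: one-minute blocks

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_density (
        price       TEXT    NOT NULL,   -- a DECIMAL column in MySQL
        time_bucket INTEGER NOT NULL,   -- start of the block, in ms
        cnt         INTEGER NOT NULL,
        PRIMARY KEY (price, time_bucket)
    )
    """
)

def record_tick(ts_ms, price):
    """Increment the counter for this price in the tick's time block."""
    bucket = ts_ms - ts_ms % BUCKET_MS
    # MySQL spelling: INSERT ... ON DUPLICATE KEY UPDATE cnt = cnt + 1
    conn.execute(
        "INSERT INTO price_density (price, time_bucket, cnt) VALUES (?, ?, 1) "
        "ON CONFLICT (price, time_bucket) DO UPDATE SET cnt = cnt + 1",
        (price, bucket),
    )

def density(start_ms, end_ms):
    """Count per price over blocks starting in [start_ms, end_ms)."""
    return dict(
        conn.execute(
            "SELECT price, SUM(cnt) FROM price_density "
            "WHERE time_bucket >= ? AND time_bucket < ? GROUP BY price",
            (start_ms, end_ms),
        )
    )

for ts, price in [(1_000, "1.20000"), (2_000, "1.20000"),
                  (61_000, "1.20000"), (62_000, "1.30000")]:
    record_tick(ts, price)

result = density(0, 120_000)
```

A density query then reads at most one row per price per block instead of every raw tick, at the cost of only being exact for intervals aligned to block boundaries.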

Move inactive rows to another table?

I have a table where when a row is created, it will be active for 24 hours with some writes and lots of reads. Then it becomes inactive after 24 hours and will have no more writes and only some reads, if any.
Is it better to keep these rows in the table or move them when they become inactive (or via batch jobs) to a separate table? Thinking in terms of performance.

This depends largely on how big your table will get, but if it grows forever, and has a significant number of rows per day, then there is a good chance that moving old data to another table would be a good idea. There are a few different ways you could accomplish this, and which is best depends on your application and data access patterns.
1) Essentially as you said: when a row becomes "old", INSERT it into the archive table and DELETE it from the current table.
2) Create a new table every day (or perhaps every week, or every month, depending on how big your dataset is), and never worry about moving old rows. You'll just have to query old tables when accessing old data, but for the current day, you only ever access the current table.
3) Have a "today" table and an "all time" table. Duplicate the "today" rows in both tables, keeping them in sync with triggers or other mechanisms. When a row becomes old, simply delete it from the "today" table, leaving the "all time" row intact.
One advantage to #2, that may not be immediately obvious, is that I believe MySQL indexes can be optimized for read-only tables. So by having old tables that are never written to, you can take advantage of this extra optimization.
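The first option (copy old rows to an archive table, then delete them) can be sketched as below. It is written against SQLite through Python's sqlite3 module rather than MySQL, and the table names item and item_archive, the created_at column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item         (id INTEGER PRIMARY KEY, created_at TEXT, data TEXT);
    CREATE TABLE item_archive (id INTEGER PRIMARY KEY, created_at TEXT, data TEXT);
    INSERT INTO item VALUES (1, '2024-01-01 08:00:00', 'old');
    INSERT INTO item VALUES (2, '2024-01-02 09:30:00', 'recent');
    """
)

def archive_older_than(cutoff):
    """Move rows created before `cutoff` into the archive table."""
    # `with conn` commits on success and rolls back on error, so a row is
    # never deleted without having been copied first.
    with conn:
        conn.execute(
            "INSERT INTO item_archive SELECT * FROM item WHERE created_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM item WHERE created_at < ?", (cutoff,)
        ).rowcount
    return moved

moved = archive_older_than("2024-01-02 00:00:00")
active_ids = [r[0] for r in conn.execute("SELECT id FROM item")]
archived_ids = [r[0] for r in conn.execute("SELECT id FROM item_archive")]
```

Running both statements with the same cutoff inside one transaction is the important part; in MySQL the same shape works with INSERT ... SELECT followed by DELETE on a transactional engine such as InnoDB.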

Generally moving rows between tables in proper RDBMS should not be necessary.
I'm not familiar with mysql specifics, but you should do fine with the following:
Make sure your timestamp column is indexed
In addition, you can use an active BOOLEAN DEFAULT TRUE column
Run a batch job every day to mark rows older than 24 hours inactive
Use a partial index on the timestamp column so only rows marked active are indexed (PostgreSQL supports this; MySQL does not, so there a composite index on (active, timestamp) is the closest equivalent)
Remember to include both timestamp and active = TRUE in your WHERE conditions to hit the indexes. Use EXPLAIN a lot.

That all depends on the balance between ease of programming, and performance. Performance wise, yes it will definitely be faster. But whether the speed increase is worth the effort is hard to say.
I've worked on systems that run perfectly fine with millions of rows. However, if the data is ever growing it does eventually become a problem.
I've worked on a database storing transaction logging for automated equipment. It generates hundreds of thousands of events per day. After a year, the queries just wouldn't run at acceptable speeds any more. We now keep the last month's worth of logs in the main table (millions of rows still), and move older data to archive tables.
None of the application's functionality ever looks in the archive table (if you do a query of the transaction log, it will return no results). It is only really kept for emergency use, and is just queried with any standalone database query tool. Because the archive has well over a hundred million rows, and the nature of this emergency use is generally unplannable (and therefore mostly un-indexed) queries, they can take a long time to run.

There is another solution: keep a second table containing only the ids of the active records (tblactiverecords). When the number of active records is really small, you can just do an inner join to get the active records. This should take very little time because primary keys are indexed by default in MySQL. As your rows become inactive, you delete them from the tblactiverecords table.
create table tblrecords (id int primary key, data text);
Then,
create table tblactiverecords (tblrecords_id int primary key);
you can do
select data from tblrecords join tblactiverecords on tblrecords.id = tblactiverecords.tblrecords_id;
to get all data that are active.
