Move inactive rows to another table? - mysql

I have a table in which each row is active for 24 hours after it is created, with some writes and lots of reads. After 24 hours it becomes inactive: no more writes and only a few reads, if any.
Is it better to keep these rows in the table or move them when they become inactive (or via batch jobs) to a separate table? Thinking in terms of performance.

This depends largely on how big your table will get, but if it grows forever, and has a significant number of rows per day, then there is a good chance that moving old data to another table would be a good idea. There are a few different ways you could accomplish this, and which is best depends on your application and data access patterns.
1) Essentially as you said: when a row becomes "old", INSERT it into the archive table and DELETE it from the current table.
2) Create a new table every day (or perhaps every week, or every month, depending on how big your dataset is), and never worry about moving old rows. You'll just have to query old tables when accessing old data, but for the current day, you only ever access the current table.
3) Have a "today" table and an "all time" table. Duplicate the "today" rows in both tables, keeping them in sync with triggers or other mechanisms. When a row becomes old, simply delete it from the "today" table, leaving the "all time" row intact.
One advantage to #2, that may not be immediately obvious, is that I believe MySQL indexes can be optimized for read-only tables. So by having old tables that are never written to, you can take advantage of this extra optimization.
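A minimal sketch of the INSERT-then-DELETE approach follows. The table names events and events_archive, the created_at column, and the 24-hour cutoff are assumptions made for illustration, and both tables are assumed to be InnoDB so that the two statements can share one transaction.

```sql
-- Hypothetical schema: events(id, created_at, ...); the archive copies its definition.
CREATE TABLE events_archive LIKE events;

-- Fix the cutoff once so that INSERT and DELETE operate on the same set of rows.
SET @cutoff = NOW() - INTERVAL 24 HOUR;

START TRANSACTION;
INSERT INTO events_archive
  SELECT * FROM events WHERE created_at < @cutoff;
DELETE FROM events WHERE created_at < @cutoff;
COMMIT;
```

On a large table it may be gentler to move rows in smaller batches rather than in one big transaction.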

Generally, moving rows between tables should not be necessary in a proper RDBMS.
I'm not familiar with MySQL specifics, but you should do fine with the following:
Make sure your timestamp column is indexed
In addition, you can add an active BOOLEAN DEFAULT TRUE column
Run a daily batch job to mark rows older than 24 hours inactive
Use a partial index for timestamp column so only rows marked active are indexed
Remember to include both the timestamp and active = TRUE in your WHERE conditions so the queries hit the indexes. Use EXPLAIN a lot.
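One caveat for MySQL: it does not support partial indexes (an index with a WHERE clause), so the closest equivalent is a composite index that leads with the active flag. The sketch below assumes a hypothetical items table; the table, column, and index names are illustrative only.

```sql
-- Hypothetical table; in MySQL, BOOLEAN is an alias for TINYINT(1).
CREATE TABLE items (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  active     BOOLEAN NOT NULL DEFAULT TRUE,
  payload    VARCHAR(255) NOT NULL,
  KEY idx_active_created (active, created_at)
);

-- Daily batch: mark rows older than 24 hours inactive.
UPDATE items
SET active = FALSE
WHERE active = TRUE
  AND created_at < NOW() - INTERVAL 24 HOUR;

-- Check that the read path can use idx_active_created.
EXPLAIN
SELECT id, payload
FROM items
WHERE active = TRUE
  AND created_at >= NOW() - INTERVAL 24 HOUR;
```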

That all depends on the balance between ease of programming and performance. Performance-wise, yes, it will definitely be faster. But whether the speed increase is worth the effort is hard to say.
I've worked on systems that run perfectly fine with millions of rows. However, if the data is ever-growing, it does eventually become a problem.
I've worked on a database storing transaction logging for automated equipment. It generates hundreds of thousands of events per day. After a year, the queries just wouldn't run at acceptable speeds any more. We now keep the last month's worth of logs in the main table (millions of rows still), and move older data to archive tables.
None of the application's functionality ever looks in the archive table (if you do a query of the transaction log, it will return no results). It is only really kept for emergency use, and is just queried with any standalone database query tool. Because the archive has well over a hundred million rows, and the nature of this emergency use is generally unplannable (and therefore mostly un-indexed) queries, they can take a long time to run.

There is another solution: keep a second table containing only the IDs of the active records (tblactiverecords). When the number of active records is small, you can do an inner join to get the active records. This should take very little time, because primary keys are indexed by default in MySQL. As rows become inactive, delete them from the tblactiverecords table.
create table tblrecords (id int primary key, data text);
Then,
create table tblactiverecords (tblrecords_id int primary key);
you can do
select data from tblrecords join tblactiverecords on tblrecords.id = tblactiverecords.tblrecords_id;
to get all data that are active.

Duplicating Data in Another Table for Performance Gain

I'm currently designing the database architecture for a product that I'm in the process of building. I'm simply drawing out everything in an Excel file before I begin creating everything in MySQL.
Currently, I have two different tables that are almost identical to one another.
TABLE A that contains the most recent values of each data point for each user.
TABLE B that contains daily records of each data point for each user.
My reasoning for creating TABLE A, instead of relying solely on TABLE B, is that the number of rows in TABLE B will grow every day by the number of customers I have. For instance, say I have 20,000 customers; TABLE B will then grow by 20,000 rows every single day. So by creating TABLE A, I'll only ever have to search through 20,000 records to find the most recent values of each data point for each user, since I'll be updating these values every day; whereas for TABLE B, I'd have to search through an ever-growing number of rows for the most recent insertion for each user.
Is this acceptable or good practice?
Or should I just forget about TABLE A to reduce "bloat" in my database?
This is not the right approach. You basically have two reasonable options:
Use indexes on the history table to access the most recent day's records.
Use table partitioning to store each day in a separate partition.
You can manage two tables, but that is a lot of trouble and there are built-in methods to handle this situation.
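As a sketch of the index-based option, assume TABLE B holds one row per user, data point, and day. The table and column names below are hypothetical, since the question does not show a schema.

```sql
-- Hypothetical layout for TABLE B.
CREATE TABLE data_point_history (
  user_id       INT UNSIGNED NOT NULL,
  data_point_id SMALLINT UNSIGNED NOT NULL,
  recorded_on   DATE NOT NULL,
  reading       DECIMAL(12,4) NOT NULL,
  PRIMARY KEY (user_id, data_point_id, recorded_on),
  KEY idx_recorded_on (recorded_on)
);

-- Latest reading of every data point for one user (user 42 is an example).
SELECT h.data_point_id, h.reading, h.recorded_on
FROM data_point_history AS h
JOIN (
  SELECT data_point_id, MAX(recorded_on) AS recorded_on
  FROM data_point_history
  WHERE user_id = 42
  GROUP BY data_point_id
) AS latest
  ON latest.data_point_id = h.data_point_id
 AND latest.recorded_on = h.recorded_on
WHERE h.user_id = 42;

-- All records for the most recent day in the table.
SELECT user_id, data_point_id, reading
FROM data_point_history
WHERE recorded_on = (SELECT MAX(recorded_on) FROM data_point_history);
```

With the primary key leading on user_id, the per-user lookup only touches that user's rows, however large the table grows.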
In situations where I need both "current" data and a "history", that is what I do -- One table with the current data and one with history. They are possibly indexed differently for the different usage, etc.
I would think through what is different between "history" and "current", then make the tables different not identical.
When a new record comes in (or 20K rows in your case), I will at least put it into Current. I may also write it to History, thereby keeping it complete (at the cost of a small redundancy). Or I may move the row(s) to History when the next row(s) come into Current.
I see no need for PARTITIONing unless I intend to purge 'old' data. In that case, I would use PARTITION BY RANGE(TO_DAYS(..)) and choose weekly/monthly/whatever such that the number of partitions does not exceed about 50. (If you pick 'daily', History will slow down after a few months, just because of the partitioning.)
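A sketch of what monthly range partitioning might look like, assuming a hypothetical history table with a recorded_on DATE column. The partition names and dates are illustrative. Note that MySQL requires the partitioning column to be part of every unique key, including the primary key.

```sql
CREATE TABLE history (
  user_id     INT UNSIGNED NOT NULL,
  recorded_on DATE NOT NULL,
  reading     DECIMAL(12,4) NOT NULL,
  PRIMARY KEY (user_id, recorded_on)
)
PARTITION BY RANGE (TO_DAYS(recorded_on)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION p2024_03 VALUES LESS THAN (TO_DAYS('2024-04-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Add the next month by splitting the catch-all partition.
ALTER TABLE history REORGANIZE PARTITION p_future INTO (
  PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Purge the oldest month; this drops the whole partition instead of deleting row by row.
ALTER TABLE history DROP PARTITION p2024_01;
```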
The 20K rows each day -- Are many of them unchanged since yesterday? That is probably not the proper way to do things. Please elaborate on what happens each day. You should avoid having duplicate rows in History (except for the date).

Mysql what if too much data in a table

Data in one table is increasing every day, which might lower performance. I was thinking of creating a trigger that moves table A into A1 and creates a new table A at regular intervals, so that inserts and updates on table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, when inserting or updating 1000 rows per second in table A, what is the performance like after 3 years?)
We are designing software for a factory. There are production lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (1000 rows per second was an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on:
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
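The summary-table idea mentioned above could be sketched as follows. The table pcb_hourly_summary and its columns are hypothetical; the real summaries would depend on the queries the application needs to answer.

```sql
-- Hypothetical summary table: one row per production line and hour.
CREATE TABLE pcb_hourly_summary (
  line_id     SMALLINT UNSIGNED NOT NULL,
  hour_start  DATETIME NOT NULL,
  board_count INT UNSIGNED NOT NULL,
  fail_count  INT UNSIGNED NOT NULL,
  PRIMARY KEY (line_id, hour_start)
);

-- Fold one incoming record into its hourly bucket as it arrives.
-- VALUES() in this position still works but is deprecated as of MySQL 8.0.20
-- in favour of a row alias (INSERT ... VALUES (...) AS new ...).
INSERT INTO pcb_hourly_summary (line_id, hour_start, board_count, fail_count)
VALUES (7, '2024-01-15 09:00:00', 1, 0)
ON DUPLICATE KEY UPDATE
  board_count = board_count + VALUES(board_count),
  fail_count  = fail_count  + VALUES(fail_count);
```

At roughly 60 records per second, an hourly bucket per line replaces on the order of two hundred thousand raw rows with a handful of summary rows.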
Very large amounts of data may degrade server performance, so here is a way to handle it:
1) Create another table to store archive (old) data using the ARCHIVE storage engine ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html ).
2) Create a MySQL job/scheduled event to move older records to the archive table. Schedule it for the time slot when the server is most idle.
3) After moving older records to the archive table, re-index the original table.
This should serve the purpose of improving performance.
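Steps 1 and 2 might be sketched like this. The source table a (with columns id, created_at, payload), the 30-day retention window, and the 03:00 schedule are all assumptions for illustration. DELIMITER is a mysql client command, and the event only fires if the event scheduler is enabled. The ARCHIVE engine is not transactional and supports neither DELETE nor UPDATE, so rows can only be appended to it.

```sql
-- Step 1: archive table (ARCHIVE supports no indexes other than on an AUTO_INCREMENT column).
CREATE TABLE a_archive (
  id         BIGINT UNSIGNED NOT NULL,
  created_at DATETIME NOT NULL,
  payload    VARCHAR(255) NOT NULL
) ENGINE=ARCHIVE;

-- Step 2: nightly event that moves rows older than 30 days.
DELIMITER //
CREATE EVENT move_old_rows
ON SCHEDULE EVERY 1 DAY
STARTS '2024-01-01 03:00:00'
DO
BEGIN
  -- Fix the cutoff once so both statements cover the same rows.
  DECLARE cutoff DATETIME DEFAULT NOW() - INTERVAL 30 DAY;
  INSERT INTO a_archive (id, created_at, payload)
    SELECT id, created_at, payload FROM a WHERE created_at < cutoff;
  DELETE FROM a WHERE created_at < cutoff;
END//
DELIMITER ;
```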
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index you define adds overhead, especially at row insert time. Assuming there are more reads than inserts, the trade-off is still an advantage, because most queries complete rapidly thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

MySQL: Partition-like function for a single set of data?

I have a table that has millions of records, and they utilize EFF_FROM and EFF_TO date fields to version the records.
99% of the time, when this table is queried by an application, it is only concerned with records that have an EFF_TO of 2099-12-31, or records that are active and not historical.
I copied just the active records to a test version of the table and the application's SELECT query went from 60 seconds to 3 seconds.
I don't necessarily want to partition every EFF_TO date. I don't want to add that overhead especially to processes that populate the table. I only want the optimization for querying records with 2099-12-31, and I want the performance to be instant.
Is there a straightforward way to do this? Or do I have to resort to creating an active table and a historical table?
"Partition-like function for a single set of data?"
This is something of an oxymoron; however, you are asking about partitioning into two sets of data, one where EFF_TO is in the future and one where it is in the past.
"have an EFF_TO of 2099-12-31"
Design fault: these should be null.
If they were null, the partitioning would be simple. As it stands, you will have to drop and recreate the partitions, which is a rather expensive operation (have a look at tools for doing online schema updates).
You could minimize the impact by creating multiple partitions defining the period around NOW, then adding an extra one onto the end and removing one from the beginning at regular intervals.
"application's SELECT query went from 60 seconds to 3 seconds."
There are lots of possible reasons for the performance improvement other than just the size of the table:
If it's doing a full table scan, this is a design fault in the application.
Your indexes may not be as up to date as they should be.
The logical structure of the indexes may be unbalanced and need optimizing.
The physical structure of the table and indexes may be fragmented and need optimizing.
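These possibilities can be checked with standard MySQL statements. The table name versioned_records is hypothetical; EFF_TO comes from the question.

```sql
-- A row with type = ALL in the output means the query does a full table scan.
EXPLAIN
SELECT *
FROM versioned_records
WHERE EFF_TO = '2099-12-31';

-- Refresh the index statistics the optimizer relies on.
ANALYZE TABLE versioned_records;

-- Rebuild the table and its indexes to reduce fragmentation
-- (this can take a long time on a table with millions of rows).
OPTIMIZE TABLE versioned_records;
```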

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason to not do this, but how about three fields in a thirty-field table? 10 in a 30 field? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many indexes, or indexes that are too large, the DB is going to have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every piece of data inserted/deleted/updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
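One rough way to see whether the indexes could fit in RAM is to compare the index sizes reported in information_schema with the InnoDB buffer pool size. This is only a sketch: the reported sizes are estimates, and the buffer pool also has to hold data pages.

```sql
-- Approximate data and index size per table in the current schema, largest indexes first.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
       ROUND(index_length / 1024 / 1024, 1) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
ORDER BY index_length DESC;

-- Memory InnoDB has available for caching data and index pages.
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
```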
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
You have to balance CRUD needs: with too many indexes, writing to tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexes take up additional space both on disk and in RAM, but they can also improve performance a lot. Unfortunately, once the memory limit is reached, the system falls back to disk and performance suffers. In practice, you shouldn't index a field that is not involved in any kind of data lookup, whether for inserting or for searching (WHERE clause), and you should index it otherwise. By default you have to index all fields. Fields to consider leaving unindexed are those used only in queries run by moderators, unless those queries need to be fast too.
It is not a good idea to index all the columns in a table. While this will make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed would involve putting the new record in that table and then putting each column's information in its own index table.
This answer is my personal opinion, based on my own mathematical reasoning.
The second question was where to draw the line. First, some arithmetic: suppose a table has N rows and L fields. If we index all the fields, we get L new index tables, each of which sorts the data of its indexed field in a meaningful way. At first glance, a table of size W becomes 2W (1 TB becomes 2 TB). If you have 100 big tables (I have worked on a project with around 1800 tables), you waste 100 times that space (100 TB), which is far from wise.
If we index every field in every table, we also have to think about index updates: a single update triggers an update of all the indexes, which costs about as much time as an unordered select of everything.
From this I conclude that, if you have to lose this time anyway, it is preferable to lose it in a select rather than in an update, because selecting on a field that is not indexed does not trigger another select on all the other unindexed fields.
What to index?
foreign-keys : is a must based on
primary key: I'm not sure about this one yet; maybe someone reading this can help.
other fields: the first natural answer is half of the remaining fields. Why? If you should index more, you are not far from the best answer, and if you should index less, you are not far either, because we know that having no indexes is bad and indexing everything is also bad.
From these 3 points I conclude that if we have L fields, K of which are keys, the limit should be somewhere near ((L-K)/2)+K, give or take L/10.
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP and its underlying database tables, we can create a single index table for all the fields that need indexing, holding only their addresses. So other SQL-based database systems could also use one table for all indexed fields.
Secondly, what is write performance? Say a company records 50 sales orders in one day, and assume the sales order header table VBAK has 30 fields, each 20 CHAR long.
I can write to the real table in seconds while the index table is updated in the background. Suppose a report is run at the same time and needs to search the index table. The database could be programmed to wait until the ongoing index write finishes (for example, 5 sales orders being recorded at once, taking maybe 5 seconds). The report then waits 5 seconds and runs for 5 seconds, 10 seconds in total.
Without the index, the report does not wait 5 seconds for the write, but it runs for maybe 40 seconds.
So what does write performance even mean? No one writes thousands of records at the same time, but they do read them.
And reading a second table means the fields are already sorted. If I select on 3 fields, I can find which sorted sets to search and then fetch the data. What RAM, what memory? It is just a copied index table holding one piece of data, the address, for each field.
I think this is one of the secrets software companies hide from customers so as not to wake them up; otherwise they would not need another, more expensive system in the future.

MySQL database efficiency question

I have a database efficiency question.
Here is some info about my table:
-table of about 500-1000 records
-records are added and deleted every day.
- usually have about the same amount being added and deleted every day (size of active records stays the same)
Now, my question is: when I delete records, should I (A) delete the record and move it to a new table?
Or should I (B) just have an "active" column and set it to 0 when the record is no longer active?
The reason I am hesitant to use B is that my site is based on the user being able to filter/sort this table of 500-1000 records on the fly (using Ajax), so I need it to be as fast as possible (I'm guessing a table with more records would be slower to filter). I am using MySQL InnoDB.
Any input would be great, Thanks
Andrew
~1000 records is a very small number.
If a record can be deleted and re-added later, maybe it makes sense to have an "active" indicator.
Realistically, this isn't a question about DB efficiency but about network latency and the amount of data you're sending over the wire. As far as MySQL goes, 1000 rows or 100k rows are going to be lightning-fast, so that's not a problem.
However, if you've got a substantial amount of data in those rows, and you're transmitting it all to the client through AJAX for filtering, the network latency is your bottleneck. If you're transmitting a handful of bytes (say 20) per row and your table stays around 1000 records in length, not a huge problem.
On the other hand, if your table grows (with inactive records) to, say, 20k rows, now you're transmitting 400k instead of 20k. Your users will notice. If the records are larger, the problem will be more severe as the table grows.
You should really do the filtering on the server side. Let MySQL spend 2ms filtering your table before you spend a full second or two sending it through Ajax.
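A sketch of server-side filtering, assuming a hypothetical listings table with active, price, and name columns. The real filter and sort columns would come from whatever the Ajax request sends.

```sql
-- Index matching the filter (active) and the sort/range column (price).
CREATE INDEX idx_listings_active_price ON listings (active, price);

-- Return only the rows and columns the page needs, already filtered and sorted.
SELECT id, name, price
FROM listings
WHERE active = 1
  AND price BETWEEN 10 AND 50
ORDER BY price
LIMIT 50;
```

Only the matching page of rows crosses the network, so the payload stays small even if inactive rows accumulate in the table.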
It depends on what you are filtering/sorting on and how the table is indexed.
A third, and not uncommon, option is a hybrid approach: inactivate records (B), optionally with a timestamp, and periodically archive them to a separate table (A), either en masse or based on the timestamp age.
Realistically, if your table is on the order of 1000 rows, it's probably not worth fussing too much over it (assuming the scalability of other factors is known).
If you need to keep the records for some future purpose, I would set an Inactive bit.
As long as you have a primary key on the table, performance should be excellent when SELECTing the records.
Also, if you do the filtering/sorting on the client-side then the records would only have to be retrieved once.