AWS Aurora MySQL: how to deal with a growing table

I have a MySQL table that is growing quite fast, and I was wondering what the best approach would be for archiving data that is no longer needed going forward.
The table has data that is 2 years old, but we only need the data for last year onwards.
At the moment, the table has about 4 million rows and is 2.2GB in size.
DB specs:
Engine version: 5.7.mysql_aurora.2.07.2
Instance class: db.r4.xlarge
vCPU: 4
RAM: 30.5 GB
Would anyone have any input in that regard?
Thank you

If the table were already partitioned by, say, month, archiving would be relatively efficient.
In the absence of that prep work, I recommend:
PARTITION BY RANGE(..)
Create a new table that is partitioned; cf Partition
Copy the data since a year ago into that table.
Drop the current table
Work on creating a regular monthly process involving "transportable tablespaces". Or, if you don't need to keep the old data, then plan on just DROP PARTITION (and add a new partition). (See link above; a sketch of this plan follows.)
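A rough sketch of that plan, assuming the table is called mytable and has a DATETIME column created_at (both names, the column list, and the partition boundaries are hypothetical; adjust them to your schema):
CREATE TABLE mytable_new (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_at DATETIME NOT NULL,
    -- ... the other columns from the original table ...
    PRIMARY KEY (id, created_at)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p202302 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    -- ... one partition per month ...
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);

-- Copy only the rows you want to keep (last year onwards):
INSERT INTO mytable_new
SELECT * FROM mytable
WHERE created_at >= NOW() - INTERVAL 1 YEAR;

-- Swap the names, keeping the old table around until you are confident:
RENAME TABLE mytable TO mytable_old, mytable_new TO mytable;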
Big DELETE
If, instead, you choose to do something that involves DELETEing millions of rows, I strongly suggest chunking the operation: http://mysql.rjweb.org/doc.php/deletebig
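As a minimal sketch of LIMIT-based chunking (table and column names are hypothetical; the linked article describes more refined, replication-friendly variants that walk the PRIMARY KEY):
-- Repeat until it affects 0 rows (check ROW_COUNT() after each run):
DELETE FROM mytable
WHERE created_at < NOW() - INTERVAL 1 YEAR
LIMIT 1000;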
The above does not say where you will send the data you have removed from this main table. What is your plan for that?

MySQL best practice for archiving data

I have a 120 GB database with one particularly heavy table of 80 GB (storing data for more than 10 years).
I am thinking of moving the old data into an archive, but wonder whether it is better:
to move it to a new table in the same database
to move it to a new table in a new archive database
What would be the impact from a performance point of view?
1/ If I reduce the table to only 8 GB and move 72 GB into another table in the same database, will the database run faster (we won't access the archive table with read/write operations, and reads/writes will be done on a lighter table)?
2/ Will keeping 72 GB of data in the archive table slow down the database engine anyway?
3/ Will having the 72 GB of data in another archive database have any benefit versus keeping the 72 GB in the archive table of the master database?
Thanks for your answers,
Edouard
The size of a table may or may not impact the performance of queries against that table. It depends on the query, innodb_buffer_pool_size and RAM size. Let's see some typical queries.
The existence of a big table that is not being used has no impact on queries against other tables.
It may or may not be wise to PARTITION BY RANGE (TO_DAYS(...)) and have monthly or yearly partitions. The main advantage comes when you get around to purging old data, but you don't seem to need that.
If you do split into 72 + 8, I recommend copying the 8 GB out of the 80 GB into a new table, then using RENAME TABLE to juggle the table names; a short sketch follows.
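A minimal sketch of that copy-and-rename, with hypothetical table names, a DATETIME column created_at, and an illustrative cutoff date:
CREATE TABLE heavy_table_new LIKE heavy_table;

INSERT INTO heavy_table_new
SELECT * FROM heavy_table
WHERE created_at >= '2023-01-01';   -- the 8 GB you want to keep

RENAME TABLE heavy_table TO heavy_table_archive, heavy_table_new TO heavy_table;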
Having two TABLEs in one DATABASE is essentially the same as having them in different DATABASEs.
I'll update this Answer when you have provided more details.

How to reduce the index size of a table in MySQL using the InnoDB engine?

I am facing a performance issue in MySQL due to the large index size on my table. The index size has grown to 6 GB and my instance is running with 32 GB of memory. The majority of rows in that table are not required after a few hours and can be removed selectively, but removing them is time-consuming and doesn't reduce the index size.
Please suggest a solution for managing this index.
You can OPTIMIZE your table to rebuild the indexes and get the space back if it is not freed even after deletion:
optimize table table_name;
But since your table is bulky, it will be locked during OPTIMIZE TABLE, and you are also facing the issue of how to remove old data when you don't need rows that are more than a few hours old. So you can do the following:
Step 1: During night hours, or when there is less traffic on your DB, first rename your main table and create a new table with the same name. Then insert the last few hours of data from the old table into the new table (a sketch follows below).
This way you remove the unwanted data, and the new table is also effectively optimized.
Step 2: To avoid this issue in the future, create a stored procedure that executes once per day during night hours and either deletes data up to the previous day (as per your requirement) from this table or moves it to a historical table.
Step 3: As your table now only ever holds a single day of data, you can execute an OPTIMIZE TABLE statement to rebuild it and reclaim space easily.
Note: a DELETE statement will not rebuild the indexes and will not free space on the server. For that you need to optimize the table, which can be done in various ways, e.g. with an ALTER TABLE or an OPTIMIZE TABLE statement.
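A minimal sketch of Step 1, with hypothetical table/column names (events, created_at) and an assumed retention window of 6 hours:
-- Writes to the original name will fail briefly between these two statements:
RENAME TABLE events TO events_old;
CREATE TABLE events LIKE events_old;

-- Keep only the recent rows:
INSERT INTO events
SELECT * FROM events_old
WHERE created_at >= NOW() - INTERVAL 6 HOUR;

-- After verifying, reclaim the space:
DROP TABLE events_old;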
If you can remove all the rows older than X hours, then PARTITIONing is the way to go: PARTITION BY RANGE on the hour, use DROP PARTITION to remove an old hour, and REORGANIZE PARTITION to create a new hour (a sketch follows). You should have X+2 partitions. More details.
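A rough sketch of that hourly maintenance, assuming a hypothetical table log_events partitioned by RANGE on TO_DAYS(created_at) * 24 + HOUR(created_at), with a catch-all p_future partition at the top; partition names and dates are illustrative:
-- Purge the oldest hour:
ALTER TABLE log_events DROP PARTITION p_2024_05_01_10h;

-- Carve the next hour out of the catch-all partition:
ALTER TABLE log_events REORGANIZE PARTITION p_future INTO (
    PARTITION p_2024_05_01_14h VALUES LESS THAN (TO_DAYS('2024-05-01') * 24 + 15),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);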
If the deletes are more complex, please provide more details; perhaps we can come up with another solution that deals with the question about index size. Please include SHOW CREATE TABLE.
Even if you cannot use partitions for purging, it may be useful to have partitions for OPTIMIZE. Do not use OPTIMIZE PARTITION; it optimizes the entire table. Instead, use REORGANIZE PARTITION if you see you need to shrink the index.
How big is the table?
How big is innodb_buffer_pool_size?
(6GB index does not seem that bad, especially since you have 32GB of RAM.)

Modify database files

I have a system that a client designed and the table was originally not supposed to get larger than 10 gigs (maybe 10 million rows) over a few years. Well, they've imported a lot more information than they were thinking and within a month, the table is now up to 208 gigs (900 million rows).
I have very little experience with MySQL and a lot more experience with Microsoft SQL. Is there anything in MySQL that would allow the client to have the database span multiple files so the queries that are run wouldn't have to use the entire table and index? There is a field on the table that could easily be split on, but I wasn't sure how to do this.
The main issue I'm trying to solve is a retrieval query from this table. Inserts aren't a big deal at all since they're all done by a back-end service. I have a test system where the table is about 2 gigs (6 million rows), and my query takes less than a second. When this same query is run on the production system, it takes 20 seconds. I have a feeling that the query itself is fine; it's just the size of the table that's causing the issue. There is an index on this table created specifically for this query, and according to EXPLAIN it is being used.
If you have any other suggestions/questions, please feel free to ask.
Use partitioning, and in particular the CREATE TABLE options that set DATA DIRECTORY and INDEX DIRECTORY.
With these options you can put partitions on separate drives if needed. Usually, though, it's enough to partition on a key that you can use in every query, usually time (see the sketch below).
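A minimal sketch of per-partition placement, with hypothetical table, column, and path names (for InnoDB, DATA DIRECTORY per partition requires innodb_file_per_table; INDEX DIRECTORY applies to MyISAM):
CREATE TABLE big_history (
    id BIGINT UNSIGNED NOT NULL,
    created_at DATETIME NOT NULL,
    -- ... other columns ...
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p_old VALUES LESS THAN (TO_DAYS('2020-01-01'))
        DATA DIRECTORY = '/mnt/archive_disk/mysql',
    PARTITION p_current VALUES LESS THAN MAXVALUE
);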
In addition to partitioning which has been mentioned you might also want to run the tuning-primer script to ensure your mysql configuration is optimal.

Table data handling - optimum usage

I have a table with 8 million records in MySQL.
I want to keep the last week of data and delete the rest; I can take a dump and recreate the table in another schema.
I am struggling to get the queries right, so please share your views and the best approach for doing this. What is the best way to delete so that it will not affect other tables in production?
Thanks.
MySQL offers a feature called partitioning. You can do a horizontal partition and split your table by rows. 8 million isn't that much; what is the insertion rate per week?
CREATE TABLE MyVeryLargeTable (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    my_date DATE NOT NULL,
    -- your other columns
    PRIMARY KEY (id, my_date)   -- the partitioning column must be part of every unique key
) PARTITION BY HASH (YEARWEEK(my_date)) PARTITIONS 4;
You can read more about it here: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Edit: This creates 4 partitions, so it will last for 4 weeks; therefore I suggest switching to partitions based on months or years. The partition limit is quite high, but this really depends on what the insertion rate per week/month/year looks like.
Edit 2
MySQL 5.0 comes with an Archive engine, which you should use for your archive table ( http://dev.mysql.com/tech-resources/articles/storage-engine.html ). Now, how do you get your data into the archive table? It seems you will have to write a cron job that runs at the beginning of every week, moving all records to the archive table and deleting them from the original one. You could write a stored procedure for this, but the cron job needs to run from the shell. Keep in mind that this could affect your data integrity in some way. What about upgrading to MySQL 5.1?
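A minimal sketch of what that weekly job might run, with a hypothetical archive table my_archive using the ARCHIVE engine (note that on non-transactional engines the two statements are not atomic):
-- Copy rows older than one week into the archive:
INSERT INTO my_archive
SELECT * FROM MyVeryLargeTable
WHERE my_date < CURDATE() - INTERVAL 7 DAY;

-- Then remove them from the live table:
DELETE FROM MyVeryLargeTable
WHERE my_date < CURDATE() - INTERVAL 7 DAY;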

Database design for heavy timed data logging

I have an application where I receive 40,000 rows of data each day. I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Currently, those rows are all stored in the same table, which makes it slow to update, hard to back up, etc.
What kind of schema is used in this kind of application to allow long-term access to the data without problems from overly large tables, with easy backups and fast reads/writes?
Is PostgreSQL better than MySQL for this purpose?
1 - 40,000 rows/day is not that big.
2 - Partition your data by insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a data mart step (compute frequently requested metrics in intermediate tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning); INSERT/UPDATE time was constant.
We have log tables of 100-200 million rows now, and it is quite painful.
Backup is impossible; it would require several days of downtime.
Purging old data is becoming too painful; it usually ties up the database for several hours.
So far we've only seen these solutions:
Backup: set up a MySQL slave. Backing up the slave doesn't impact the main DB. (We haven't done this yet, as the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)
Purging old data: the only painless way we've found is to introduce a new integer column that identifies the current date, and partition the tables (requires MySQL 5.1) on that key, per day. Dropping old data is then just a matter of dropping a partition, which is fast.
If, in addition, you need to run transactions continuously on these tables (as opposed to just loading data every now and then and mostly querying it), you probably need to look into InnoDB rather than the default MyISAM tables.
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
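As a hedged sketch of such a daily rollup, with hypothetical table and column names (sales, daily_sales_by_location, sold_at, amount):
CREATE TABLE daily_sales_by_location (
    sale_date DATE NOT NULL,
    location_id INT NOT NULL,
    order_count INT NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (sale_date, location_id)
);

-- Run once per day (e.g. from cron) to summarize yesterday's rows:
INSERT INTO daily_sales_by_location (sale_date, location_id, order_count, total_amount)
SELECT DATE(sold_at), location_id, COUNT(*), SUM(amount)
FROM sales
WHERE sold_at >= CURDATE() - INTERVAL 1 DAY
  AND sold_at < CURDATE()
GROUP BY DATE(sold_at), location_id;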
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
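As a sketch (the file path, delimiters, and table name are hypothetical), MySQL's bulk loader can pull such a pre-filtered subset into a data mart table quickly:
LOAD DATA INFILE '/var/data/sales_subset_2010.csv'
INTO TABLE mart_sales
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;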
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert into, update, or delete from a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes specified on your log table, you should take a critical look at them and decide whether they are indeed necessary. If not, drop them.
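For example (the table and index names are hypothetical):
-- See what indexes exist on the log table:
SHOW INDEX FROM my_log_table;

-- Drop one that no query actually uses:
ALTER TABLE my_log_table DROP INDEX idx_rarely_used;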
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/