MySQL best practice for archiving data

I have a 120GB database with one particularly heavy table of 80GB (storing data going back more than 10 years).
I am thinking of moving the old data into an archive, but wonder which is better:
to move it to a new table in the same database
to move it to a new table in a new archive database
What would be the result from a performance point of view?
1/ If I reduce the table to only 8GB and move 72GB into another table in the same database, will the database run faster (we won't hit the archive table with read/write operations, and reads/writes will be done on a much lighter table)?
2/ Will keeping 72GB of data in the archive table still slow down the database engine?
3/ Would having the 72GB of data in another archive database have any benefit versus keeping the 72GB in the archive table of the master database?
Thanks for your answers,
Edouard

The size of a table may or may not impact the performance of queries against that table. It depends on the query, innodb_buffer_pool_size and RAM size. Let's see some typical queries.
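As a rough check of where you stand (a minimal sketch; these are standard InnoDB variables, but the 12GB value is purely illustrative and the online resize only works on MySQL 5.7+):
-- How much RAM InnoDB may use for caching data and indexes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- How much of the buffer pool is actually in use
SHOW STATUS LIKE 'Innodb_buffer_pool_pages%';
-- On MySQL 5.7+ the setting can be changed without a restart, e.g. to 12GB:
-- SET GLOBAL innodb_buffer_pool_size = 12 * 1024 * 1024 * 1024;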
The existence of a big table that is not being used has no impact on queries against other tables.
It may or may not be wise to PARTITION BY RANGE(TO_DAYS(...)) and have monthly or yearly partitions. The main advantage comes when you get around to purging old data, but you don't seem to need that.
If you do split into 72 + 8, I recommend copying the 8 from the 80 into a new table, then using RENAME TABLE to juggle the table names, as sketched below.
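A minimal sketch of that copy-and-rename juggling, assuming a hypothetical big_table with a created_at column that marks row age (names and the cut-off date are only illustrative):
-- Build an empty table with the same structure as the big one
CREATE TABLE big_table_new LIKE big_table;
-- Copy only the recent rows (the ~8GB still being read and written)
INSERT INTO big_table_new
  SELECT * FROM big_table WHERE created_at >= '2015-01-01';
-- Swap the names atomically; the old 80GB table becomes the archive
-- (pause or re-apply writes that happen during the copy)
RENAME TABLE big_table TO big_table_archive,
             big_table_new TO big_table;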
Two TABLEs in one DATABASE is essentially the same as having the TABLEs in different DATABASEs.
I'll update this Answer when you have provided more details.


AWS Aurora MYSQL how to deal with growing table

I have a MySQL table that is growing quite fast, and I was wondering what the best approach would be for archiving data that is no longer needed going forward.
The table has data that is 2 years old, but we only need the data for last year onwards.
At the moment, the table has about 4 million rows and is 2.2GB in size.
DB specs:
Engine version: 5.7.mysql_aurora.2.07.2
Instance class: db.r4.xlarge
vCPU: 4
RAM: 30.5 GB
Would anyone have any input in that regard?
Thank you
If the table were already partitioned by, say, month, archiving would be relatively efficient.
In the absence of that prep work, I recommend:
PARTITION BY RANGE(..)
Create a new table that is partitioned; cf Partition. A sketch follows this list.
Copy the data since a year ago into that table.
Drop the current table
Work on creating a regular monthly process involving "transportable tablespaces". Or, if you don't need to keep the old data, then plan on just DROP PARTITION (and add a new partition). (See link above.)
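A minimal sketch of such a partitioned table, with hypothetical names, columns, and dates (adapt the ranges and the copy condition to your own data):
CREATE TABLE mytable_new (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  payload VARCHAR(255),
  -- the partitioning column must be part of every unique key
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p2020_05 VALUES LESS THAN (TO_DAYS('2020-06-01')),
  PARTITION p2020_06 VALUES LESS THAN (TO_DAYS('2020-07-01'))
  -- add the next month's partition before that month starts
);
-- Copy only the rows you still need (the last year)
INSERT INTO mytable_new
  SELECT * FROM mytable WHERE created_at >= CURDATE() - INTERVAL 1 YEAR;
-- Then swap the tables
RENAME TABLE mytable TO mytable_old, mytable_new TO mytable;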
Big DELETE
If, instead, you choose to do something that involves DELETEing millions of rows, I strongly suggest chunking the operation: http://mysql.rjweb.org/doc.php/deletebig
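For illustration only (hypothetical table and column names; the 10,000-row batch size is just a starting point), a chunked delete looks roughly like this, repeated until no more rows are affected:
-- Each small batch locks rows only briefly and keeps undo/binlog volume low
DELETE FROM mytable
WHERE created_at < CURDATE() - INTERVAL 1 YEAR
LIMIT 10000;
-- re-run (e.g. from a cron job or a stored-procedure loop) until ROW_COUNT() = 0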
The above does not say where you will send the data you have removed from this main table. What is your plan for that?

How to increase the performance of database schema creation?

For our testing environment, I need to set up and tear down a database multiple times (each test should run independently of any other).
The process is the following:
Create database schema and insert necessary data
Run test 1
Remove all tables in database
Create database schema and insert necessary data
Run test 2
Remove all tables in database
...
The schema and data are the same for each test in the test case.
Basically, this works. The big problem is that creating and clearing the database takes a lot of time. Is there a way to improve MySQL's performance for creating tables and inserting data? Or can you think of a different process for the tests?
Thank you for your help!
Optimize the logical design
The logical level is about the structure of the queries and tables themselves. Try to optimize this first. The goal is to access as little data as possible at the logical level.
Have the most efficient SQL queries
Design a logical schema that supports the application's needs (e.g. types of the columns, etc.)
Design trade-offs to support some use cases better than others
Relational constraints
Normalization
Optimize the physical design
The physical level deals with non-logical considerations, such as the types of indexes, table parameters, etc. The goal is to optimize the I/O, which is always the bottleneck. Tune each table to fit its needs: a small table can be kept permanently in the DBMS cache, a table with a low write rate can use different settings than a table with a high update rate so that it takes less disk space, and so on. Depending on the queries, different indexes can be used, data can be denormalized transparently with materialized views, etc.
Table parameters (allocation size, etc.)
Indexes (combined, types, etc.)
System-wide parameters (cache size, etc.)
Partitioning
Denormalization
Try first to improve the logical design, then the physical design. (The boundary between the two is vague, however, so we can argue about my categorization.)
Optimize the maintenance
A database must be operated correctly to stay as efficient as possible. This includes a few maintenance tasks that can have an impact on performance, e.g.
Keep statistics up to date
Re-sequence critical tables periodically
Disk maintenance
All the system stuff to have a server that rocks
Source: How to increase the performance of a Database?
I suggest writing all the needed operations into a script (init_db) using shell, Perl, or Python.
For the first run, you can create, insert, and delete manually, then dump both the schema and the data.
You can use bulk inserts, and DROP TABLE for deleting data, to improve the overall performance.
Hope this helps.
Instead of DROP TABLE + CREATE TABLE, just do TRUNCATE TABLE. This may, or may not, be faster; give it a try.
If you are INSERTing multiple rows each time, then either batch them (all rows in one INSERT), or use LOAD DATA. Either of these is much faster than row-by-row INSERTs.
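For example (hypothetical table, columns, and file name; LOAD DATA LOCAL requires local_infile to be enabled):
-- One multi-row INSERT instead of many single-row statements
INSERT INTO fixture_rows (id, name, price) VALUES
  (1, 'alpha',  9.99),
  (2, 'beta',  19.99),
  (3, 'gamma',  4.50);
-- Or bulk-load a CSV file
LOAD DATA LOCAL INFILE '/tmp/fixture_rows.csv'
INTO TABLE fixture_rows
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;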
Also fast... If you have the initial data in another table (which you could keep permanently), then do
CREATE TABLE test SELECT * FROM perm_table;
... (run tests using `test`)
DROP TABLE test;

Only Mysql OR mysql+sqlite OR mysql+own solution

Currently I am building a quite big web system and I need a strong SQL database solution. I chose MySQL over Postgres because some of the tasks need to be read-only (MyISAM engine) and others are massive writes (InnoDB).
I have a question about this read-only feature. It has to be extremely fast. The user must get an answer in well under one second.
Let's say we have one well-indexed table named "object" with no more than 10 million rows, and another one named "element" with around 150 million rows.
We also have a table named "element_object" containing information connecting rows from table "element" with table "object" (hundreds of millions of rows).
So we're going to partition tables "element" and "element_object" and have 8192 tables "element_hash_n{0..8191}a" and 24576 tables "element_object_hash_n{0..8191}_m{0..2}".
Answering a user's question would be a 2-step search:
Find the id of the element in the "element_hash_n" tables
Do the main SQL SELECT on table "object" and join with table "element_object_hash_n_m" to filter the result by the ID found in the first step
I wonder about the first step:
What would be better:
store (all) over 32k tables in MySQL
create one SQLite database and store the 8192 first-step tables there
create 8192 different SQLite files (databases)
create 8192 files in the file system and build our own binary solution to find the ID.
I'm sorry for my English. It's not my native language.
I think you are making way too many partitions. If you have more than 32,000 partitions you have a tremendous management overhead. Given the name element_hash_*, it seems as if you want to hash your elements and partition them this way. But a hash will (most likely) give you an even distribution of the data over all partitions, and I can't see how that should improve performance. If your data is accessed across all those partitions, you gain nothing by having partitions the size of your memory: every query will need to load data from another partition.
We used partitions on a transaction system where more than 90% of the queries used the current day as a criterion. In such a case, partitioning by day worked very well. But we only had 8 partitions and then moved the data off to another database for long-term storage.
My advice: try to find out which data will be needed that fast and try to group it together. You will also need to run your own performance tests. If it is so important to deliver data that fast, there should be enough management support to build a decent test environment.
Maybe your test results will show that you simply can't deliver the data fast enough with a relational database system. If so, you should look at NoSQL (as in Not only SQL) solutions.
What technology are you building your web system in? You should test this part as well. A super-fast database will not help you much if you lose the time in a poorly performing web application.

Speed-up of readonly MyISAM table

We have a large MyISAM table that is used to archive old data. This archiving is performed every month, and except on these occasions data is never written to the table. Is there any way to "tell" MySQL that this table is read-only, so that MySQL might optimize the performance of reads from it? I've looked at the MEMORY storage engine, but the problem is that this table is so large that it would take up a large portion of the server's memory, which I don't want.
I hope my question is clear enough; I'm a novice when it comes to DB administration, so any input or suggestions are welcome.
Instead of un- and re-compressing the history table: if you want to access the history through a single table, you can use a MERGE table to combine the compressed read-only history tables.
Thus, assuming you have an active table and compressed history tables with the same table structure, you could use the following scheme:
The tables:
compressed_month_1
compressed_month_2
active_month
Create a merge table:
create table history_merge like active_month;
alter table history_merge
ENGINE=MRG_MyISAM
union (compressed_month_1,compressed_month_2);
After a month, compress the active_month table and rename it to compressed_month_3. Now the tables are:
compressed_month_1
compressed_month_2
compressed_month_3
active_month
and you can update the history table
alter table history_merge
union (compressed_month_1, compressed_month_2, compressed_month_3);
Yes, you can compress MyISAM tables.
Here is the doc from 5.0 : http://dev.mysql.com/doc/refman/5.0/en/myisampack.html
You could use myisampack to generate fast, compressed, read-only tables.
(Not really sure if that hurts performance if you have to return most of the rows; testing is advisable; there could be a trade-off between compression and disk reads).
I'd say: also certainly apply the usual:
Provide appropriate indexes (based on the most used queries)
Have a look at clustering the data (again if this is useful given the queries)

Database design for heavy timed data logging

I have an application where I receive 40,000 new rows each day. I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Currently, those rows are all stored in the same table => slow to update, hard to back up, etc.
What kind of scheme is used in such applications to allow long-term access to the data without the problems of overly large tables, with easy backups and fast reads/writes?
Is PostgreSQL better than MySQL for this purpose?
1 - 40000 rows / day is not that big
2 - Partition your data on the insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a datamart step (compute frequently requested metrics in intermediate tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant
We have log tables of 100-200 million rows now, and it is quite painful:
backups are impossible; they would require several days of downtime.
purging old data is becoming too painful; it usually ties down the database for several hours.
So far we've only seen these solutions:
Backup: set up a MySQL slave. Backing up the slave doesn't impact the main DB. (We haven't done this yet, as the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)
Purging old data: the only painless way we've found is to introduce a new integer column that identifies the current date, and partition the tables (requires MySQL 5.1) on that key, per day. Dropping old data is then just a matter of dropping a partition, which is fast; see the sketch just below.
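A rough sketch of that daily rotation, with made-up partition names and dates; the table is assumed to be RANGE-partitioned on an integer day column such as TO_DAYS(insert_date):
-- Discard the oldest day almost instantly
ALTER TABLE log DROP PARTITION p2010_01_01;
-- Add tomorrow's partition ahead of time (works when there is no MAXVALUE partition)
ALTER TABLE log ADD PARTITION
  (PARTITION p2010_02_01 VALUES LESS THAN (TO_DAYS('2010-02-02')));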
If, in addition, you need to run transactions on these tables continuously (as opposed to just loading data every now and then and mostly querying it), you probably need to look into InnoDB rather than the default MyISAM tables.
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
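A minimal sketch of such a rollup; the Sales/DailySales names come from the example above, while the column names and the nightly schedule are assumptions:
CREATE TABLE DailySales (
  sale_date    DATE NOT NULL PRIMARY KEY,
  total_amount DECIMAL(12,2) NOT NULL,
  order_count  INT NOT NULL
);
-- Nightly job: fold yesterday's detail rows into one summary row
INSERT INTO DailySales (sale_date, total_amount, order_count)
SELECT DATE(sold_at), SUM(amount), COUNT(*)
FROM Sales
WHERE sold_at >= CURDATE() - INTERVAL 1 DAY
  AND sold_at <  CURDATE()
GROUP BY DATE(sold_at);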
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
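In MySQL the bulk loader is LOAD DATA INFILE; a minimal sketch, assuming a tab-separated file whose columns match a hypothetical events table:
LOAD DATA INFILE '/var/data/events_2010_05_01.tsv'
INTO TABLE events
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';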
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert/update/delete from a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes specified on your log table, you should take a critical look at them and decide whether they are really necessary. If not, drop them.
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
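If both schemas live on the same MySQL server, the move itself can be plain SQL; a hedged sketch with purely illustrative database, table, and column names (in practice both statements should be run in batches rather than in one pass):
-- Copy "old" rows into the archive schema, then remove them from the live table
INSERT INTO archive_db.app_log
  SELECT * FROM live_db.app_log
  WHERE logged_at < CURDATE() - INTERVAL 1 YEAR;
DELETE FROM live_db.app_log
WHERE logged_at < CURDATE() - INTERVAL 1 YEAR;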
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/