I duplicated a large table like this:
CREATE TABLE newtable LIKE oldtable;
INSERT newtable SELECT * FROM oldtable;
I checked rows afterwards like this:
select count(*) from oldtable;
select count(*) from newtable;
to make sure they had the same # of rows
However in Navicat (whose counts I don't trust) the Data Length seemed way off so I ran a query on size like this:
SELECT
table_name AS `Table`,
round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`
FROM information_schema.TABLES
WHERE table_schema = "$MY_DB"
AND table_name = "#MY_TABLE"
And the duped table is indeed almost 1/2 the size. I know for now that is some sparse information but I am wondering now if either:
There was some error in the duplication (spot checking indicates no)
Duplicating in that way somehow optimize(d)(s) the data/indexes
On an InnoDB table, over time the index will accumulate something similar to hard disk fragmentation, where INSERT and DELETE operations leave space in the index file that is not made available for reuse by the operating system.
By duplicating the table via INSERT INTO...SELECT *..., the new table is essentially getting a cleaner index with less of the fragmentation the old one had. You can reclaim that space on the old table by running an OPTIMIZE TABLE statement.
Use OPTIMIZE TABLE in these cases, depending on the type of table:
After doing substantial insert, update, or delete operations on an InnoDB table that has its own .ibd file because it was created with the innodb_file_per_table option enabled. The table and indexes are reorganized, and disk space can be reclaimed for use by the operating system.
So in your environment:
OPTIMIZE TABLE oldtable
Afterward, check the index_length in information_schema.TABLES again and its size will likely be closer to that of the duplicated table.
Note: On a large table this may take a long time, so you should not attempt it when locking will be an issue. For InnoDB, MySQL may actually drop and recreate the table rather than optimize the existing index in place.
Addendum after some tests:
Even when duplicating a large table in my testing, though the new table's index_length was considerably smaller than the old one, it still had space that could be reclaimed with OPTIMIZE TABLE. After optimizing the newly duplicated table, (with no intervening inserts/deletions) both old and new had identical sizes.
Related
I have a database table which is around 700GB with 1 Billion rows, the data is approximately 500 GB and index is 200GB,
I am trying to delete all the data before 2021,
Roughly around 298,970,576 rows in 2021 and there are 708,337,583 rows remaining.
To delete this I am running a non-stop query in my python shell
DELETE FROM table_name WHERE id < 1762163840 LIMIT 1000000;
id -> 1762163840 represent data from 2021. Deleting 1Mil row taking almost 1200-1800sec.
Is there any way I can speed up this because the current way is running for more than 15 days and there is not much data delete so far and it's going to do more days.
I thought that if I make a table with just ids of all the records that I want to delete and then do an exact map like
DELETE FROM table_name WHERE id IN (SELECT id FROM _tmp_table_name);
Will that be fast? Is it going to be faster than first making a new table with all the records and then deleting it?
The database is setup on RDS and instance class is db.r3.large 2 vCPU and 15.25 GB RAM, only 4-5 connections running.
I would suggest recreating the data you want to keep -- if you have enough space:
create table keep_data as
select *
from table_name
where id >= 1762163840;
Then you can truncate the table and re-insert new data:
truncate table table_name;
insert into table_name
select *
from keep_data;
This will recreate the index.
The downside is that this will still take a while to re-insert the data (renaming keep_data would be faster). But it should be much faster than deleting the rows.
AND . . . this will give you the opportunity to partition the table so future deletes can be handled much faster. You should look into table partitioning if you have such a large table.
Multiple techniques for big deletes: http://mysql.rjweb.org/doc.php/deletebig
It points out that LIMIT 1000000 is unnecessarily big and causes more locking than might be desirable.
In the long run, PARTITIONing would be beneficial, it mentions that.
If you do Gordon's technique (rebuilding table with what you need), you lose access to the table for a long time; I provide an alternative that has essentially zero downtime.
id IN (SELECT...) can be terribly slow -- both because of the inefficiency of in-SELECT and due to the fact that DELETE will hang on to a huge number of rows for transactional integrity.
I run into a strange problem using MySQL 5.5. I wanted to gather statistics about table size. So, I made up the following query:
SELECT table_name AS name, data_length, index_length, table_rows, avg_row_length
FROM information_schema.TABLES
WHERE table_schema = "<MySchema>"
AND table_name in (<Table names I'm interested in>)
order by table_name;
However, I noticed something strange, when I ran this query several times in a few seconds. The data_length and index_length remain actually the same in all queries (or change a little bit, since there were some writes during my script execution made by customers).
However, table_rows gives every time a pretty different answer. For example, the first query gives for a table A about a 10000 rows, the second query says it's about 20000 rows, and so on. However, when I run the query like this:
select count(*) from TableA;
It gives me the same result over and over again. However, not the information schema, for some reason. What could be wrong with the database? Or maybe I just misunderstand the meaning of table_rows in information_schema?
From INFORMATION_SCHEMA.TABLES
The TABLE_ROWS column is NULL if the table is in the INFORMATION_SCHEMA database.
For InnoDB tables, the row count is only a rough estimate used in SQL
optimization. (This is also true if the InnoDB table is partitioned.)
To update this estimation you need to use ANALYZE TABLE (keep in mind that the accuracy dependents on innodb_stats_persistent_sample_page:
innodb_stats_persistent_sample_pages
The number of index pages to sample when estimating cardinality and other statistics for an indexed column, such as those calculated by ANALYZE TABLE. Increasing the value improves the accuracy of index statistics, which can improve the query execution plan, at the expense of increased I/O during the execution of ANALYZE TABLE for an InnoDB table
To get exact count you need to use COUNT(*).
FOR INNODB:
You should use information_schema.INNODB_SYS_TABLESTATS.NUM_ROWS for accurate table row count data, instead of information_schema.TABLES.TABLE_ROWS.
I.e to fetch a list of tables which contain rows:
SELECT name
FROM information_schema.innodb_sys_tablestats
WHERE name LIKE ("YOUR_DB_SCHEMA_NAME%")
AND num_rows > 0;
In my case, there is a requirement to retroactively write integration tests for a legacy system. Based on the current codebase, using proper PDO transaction rollback() is out of the question...
For a poor man's transaction rollback, I simply select all tables containing data and truncate before/after running tests. This allows cleanup of seeded data and any other dirty tables due to test inserted data.
FOR MyISAM:
you should use information_schema.TABLES.TABLE_ROWS as stated by lad2025
I searched Internet and Stack Overflow for my trouble, but couldn't find a good solution.
I have a table (MySql MyISAM) containing 300,000 rows (one column is blob field).
I must use:
DELETE FROM tablename WHERE id IN (1,4,7,88,568,.......)
There are nearly 30,000 id's in the IN syntax.
It takes nearly 1 hour. Also It does not make the .MYD file smaller although I delete 10% of it, so I run OPTIMIZE TABLE... command. It also lasts long...(I should use it, because disk space matters for me).
What's a way to improve performance when deleting the data as above and recover space? (Increasing buffer size? which one? or else?)
With IN, MySQL will scan all the rows in the table and match the record against the IN clause. The list of IN predicates will be sorted, and all 300,000 rows in the database will get a binary search against 30,000 ids.
If you do this with JOIN on a temporary table (no indexes on a temp table), assuming id is indexed, the database will do 30,000 binary lookups on a 300,000 record index.
So, 300,000 binary searches against 30,000 records, or 30,000 binary searches against 300,000 records... which is faster? The second one is faster, by far.
Also, delaying the index rebuilding with DELETE QUICK will result in much faster deletes. All records will simply be marked deleted, both in the data file and in the index, and the index will not be rebuilt.
Then, to recover space and rebuild the indexes at a later time, run OPTIMIZE TABLE.
The size of the list in your IN() statement may be the cause. You could add the IDs to a temporary table and join to do the deletes. Also, as you are using MyISAM you can use the DELETE QUICK option to avoid the index hit whilst deleting:
For MyISAM tables, if you use the QUICK keyword, the storage engine
does not merge index leaves during delete, which may speed up some
kinds of delete operations.
I think the best approach to make it faster is to create a new table and insert into it the rows which you dont want to delete and then drop the original table and then you can copy the content from the table to the main table.
Something like this:
INSERT INTO NewTable SELECT * FROM My_Table WHERE ... ;
Then you can use RENAME TABLE to rename the copy to the original name
RENAME TABLE My_Table TO My_Table_old, NewTable TO My_Table ;
And then finally drop the original table
DROP TABLE My_Table_old;
try this
create a table name temptable with a single column id
insert into table 1,4,7,88,568,......
use delete join something like
DELETE ab, b FROM originaltable AS a INNER JOIN temptable AS b ON a.id= b.id where b.id is null;
its just an idea . the query is not tested . you can check the syntax on google.
I have a DB schema composed of MYISAM tables, i am interested to delete old records from time to time from some of the tables.
I know that delete does not reclaim the memory space, but as i found in a description of DELETE command, inserts may reuse the space deleted
In MyISAM tables, deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions.
I am interested if LOAD DATA command also reuses the deleted space?
UPDATE
I am also interested how the index space reclaimed?
UPDATE 2012-12-03 23:11
some more info supplied based on the answer received from #RolandoMySQLDBA
after executing the following suggested query i got different results for different tables for which space need to be reused or reclaimed:
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable1';
> Dynamic
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable2';
> Fixed
UPDATE 2012-12-09 08:06
LOAD DATA do reuses previously deleted space (i have checked it by running a short script) if and only if the row format is fixed or (the row format is dynamic and there is a deleted row with exactly the same size).
it seems that if the row_format is dynamic, full look-up over the deleted list is made for each record , and if the exact row size is not found , the deleted record is not used, and the table memory usage will raise, additionally LOAD DATA will take much more time to import records.
I will except the answer given here , since it describes all the process perfectly.
For a MySQL table called mydb.mytable just run the following:
OPTIMIZE TABLE mydb.mytable;
You could also do this in stages:
CREATE TABLE mydb.mytable_new LIKE mydb.mytable;
ALTER TABLE mydb.mytable_new DISABLE KEYS;
INSERT INTO mydb.mytable_new SELECT * FROM mydb.mytable;
ALTER TABLE mydb.mytable_new ENABLE KEYS;
ALTER TABLE mydb.mytable RENAME mydb.mytable_old;
ALTER TABLE mydb.mytable_new RENAME mydb.mytable;
ALTER TABLE mydb.mytable_old;
ANALYZE TABLE mydb.mytable;
In either case, the table ends up with no fragmentation.
Give it a Try !!!
UPDATE 2012-12-03 12:50 EDT
If you are concerned whether or not rows are reused upon bulk INSERTs via LOAD DATA INFILE, please note the following:
When you created the MyISAM table, I assumed the default row format would be dynamic. You can check what it is with either
SHOW CREATE TABLE mydb.mytable\G
or
SELECT row_format FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable';
Since the row format of your table is Dynamic, the fragmented rows are of various sizes. The MyISAM storage engine would have keep checking for the row length of each deleted to see if the next set of data being insert will fit. If the incoming data cannot fit in any of the deleted rows, then the new row data is appended.
The presence of such rows can make myisamchk struggle.
This is why I recommended running OPTIMIZE TABLE. That way, data would be appended quicker.
UPDATE 2012-12-03 12:58 EDT
Here is something interesting you can also do: Try setting concurrent_insert to 2. That way, you are always appending to a MyISAM table without checking for gaps in the table. This will speed up INSERTs dramatically but leave all known gaps alone.
You could still defragment your table at your earliest convenience using OPTIMIZE TABLE.
UPDATE 2012-12-03 13:40 EDT
Why don't run the my second sugesstion
CREATE TABLE mydb.mytable_new LIKE mydb.mytable;
ALTER TABLE mydb.mytable_new DISABLE KEYS;
INSERT INTO mydb.mytable_new SELECT * FROM mydb.mytable;
ALTER TABLE mydb.mytable_new ENABLE KEYS;
ALTER TABLE mydb.mytable RENAME mydb.mytable_old;
ALTER TABLE mydb.mytable_new RENAME mydb.mytable;
ANALYZE TABLE mydb.mytable;
This will give you an idea
How long OPTIMIZE TABLE would take to run
How much smaller the .MYD and .MYI would be after running OPTIMIZE TABLE
After you run my second suggestion, you can compare them with
SELECT
A.mydsize,B.mydsize,A.mydsize - B.mydsize myd_diff,
A.midsize,B.myisize,A.myisize - B.myisize myi_diff
FROM
(
SELECT data_length mydsize,index_length myisize
FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable'
) A,
(
SELECT data_length mydsize,index_length myisize
FROM information_schema.tables
WHERE table_schema='mydb' AND table_name='mytable_new'
) B;
UPDATE 2012-12-03 16:42 EDT
Any table whose ROW_FORMAT is set to fixed has the luxury of allocating the same length row every time. If MyISAM tables maintain a list of deleted rows, the very first row in the list should always be selected as the next row to insert data. There would be no need to traverse a whole list until a suitable row gaps with sufficient length is found. Each deleted row is quickly appended after a DELETE. Each INSERT would pick the first row of the deleted rows.
We can assume these things because MyISAM tables can do concurrent inserts. In order for this feature to be available via the concurrent_insert option, INSERTs into a MyISAM table must be able to detect one of three(3) things:
The presence of a list of deleted rows, thus choosing from the list
Row_Format=Dynamic : list of deleted rows with each row with a different length
Row_Format=Fixed : list of deleted rows with all rows the same length
The absence of a list of deleted rows, thus appending
Bypass checking for the presence of a list of deleted rows (set concurrent_insert to 2)
For detection #1 to be the fastest possible, a MyISAM table's row_format must be Fixed. If it is Dynamic, it is very possible that a list traversal is necessary.
I have a MySQL MYISAM table (say tbl) consisting of 2 unsigned int fields, say, f1 and f2. There is an index on f2 and the table is very large (approximately 320,000,000+ rows). I update this table periodically (with approximately 100,000 new rows a week), and, in order to be able to search this table without doing an ORDER BY (which would be very time consuming in real-time queries), I physically ORDER the table according to the way in which I want to retrieve its rows.
So, I perform an ALTER TABLE tbl ORDER BY f1 DESC. (I know I have enough physical space on the server for a copy of the table.) I have read that during this operation, a temporary table is created and SELECT statements are not affected on the current rows.
However, I have experienced that this is not the case, and SELECT statements on the table that occur at the same time with the ALTER table are getting blocked and do not terminate. After the ALTER TABLE tbl completes (about 40 minutes on the production server), the SELECT statements on tbl start executing fine again.
Is there any reason why the "ALTER table tbl ORDER BY f1 DESC" seems to be blocking other clients from querying tbl?
Altering a table will always grab a lock on the table, preventing SELECTs from running.
I'll admin that I didn't even know you could do that with an ALTER TABLE.
What are you trying to get from the table? For example, all records in a given range? 320 million rows is not a trivial number. I'll give you my gut reactions:
Switch to InnoDB (allows #2, also gives transactions, but without #2 may hurt performance)
Partition the table (makes it act like a number of slightly smaller tables)
Consider a redesign, such as having a "working set" table and a "historical" table, basically manually partitioning. If you usually look for recently inserted data, this (along with partitioning) will help a lot. If your lookups are evenly distributed, this probably won't make a difference.
Consider adding a new column you could use in conjunction to narrow down selects (so instead of searching on date, search on date and customer ID)
Since I don't know what you're storing, some of these (such as #4) may not apply.
There are some other things you could try. OPTIMIZE TABLE may help you but take less time, but I doubt it. I think internally it's implemented as a dump/reload, at least on the InnoDB side.