Difference in datafile size for the same data? - mysql

I use MariaDB 10.1.16.
The job this time is very simple:
Select data from Oracle and make a CSV file.
Load that into MariaDB using the LOAD DATA INFILE command.
The DB engine is InnoDB.
The data row count is 6497641.
Both tables are created with the same query.
The PK is an auto_increment int column.
The rows were created by:
TABLE1 - LOAD DATA INFILE ...
TABLE2 - INSERT INTO TABLE2 SELECT * FROM TABLE1 ...
The sizes of the tables are below.
TABLE1 - 3.3GBytes
TABLE2 - 1.9GBytes
The contents of mysql.innodb_table_stats are below.
TABLE1: n_rows(5438171) , clustered_index_size(196096), sum_of_other_index_sizes(12853)
TABLE2: n_rows(6407131) , clustered_index_size(106048), sum_of_other_index_sizes(12273)
I want to know why the sizes of the files are different.
Thank you.
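The load commands were essentially the following (the CSV path and the delimiters here are just placeholders):
LOAD DATA INFILE '/tmp/oracle_export.csv'
INTO TABLE TABLE1
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

INSERT INTO TABLE2 SELECT * FROM TABLE1;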

The order of the rows can make a big difference. If the data is sorted by the PRIMARY KEY as it is inserted, the blocks will be packed nearly full. If the rows are randomly sorted, the end result will be blocks that are about 69% full. This is the nature of inserting into a BTree.
n_rows is just an approximation, hence the inconsistent count. The other pair of values is, I think, an exact count of 16KB blocks.
Since the PK is "clustered" with the data, the clustered_index_size is the size of the data, plus some overhead for the BTree on the PK. Plus a lot of overhead and/or wasted space (as mentioned above).

Related

MySQL Clustered vs Non Clustered Index Performance

I'm running a couple tests on MySQL Clustered vs Non Clustered indexes where I have a table 100gb_table which contains ~60 million rows:
100gb_table schema:
CREATE TABLE 100gb_table (
    id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
    c1 int,
    c2 text,
    c3 text,
    c4 blob NOT NULL,
    c5 text,
    c6 text,
    ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
and I'm executing a query that only reads the clustered index:
SELECT id FROM 100gb_table ORDER BY id;
I'm seeing that it takes ~55 min for this query to complete, which is strangely slow. I modified the table by adding another index on top of the Primary Key column and ran the following query, which forces the non-clustered index to be used:
SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except the clustered index contains table data in the leaf nodes so I would expect both queries to be similarly performant. Could the BLOB column possibly be distorting the clustered index structure?
The answer comes in how the data is laid out.
The PRIMARY KEY is "clustered" with the data; that is, the data is order ed by the PK in a B+Tree structure. To read all of the ids, the entire BTree must be read.
Any secondary index is also in a B+Tree structure, but it contains (1) the columns of the index, and (2) any other columns in the PK.
In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.
Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.
Tack on ORDER BY or WHERE, etc, and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of having extra [random] fetches to get the texts/blobs from the "off-record" storage. And from the network time to shovel far more data.
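If you want to see the size gap for yourself, one way is to compare the per-index page counts InnoDB keeps; a sketch (substitute your real schema name):
SELECT index_name,
       stat_value * @@innodb_page_size AS approx_bytes
FROM mysql.innodb_index_stats
WHERE database_name = 'mydatabase'
  AND table_name = '100gb_table'
  AND stat_name = 'size';
The PRIMARY (clustered) index should come out far larger than non_clustered_key, which is the size gap behind the 55-minute vs <10-minute difference.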

Deleting Billion records in a range vs exact ID lookup MYSQL

I have a database table which is around 700GB with 1 billion rows; the data is approximately 500GB and the index is 200GB.
I am trying to delete all the data before 2021.
Roughly 298,970,576 rows fall in the range to be deleted, and 708,337,583 rows will remain.
To delete these I am running a non-stop query from my Python shell:
DELETE FROM table_name WHERE id < 1762163840 LIMIT 1000000;
The id 1762163840 corresponds to the start of the 2021 data. Deleting 1 million rows takes almost 1200-1800 sec.
Is there any way I can speed this up? The current approach has been running for more than 15 days; not much data has been deleted so far and it is going to take many more days.
I thought that I could make a table with just the ids of all the records that I want to delete and then do an exact match, like
DELETE FROM table_name WHERE id IN (SELECT id FROM _tmp_table_name);
Will that be fast? Is it going to be faster than first making a new table with all the records and then deleting it?
The database is setup on RDS and instance class is db.r3.large 2 vCPU and 15.25 GB RAM, only 4-5 connections running.
I would suggest recreating the data you want to keep -- if you have enough space:
create table keep_data as
select *
from table_name
where id >= 1762163840;
Then you can truncate the table and re-insert new data:
truncate table table_name;
insert into table_name
select *
from keep_data;
This will recreate the index.
The downside is that this will still take a while to re-insert the data (renaming keep_data would be faster). But it should be much faster than deleting the rows.
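If you take the renaming route mentioned above, the swap itself is nearly instant; a sketch (this assumes you have already added the original table's indexes to keep_data, since CREATE TABLE ... AS SELECT does not copy them):
-- swap the tables atomically, then discard the old data
RENAME TABLE table_name TO table_name_old,
             keep_data TO table_name;
DROP TABLE table_name_old;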
AND . . . this will give you the opportunity to partition the table so future deletes can be handled much faster. You should look into table partitioning if you have such a large table.
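As a rough illustration of what that partitioning could look like (assuming id is the primary key; the only boundary taken from the question is 1762163840, the rest is made up):
-- one-time rebuild: this copies the whole table, so do it when you can afford it
ALTER TABLE table_name
PARTITION BY RANGE (id) (
    PARTITION p_old  VALUES LESS THAN (1762163840),
    PARTITION p_rest VALUES LESS THAN MAXVALUE
);
-- afterwards, dropping the old range is a quick metadata operation
ALTER TABLE table_name DROP PARTITION p_old;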
Multiple techniques for big deletes: http://mysql.rjweb.org/doc.php/deletebig
It points out that LIMIT 1000000 is unnecessarily big and causes more locking than might be desirable.
In the long run, PARTITIONing would be beneficial; the link discusses that, too.
If you do Gordon's technique (rebuilding table with what you need), you lose access to the table for a long time; I provide an alternative that has essentially zero downtime.
id IN (SELECT...) can be terribly slow -- both because of the inefficiency of the IN (SELECT ...) construct and because the DELETE will hang on to a huge number of rows for transactional integrity.
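To make the smaller-batch idea concrete, something along these lines, repeated until no rows are affected (the 10,000 batch size is arbitrary; the link above walks the PK in ranges instead, which avoids rescanning delete-marked rows each pass):
DELETE FROM table_name
WHERE id < 1762163840
ORDER BY id
LIMIT 10000;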

Index Creation after Load Data InFile

I'm using MySQL v5.6.
I'm inserting about 10 million rows into a newly created table (InnoDB). I am trying to choose the best way to do this, between "Load Data InFile" and multiple inserts.
"Load Data InFile" should be (and is) more efficient, but I'm observing a weird thing: the index creation takes much longer (by 15%) when using "load data infile"...
Steps to observe that (each step starts when the previous one is completely done):
1. I create a new table (table_1).
2. I create a new table (table_2).
3. I insert 10 million rows into table_1 with multiple inserts (batches of 5000).
4. I insert 10 million rows into table_2 with LOAD DATA INFILE.
5. I create 4 indexes at a time (with ALTER TABLE) on table_1.
6. I create 4 indexes at a time (with ALTER TABLE) on table_2 -> about 15% longer than the previous step.
What could explain that?
(Of course, results are the same with steps ordered 2, 1, 4, 3, 6, 5.)
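For concreteness, the index-creation steps (5 and 6) are along these lines (the column and index names are just placeholders):
ALTER TABLE table_1
    ADD INDEX idx_a (col_a),
    ADD INDEX idx_b (col_b),
    ADD INDEX idx_c (col_c),
    ADD INDEX idx_d (col_d);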
It's possible that the data load with INSERT resulted in more data pages left occupying the buffer pool. When creating the indexes on the table that used LOAD DATA, it first had to load pages from disk into the buffer pool, and then index the data in them.
You can test this by querying after you load data:
SELECT table_name, index_name, COUNT(*)
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
WHERE table_name IN ('`mydatabase`.`table_1`', '`mydatabase`.`table_2`')
GROUP BY table_name, index_name;
Then do this again after you build your indexes.
(Of course replace mydatabase with the name of the database you create these tables in.)

Mysql delete and optimize very slow

I searched the Internet and Stack Overflow for my problem, but couldn't find a good solution.
I have a table (MySQL, MyISAM) containing 300,000 rows (one column is a BLOB field).
I must use:
DELETE FROM tablename WHERE id IN (1,4,7,88,568,.......)
There are nearly 30,000 ids in the IN list.
It takes nearly 1 hour. It also does not make the .MYD file smaller, even though I delete 10% of the rows, so I run an OPTIMIZE TABLE... command. That also takes a long time. (I have to use it, because disk space matters to me.)
What is a way to improve performance when deleting the data as above and to recover the space? (Increasing a buffer size? Which one? Or something else?)
With IN, MySQL will scan all the rows in the table and match each record against the IN clause. The list of IN predicates will be sorted, and all 300,000 rows in the database will get a binary search against the 30,000 ids.
If you do this with JOIN on a temporary table (no indexes on a temp table), assuming id is indexed, the database will do 30,000 binary lookups on a 300,000 record index.
So, 300,000 binary searches against 30,000 records, or 30,000 binary searches against 300,000 records... which is faster? The second one is faster, by far.
Also, delaying the index rebuilding with DELETE QUICK will result in much faster deletes. All records will simply be marked deleted, both in the data file and in the index, and the index will not be rebuilt.
Then, to recover space and rebuild the indexes at a later time, run OPTIMIZE TABLE.
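Putting those pieces together, a sketch (tablename is the table from the question; the id list is truncated here):
CREATE TEMPORARY TABLE ids_to_delete (id INT NOT NULL);
INSERT INTO ids_to_delete (id) VALUES (1),(4),(7),(88),(568);   -- ... and the rest of the 30,000 ids
-- QUICK skips merging index leaves during the delete (MyISAM)
DELETE QUICK t FROM tablename AS t
JOIN ids_to_delete AS d ON t.id = d.id;
-- later, when convenient, reclaim the .MYD space and rebuild the indexes
OPTIMIZE TABLE tablename;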
The size of the list in your IN() statement may be the cause. You could add the IDs to a temporary table and join to do the deletes. Also, as you are using MyISAM you can use the DELETE QUICK option to avoid the index hit whilst deleting:
For MyISAM tables, if you use the QUICK keyword, the storage engine does not merge index leaves during delete, which may speed up some kinds of delete operations.
I think the best approach to make it faster is to create a new table, insert into it the rows that you don't want to delete, rename the new table to the original name, and then drop the original table.
Something like this:
INSERT INTO NewTable SELECT * FROM My_Table WHERE ... ;
Then you can use RENAME TABLE to rename the copy to the original name
RENAME TABLE My_Table TO My_Table_old, NewTable TO My_Table ;
And then finally drop the original table
DROP TABLE My_Table_old;
Try this:
Create a table named temptable with a single column, id.
Insert the ids 1,4,7,88,568,...... into it.
Then use a multi-table DELETE join, something like:
DELETE a FROM originaltable AS a INNER JOIN temptable AS b ON a.id = b.id;
It's just an idea and the query is untested, so double-check the syntax before running it.

Which type of table storage will give high performance?

A table with 3 columns and 1,000,000 records. Another table with 20 columns and 5,000,000 records. Given the above, which table returns results more quickly when querying for data, provided both tables have an auto_increment value as the primary key?
To put it more clearly:
Let's say table1 has 3 columns with 1 million records and 1 field indexed, and table2 has 30 columns with 10 lakh (1 million) records and 5 fields indexed. If I run a query to select data from table1 and then a query to fetch data from table2 (the queried columns are indexed on both tables), which table returns its output more quickly?
Based on the sizes you mentioned, the tables are so small that it won't matter.
Generally speaking, though, MyISAM will be a bit faster than InnoDB for pretty much any table, although it seems like that gap is closing all the time.
Keep in mind, though, that for a small performance penalty, InnoDB gives you a lot in terms of ACID compliance.