Speeding up a MySQL DELETE that relies on a BIT column

I’m using MySQL 5.5.46 and have an InnoDB table with a BIT column (named “ENABLED”). There is no index on this column. The table has 26 million rows, so understandably, the statement
DELETE FROM my_table WHERE ENABLED = 0;
takes a really long time. My question is, is there anything I can do (without upgrading MySQL, which is not an option at this time), to speed up the time it takes to run this query? My “innodb_buffer_pool_size” variable is set to the following:
show variables like 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 11674845184 |
+-------------------------+-------------+

Do the DELETE in "chunks" of 1000, based on the PRIMARY KEY. See Delete Big. That article goes into details about efficient ways to chunk, and what to do about gaps in the PK.
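A minimal sketch of that chunking pattern (it assumes an AUTO_INCREMENT primary key named id, which the question does not state; the loop would live in application code or a stored procedure):
-- Walk the primary key in ranges of 1000 instead of scanning the whole table in one statement.
SET @lo := (SELECT MIN(id) FROM my_table);
SET @hi := (SELECT MAX(id) FROM my_table);
-- Repeat, advancing @lo by 1000 each iteration, until @lo > @hi:
DELETE FROM my_table
 WHERE id >= @lo
   AND id < @lo + 1000
   AND ENABLED = 0;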
(With that 11GB buffer_pool, I assume you have 16GB of RAM?)
In general, MySQL will do a table scan instead of using an index if the number of rows to be selected is more than about 20% of the total number of rows. Hence, almost never are "flag" fields worth indexing by themselves.

Related

Are full count queries really so slow on large MySQL InnoDB tables?

We have large tables with millions of entries. A full count is pretty slow, see code below. Is this quite common for a MySQL InnoDB table? Is there no way to accelerate this?
Even with the query cache it's still "slow".
I also wonder why the count on the "communication" table with 2.8 million entries is slower than the count on "transaction" with 4.5 million entries.
I know that it's much faster with a WHERE clause. I just want to know if the bad performance is normal.
We are using Amazon RDS MySQL 5.7 with an m4.xlarge (4 CPU, 16 GB RAM, 500 GB Storage). I've also already tried bigger instances with more CPU and RAM, but there is no big change on the query times.
mysql> SELECT COUNT(*) FROM transaction;
+----------+
| COUNT(*) |
+----------+
| 4569880 |
+----------+
1 row in set (1 min 37.88 sec)
mysql> SELECT COUNT(*) FROM transaction;
+----------+
| count(*) |
+----------+
| 4569880 |
+----------+
1 row in set (1.44 sec)
mysql> SELECT COUNT(*) FROM communication;
+----------+
| count(*) |
+----------+
| 2821486 |
+----------+
1 row in set (2 min 19.28 sec)
This is the downside of using a database storage engine that supports multi-versioning concurrency control (MVCC).
InnoDB allows your query to be isolated in a transaction, without blocking other concurrent clients who are reading and writing rows of data. Those concurrent updates don't affect the view of data your transaction has.
But what is the count of rows in the table, given that many of the rows are in progress of being added or deleted while you're doing the count? The answer is fuzzy.
Your transaction shouldn't be able to "see" row versions that were created after your transaction started. Likewise, your transaction should count rows even if someone else has requested they be deleted, but they did so after your transaction started.
The answer is that when you do a SELECT COUNT(*) — or any other type of query that needs to examine many rows — InnoDB has to visit every row, to see which is the current version of that row visible to your transaction's view of the database, and count it if it's visible.
In a table that doesn't support transactions or concurrent updates, like MyISAM, the storage engine keeps the total count of rows as metadata for the table. This storage engine can't support multiple threads updating rows concurrently, so the total count of rows is less fuzzy. So when you request SELECT COUNT(*) from a MyISAM table, it just returns the count of rows it has in memory (but this isn't useful if you do SELECT COUNT(*) with a WHERE clause to count some subset of rows by some condition, so it has to actually count them in that case).
In general, most people find InnoDB's support for concurrent updates is worth a lot, and they are willing to sacrifice the optimization of SELECT COUNT(*).
In addition to what Bill says...
Smallest index
InnoDB picks the 'smallest' index for doing COUNT(*). It could be that all of the indexes of communication are bigger than the smallest of transaction, hence the time difference. When judging the size of an index, include the PRIMARY KEY column(s) with any secondary index:
PRIMARY KEY(id), -- INT (4 bytes)
INDEX(flag), -- TINYINT (1 byte)
INDEX(name), -- VARCHAR(255) (? bytes)
For measuring size, the PRIMARY KEY is big since it includes (due to clustering) all the columns of the table. INDEX(flag) is "5 bytes". INDEX(name) probably averages a few dozen bytes. SELECT COUNT(*) will clearly pick INDEX(flag).
Apparently transaction has a 'small' index, but communication does not.
TEXT/BLOB columns are sometimes stored "off-record". Hence, they do not count in the size of the PK index.
Query Cache
If the "Query cache" is turned on, the second running of a query may be immensely faster than the first. But that is only if there were no changes to the table in the mean time. Since any change to the table invalidates all QC entries for that table, the QC is rarely useful in production systems. By "faster" I mean on the order of 0.001 seconds; not 1.44 seconds.
The difference between 1m38s and 1.44s is probably due to what was cached in the buffer_pool -- the general caching area for InnoDB. The first run probably found none of the 'smallest' index in RAM so it did a lot of I/O, taking 98 seconds to fetch all 4.5M rows of that index. The second run found all that data cached in the buffer_pool, so it ran at CPU speed (no I/O), hence much faster.
Good Enough
In situations like this, I question the necessity of doing the COUNT(*) at all. Notice how you said "2.8 million entries", as if 2 significant digits were "good enough". If you are displaying the count to users in a UI, won't that be "good enough"? If so, one solution to the performance problem is to do the count once a day and store it some place. This would allow instantaneous access to a "good enough" value.
There are other techniques. One is to keep the counter updated, either with active code, or with some form of Summary Table.
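A rough sketch of such a summary table (the table and column names here are invented, not from the question):
-- Counter table, refreshed e.g. once a day from a cron job or the event scheduler.
CREATE TABLE row_counts (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  row_count  BIGINT UNSIGNED NOT NULL
) ENGINE=InnoDB;

INSERT INTO row_counts (table_name, row_count)
SELECT 'communication', COUNT(*) FROM communication
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count);

-- Instantaneous, "good enough" read for the UI:
SELECT row_count FROM row_counts WHERE table_name = 'communication';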
Throwing hardware at it
You already found that changing the hardware did not help.
The 98s was as fast as any of RDS's I/O offerings can run.
The 1.44s was as fast as any one RDS CPU can run.
MySQL (and its variants) do not use more than one CPU per query.
You had enough RAM so that the entire 'small' index could stay in the buffer_pool until your second SELECT COUNT(*). (Too little RAM would have made the second run very slow as well.)

How to reclaim MySql disk space

I have a table on a MySQL server that contains around 1M rows. Because of just one column, the table is taking more disk space day by day. The datatype of this column is MEDIUMBLOB. The table size is around 90 GB.
After each row insertion, I do some processing, and after that I don't really require this column.
So if I set this column's value to NULL after processing the row, does MySQL reuse this empty space for the next row insertion or not?
MySQL server details
Server version: 5.7
Engine: InnoDB
Hosting: Google Cloud Sql
EDIT 1:
I deleted 90% of the rows from the table and then ran OPTIMIZE TABLE table_name, but it reduced disk usage by only 4 GB and is not reclaiming the free disk space.
EDIT 2
I even deleted my database and created a new DB and table, but the MySQL server is still showing 80 GB of disk space used. Sizes of all databases on the MySQL server:
SELECT table_schema "database name",
sum( data_length + index_length ) / 1024 / 1024 "database size in MB",
sum( data_free )/ 1024 / 1024 "free space in MB"
FROM information_schema.TABLES
GROUP BY table_schema;
+--------------------+---------------------+------------------+
| database name      | database size in MB | free space in MB |
+--------------------+---------------------+------------------+
| information_schema |          0.15625000 |      80.00000000 |
| app_service        |         15.54687500 |       4.00000000 |
| mysql              |          6.76713467 |       2.00000000 |
| performance_schema |          0.00000000 |       0.00000000 |
| sys                |          0.01562500 |       0.00000000 |
+--------------------+---------------------+------------------+
Thanks
Edit: It turns out from comments below that the user's binary logs are the culprit. It makes sense that the binary logs would be large after a lot of DELETEs, assuming the MySQL instance is using row-based replication (which logs every deleted row).
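On a self-managed server, a sketch of inspecting and trimming the binary logs would look like this (on a managed platform such as Google Cloud SQL, binary log retention is typically controlled through instance settings rather than by running PURGE yourself):
SHOW BINARY LOGS;                                -- lists each binlog file and its size
SHOW VARIABLES LIKE 'expire_logs_days';          -- automatic expiry setting in 5.7
PURGE BINARY LOGS BEFORE '2019-01-01 00:00:00';  -- example cutoff date; removes older binlogs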
The answer is complex.
You can save space by using NULL instead of real values. InnoDB uses only 1 bit per column per row to indicate that the value is NULL (see my old answer at https://stackoverflow.com/a/230923/20860 for details).
But this will just make space in the page where that row was stored. Each page must store only rows from the same table. So if you set a bunch of them NULL, you make space in that page, which can be used for subsequent inserts for that table only. It won't use the gaps for rows that belong to other tables.
And it still may not be reused for any rows of your MEDIUMBLOB table, because InnoDB stores rows in primary key order. The pages for a given table don't have to be consecutive, but I would guess the rows within a page may need to be in order. In other words, you might not be able to insert rows into a page in random primary-key order.
I don't know this detail for certain; you'd have to read Jeremy Cole's research on InnoDB storage to know the answer. Here's an excerpt:
The actual on-disk format of user records will be described in a future post, as it is fairly complex and will require a lengthy explanation itself.
User records are added to the page body in the order they are inserted (and may take existing free space from previously deleted records), and are singly-linked in ascending order by key using the “next record” pointers in each record header.
It's still not quite clear whether rows can be inserted out of order, and reuse space on a page.
So it's possible you'll only accomplish fragmenting your pages badly, and new rows with high primary key values will be added to other pages anyway.
You can make a better effort at reclaiming the space if you use OPTIMIZE TABLE from time to time, which effectively rewrites the whole table into new pages. This might re-pack the rows, fitting more rows into each page if you've changed values to NULL.
It would be more effective to DELETE rows you don't need, and then OPTIMIZE TABLE. This will eliminate whole pages, instead of leaving them fragmented.
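A minimal sketch under assumed names (the MEDIUMBLOB column is called payload and the "done processing" flag processed here; neither name comes from the question):
-- Free the large values (or DELETE the unneeded rows outright)...
UPDATE my_table SET payload = NULL WHERE processed = 1;
-- ...then rewrite the table. For InnoDB, OPTIMIZE TABLE rebuilds the table, so it
-- temporarily needs free disk space roughly equal to the table's size.
OPTIMIZE TABLE my_table;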

How to speed up SELECT statements on a table with a large VARCHAR field?

I have a simple MyISAM table, about 2.2 GB with 1.6 million records. For purposes of this question, the query is
SELECT 'one_unindexed_field' WHERE indexed_field > constant;
The only unusual thing about the table is that it has 1 large unindexed column VARCHAR(8192).
If 'one_unindexed_field' = the large VARCHAR field then the query takes about 7 times as long as the same query with any other field. This factor scales roughly the same for any number of records returned (say, 1000 to 100,000), so presumably you can assume the set of records returned fits into memory easily.
mysql> show global status like 'created_tmp_disk_tables';
reports that zero tmp_disk_tables are being created. EXPLAIN returns the same results for either query.
How can I speed up my queries on this table? If it's not possible, can someone explain what is going on?
key_buffer_size=256M
tmp_table_size=64M
max_heap_table_size=64M
myisam_sort_buffer_size=88M
read_buffer_size=1M
read_rnd_buffer_size=2M
Edit: Got some hits suggesting that changing ROW_FORMAT to FIXED would probably speed up my query ... so I did that, and it actually made the query slightly slower.
Edit: I'm on Win10 64-bit, Server version: 5.7.16-log MySQL Community Server (GPL)
EXPLAIN returns this:
mysql> EXPLAIN SELECT skw_stk_vol FROM tbl_skews WHERE (tday_date >= 42795);
           id: 1
  select_type: SIMPLE
        table: tbl_skews
   partitions: NULL
         type: range
possible_keys: ndx_skews_tday_date
          key: ndx_skews_tday_date
      key_len: 4
          ref: NULL
         rows: 406921
     filtered: 100
        Extra: Using index condition
If you want to improve this query, try creating a composite index so the lookup back into the table after finding the matching rows can be avoided:
(indexed_field, one_unindexed_field)
But the thing is, you don't say how long that query takes. Bringing back a large VARCHAR field will always be slower than an integer, simply because the data is larger.
So if a query like this works, then there isn't much more you can do:
SELECT `integer_unindexed_field` WHERE indexed_field > constant;
Because the problem isn't finding the rows; it's returning the result data.
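A hedged sketch of that composite index, using the table and column names from the EXPLAIN output in the question (the index name is invented; and if skw_stk_vol is the VARCHAR(8192) itself, MyISAM's 1000-byte key-length limit means only a prefix could be indexed, which would defeat the covering effect):
ALTER TABLE tbl_skews
  ADD INDEX ndx_skews_tday_vol (tday_date, skw_stk_vol);
-- If the index can be built in full, the query is then satisfied from the index alone
-- and EXPLAIN should show "Using index" in the Extra column:
SELECT skw_stk_vol FROM tbl_skews WHERE tday_date >= 42795;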

MySQL performance boost after create & drop index

I have a large MySQL, MyISAM table of around 4 million rows running in a core 2 duo, 8G RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
First thing: I'm not sure why another_index is chosen. In fact it chooses an index which is a composite index of indexed_varchar_column and another 2 columns (which form part of the selected ones). Perhaps this makes sense since it may make things a bit faster by not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. On the 2nd time I query against 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes again 5 seconds. So, it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again causes all further queries against a new 'something_other' to take only 0.15 secs. Please note that 1) I create an index 2) drop it again. So everything is in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this I get exactly the same as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B suggesting that when MySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because MySQL is relying on some sort of OS caching.
Any idea? How can I explicitly cache the table I guess?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows, but I remove lots of old rows regularly. I also add new ones. Since I had large gaps in IDs (for the rows deleted), every day I drop the primary index (ID) and create it again with consecutive numbers. The table may then be very fragmented and therefore I/O must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command on a cron job that runs every day. It takes 2 minutes but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';?
What is the order of the columns in your composite index?
Your query must use (at least) a leftmost prefix of the index's columns.
If you have an index on foo, bar, and baz, that will not be usable as an index against bar or baz by themselves. Only (foo), (foo,bar), and (foo,bar,baz).
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
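A small illustration of the leftmost-prefix rule (table, column, and index names are made up):
CREATE TABLE t (foo INT, bar INT, baz INT, INDEX idx_fbb (foo, bar, baz));

-- These can use idx_fbb:
EXPLAIN SELECT * FROM t WHERE foo = 1;
EXPLAIN SELECT * FROM t WHERE foo = 1 AND bar = 2;

-- This cannot (the leading column foo is missing):
EXPLAIN SELECT * FROM t WHERE bar = 2 AND baz = 3;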
EDIT: Here's a Postgres EXPLAIN of a simple left join query, for comparison.
Nested Loop Left Join  (cost=0.00..16.97 rows=13 width=103)
  Join Filter: (pagesets.id = pages.pageset_id)
  ->  Index Scan using ix_pages_pageset_id on pages  (cost=0.00..8.51 rows=13 width=80)
        Index Cond: (pageset_id = 515)
  ->  Materialize  (cost=0.00..8.27 rows=1 width=23)
        ->  Index Scan using pagesets_pkey on pagesets  (cost=0.00..8.27 rows=1 width=23)
              Index Cond: (id = 515)

Is MySQL an Efficient System for Searching Large Tables?

Say I have a large table, about 2 million rows and 50 columns. Using MySQL, how efficient would it be to search an entire column for one particular value, and then return the row number of said value? (Assume random distribution of values throughout the entire column)
If an operation like this takes an extended amount of time, what can I do to speed it up?
If the column in question is indexed, then it's pretty fast.
Don't be cavalier with indexes, though. The more indexes you have, the more expensive your writes will be (inserts/updates/deletes). Also, they take up disk space and RAM (and can easily be larger than the table itself). Indexes are good for querying, bad for writing. Choose wisely.
Exactly how fast are we talking here? That depends on the configuration of your DB machine. If it doesn't have enough RAM to hold the indexes and data, the operation may become disk-bound and performance will suffer (just as it would without an index). Assuming the machine is fine, it further depends on how selective your index is. If you have a table with 10M rows and you index a column with boolean values, you will get only a slight increase in performance. If, on the other hand, you index a column with many distinct values (such as user emails), the query will be orders of magnitude faster.
Also, by modern standards, a table with 2M rows is rather small :-)
The structure of the data makes a big difference here, because it will affect your ability to index. Have a look at MySQL's indexing options (FULLTEXT, etc.).
There is no easy answer to that question, it depends on more parameters about your data. As many others have advised you already, creating an index on the column you have to search (for an exact match, or starting with a string) will be quite efficient.
As an example, I have a MyISAM table with 27,000,000 records (6.7 GB in size) which holds an index on a VARCHAR(128) field.
Here are two sample queries (real data) to give you an idea:
mysql> SELECT COUNT(*) FROM Books WHERE Publisher = "Hachette";
+----------+
| COUNT(*) |
+----------+
| 15072 |
+----------+
1 row in set (0.12 sec)
mysql> SELECT Name FROM Books WHERE Publisher = "Scholastic" LIMIT 100;
...
100 rows in set (0.17 sec)
So yes, I think MySQL is definitely fast enough to do what you're planning to do :)
Create an index on that column.
Create an index on the column in question and performance should not be a problem.
In general - add an index on the column
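As a minimal sketch (table and column names here are placeholders, not from the question):
-- Add a secondary index on the searched column; the lookup then becomes an
-- index ref scan instead of a full table scan.
ALTER TABLE big_table ADD INDEX idx_search_col (search_col);
SELECT id FROM big_table WHERE search_col = 'some value';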