InnoDB, clustered indexes, and slow_query_log - hurt by the primary key? - mysql

In the last few months we've migrated a few tables from MyISAM to InnoDB. We did this, in theory, for the row-locking advantage, since we update individual rows from multiple web-scraping instances. I now have tens of thousands of slow queries building up in my slow_query_log (10s), and a lot of full table scans. Even very simple updates to one row (updating 4 or 5 columns) take 28 seconds. (Our I/O and efficiencies are very good, and failed/aborted attempts are very low, < 0.5%.)
The two tables we update the most have ID (int(11)) as the primary key. In InnoDB the primary key is a clustered key, so rows are written to disk in the order of ID. BUT our two most important record-identifying columns are BillsofLading and Container (both varchar(22)). Most of our DML queries look up records based on these two columns. We also have indexes on BillsofLading and Container.
The way I understand it, InnoDB also uses the primary key when creating these two secondary indexes.
So, I could have a record with ID=1 and BillsofLading='z', and another record with ID=9 and BillsofLading='a'. With InnoDB indexes, when updating a record based on a SELECT where BillsofLading='a', would I not still have to do a full scan to find 'a', since the index is based on the ID?
Thanks in advance for helping with the logic here!

No, your example should not require a full scan, assuming that MySQL is choosing to use your BillsofLading index. You say:
"The way I understand it, InnoDB also uses the primary key when creating these two secondary indexes."
This is correct, but it does not mean what you think it means.
A primary key (PK) to InnoDB is like a line number to a human being. It's InnoDB's only way to uniquely identify a row. So what a secondary index does is map each value of the target column(s) to all the PKs (i.e. line numbers) that match that value. Thus, MySQL can quickly jump to just the BillsofLading you care about, and then scan just those rows (PKs) for the one you want.
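The "line number" analogy can be sketched as a toy model. This is only an illustration, not how InnoDB is implemented: plain Python dicts stand in for the B-trees, and the Container values are made up for the example.

```python
# Toy model only: plain dicts stand in for InnoDB's B-trees.
# Clustered index: primary key (ID) -> full row.
# The Container values below are hypothetical.
clustered = {
    1: {"ID": 1, "BillsofLading": "z", "Container": "C-100"},
    9: {"ID": 9, "BillsofLading": "a", "Container": "C-200"},
}

# Secondary index on BillsofLading: column value -> list of matching PKs.
bol_index = {}
for pk, row in clustered.items():
    bol_index.setdefault(row["BillsofLading"], []).append(pk)


def find_by_bol(value):
    """Look up rows by BillsofLading without scanning the whole table."""
    pks = bol_index.get(value, [])        # step 1: secondary index -> PKs
    return [clustered[pk] for pk in pks]  # step 2: PK -> row in the clustered index


print(find_by_bol("a"))
```

The lookup for 'a' touches only the index entry for 'a' and then the row with ID=9; the row with ID=1 is never examined.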

Did you severely decrease key_buffer_size and set innodb_buffer_pool_size to about 70% of available RAM?
There are a number of subtle differences between MyISAM and InnoDB. There are too many to list in this answer, and nothing you have said brings to mind any particular issue. I suggest you review Converting MyISAM to InnoDB to see what might be causing trouble.

Related

InnoDB secondary index includes value instead of pointer to PK, how is it enough?

I am reading Effective MySQL: Optimizing SQL Statements, and in chapter 3 there was this explanation:
The secondary indexes in InnoDB use the B-tree data structure; however, they differ from the MyISAM implementation. In InnoDB, the secondary index stores the physical value of the primary key. In MyISAM, the secondary index stores a pointer to the data that contains the primary key value.
This is important for two reasons. First, the size of secondary indexes in InnoDB can be much larger when a large primary key is defined—for example when your primary key in InnoDB is 40 bytes in length. As the number of secondary indexes increase, the comparison size of the indexes can become significant. The second difference is that the secondary index now includes the primary key value and is not required as part of the index. This can be a significant performance improvement with table joins and covering indexes.
There are many questions that come to my mind, mostly due to lack of understanding of what author is trying to convey.
1. It is unclear what the author means by the second difference in the second paragraph. What is not required as part of the index anymore?
2. Does the InnoDB secondary index B-tree store only the PK value, or the PK value and a pointer to it, or the PK value and a pointer to the data row?
3. What kind of performance improvement would there be due to the storage method (2nd question's answer)?
This question contains an example and also an answer. He explains how it contains the PK value, but what I still do not understand is:
To complete the join, if the secondary index holds only the value and not a pointer, won't MySQL do a full index scan on the primary key index with that value from the secondary index? How would that be more efficient than having the pointer as well?
The secondary index is an indirect way to access the data. Unlike the primary (clustered) index, when you traverse the secondary index in InnoDB and you reach the leaf node you find a primary key value for the corresponding row the query is looking for. Using this value you traverse the primary index to fetch the row. This means 2 index look ups in InnoDB.
For MyISAM, because the leaf node of the secondary index holds a pointer to the actual row, you only require 1 index lookup.
The secondary index is formed based on certain attributes of your table that are not the PK. Hence the PK is not required to be part of the index by definition. Whether it is (InnoDB) or not (MyISAM) is implementation detail with corresponding performance implications.
Now the approach that InnoDB follows might at first seem inefficient in comparison to MyISAM (2 lookups vs 1 lookup), but in practice it usually is not: the frequently used blocks of the primary index tend to be cached in memory, so the penalty is low.
The advantage is that InnoDB can split and move rows to optimize the table layout on inserts/updates/deletes without needing to update the secondary indexes, since they do not refer to the affected rows directly.
Basics..
MyISAM's PRIMARY KEY and secondary keys work the same. -- Both are BTrees in the .MYI file where a "pointer" in the leaf node points to the .MYD file.
The "pointer" is either a byte offset into the .MYD file, or a record number (for FIXED). Either results in a "seek" into the .MYD file.
InnoDB's data, including the columns of the PRIMARY KEY, is stored in one BTree ordered by the PK.
This makes a PK lookup slightly faster. Both drill down a BTree, but MyISAM needs an extra seek.
Each InnoDB secondary key is stored in a separate BTree. But in this case the leaf nodes contain any extra columns of the PK. So, a secondary key lookup first drills down that BTree based on the secondary key. There it will find all the columns of both the secondary key and the primary key. If those are all the columns you need, this is a "covering index" for the query, and nothing further is done. (Faster than MyISAM.)
But usually you need some other columns, so the column(s) of the PK are used to drill down the data/PK BTree to find the rest of the columns in the row. (Slower than MyISAM.)
So, there are some cases where MyISAM does less work; some cases where InnoDB does less work. There are a lot of other things going on; InnoDB is winning many comparison benchmarks over MyISAM.
Caching...
MyISAM controls the caching of 1KB index blocks in the key_buffer. Data blocks are cached by the Operating System.
InnoDB caches both data and secondary index blocks (16KB in both cases) in the buffer_pool.
"Caching" refers to swapping in/out blocks as needed, with roughly a "least recently used" algorithm.
No BTree is loaded into RAM. No BTree is explicitly kept in RAM. Every block is requested as needed, with the hope that it is cached in RAM. For data and/or index(es) smaller than the associated buffer (key_buffer / buffer_pool), the BTree may happen to stay in RAM until shutdown.
The source-of-truth is on disk. (OK, there are complex tricks that InnoDB uses with log files to avoid loss of data when a crash occurs before blocks are flushed to disk. That cleanup automatically occurs when restarting after the crash.)
Pulling the plug..
MyISAM:
Mess #1: Indexes will be left in an unclean state. CHECK TABLE and REPAIR TABLE are needed.
Mess #2: If you are in the middle of UPDATEing a thousand rows in a single statement, some will be updated, some won't.
InnoDB:
As alluded to above, InnoDB performs things atomically, even across pulling the plug. No index is left mangled. No UPDATE is left half-finished; it will be ROLLBACKed.
Example..
Given
columns a,b,c,d,e,f,g
PRIMARY KEY(a,b,c)
INDEX(c,d)
The BTree leaf nodes will contain:
MyISAM:
for the PK: a,b,c,pointer
for secondary: c,d,pointer
InnoDB:
for the PK: a,b,c,d,e,f,g (the entire row is stored with the PK)
for secondary: c,d,a,b
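The leaf layouts above can be reproduced with a small sketch. The function names are hypothetical helpers written for this example; they only model which columns end up in a leaf node, not the actual storage format.

```python
def myisam_leaf(index_cols):
    """MyISAM: any index leaf holds the key columns plus a pointer into the .MYD file."""
    return list(index_cols) + ["pointer"]


def innodb_clustered_leaf(all_cols, pk_cols):
    """InnoDB PK: the entire row is stored in the leaf, keyed by the PK columns."""
    rest = [c for c in all_cols if c not in pk_cols]
    return list(pk_cols) + rest


def innodb_secondary_leaf(index_cols, pk_cols):
    """InnoDB secondary: key columns plus any PK columns not already in the key."""
    extra = [c for c in pk_cols if c not in index_cols]
    return list(index_cols) + extra


def is_covering(leaf_cols, needed_cols):
    """True if every column the query needs is already present in the leaf."""
    return set(needed_cols) <= set(leaf_cols)


columns = list("abcdefg")
pk = ["a", "b", "c"]
idx = ["c", "d"]

print("MyISAM PK:       ", myisam_leaf(pk))
print("MyISAM secondary:", myisam_leaf(idx))
print("InnoDB PK:       ", innodb_clustered_leaf(columns, pk))
print("InnoDB secondary:", innodb_secondary_leaf(idx, pk))

# A query needing only a and d is covered by INDEX(c,d) in InnoDB, but not in MyISAM.
print(is_covering(innodb_secondary_leaf(idx, pk), ["a", "d"]))
print(is_covering(myisam_leaf(idx), ["a", "d"]))
```

Note that column c appears only once in the InnoDB secondary leaf, because it is already part of the secondary key.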

Force hidden clustered index in innoDB

I have a table with IDs that are a hash of the "true primary key". Correct me if I'm wrong, but I think my inserts into this table are very slow because of the clustered index on this key (inserting 100,000 rows takes multiple minutes).
When I change the key to a nonclustered index, I have the impression that InnoDB still secretly clusters on it.
Is there a simple way to stop MySQL from clustering on my primary key without having to define an auto-increment primary key?
InnoDB must have a PRIMARY KEY.
1. InnoDB's first preference is an explicit PRIMARY KEY, whether AUTO_INCREMENT or not.
2. Then a UNIQUE key, but only if none of the columns are NULLable.
3. Finally, InnoDB will create a hidden, 6-byte integer that acts somewhat like an auto_increment.
Scenario 1. Inserting into a table must find the block where the desired primary key is. For AUTO_INCREMENT and for #3, above, that will be the "last" block in the table. The 100K rows will go into about 1000 blocks at the "end" of the table.
Scenario 2. Otherwise (non-AI, but explicit PK; or UNIQUE), a block needs to be found (possibly read from disk), the key checked for dup, then the block updated and marked for later rewriting to disk.
If all the blocks fit in the buffer_pool, then either of those is essentially the same speed. But if the table is too big to be cached, then Scenario 2 becomes slow -- in fact slower and slower as the table grows. This is because of I/O. GUIDs, UUIDs, MD5s, and other hashes are notorious at suffering from this slow-down.
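The difference between the two scenarios can be illustrated with a rough simulation of an LRU block cache. All sizes here are assumptions picked for the illustration (100 rows per block, a 50,000-block table, a cache holding 10% of it), and the model ignores everything except which block each insert touches.

```python
import random
from collections import OrderedDict


def count_block_misses(block_ids, cache_blocks):
    """Count how many accesses find their block absent from an LRU cache."""
    cache = OrderedDict()
    misses = 0
    for block in block_ids:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # block must be fetched (or created)
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


ROWS = 100_000
ROWS_PER_BLOCK = 100   # assumption
TABLE_BLOCKS = 50_000  # assumption: the table is far larger than the cache
CACHE_BLOCKS = 5_000   # assumption: cache holds 10% of the table

# Scenario 1: increasing keys always land in the "last" block of the table.
sequential = [TABLE_BLOCKS + i // ROWS_PER_BLOCK for i in range(ROWS)]

# Scenario 2: hash-like keys land in an effectively random block.
rng = random.Random(42)
scattered = [rng.randrange(TABLE_BLOCKS) for _ in range(ROWS)]

seq_misses = count_block_misses(sequential, CACHE_BLOCKS)
rand_misses = count_block_misses(scattered, CACHE_BLOCKS)
print("sequential keys:", seq_misses, "block misses")
print("scattered keys: ", rand_misses, "block misses")
```

Under these assumptions the sequential inserts touch each of their 1000 blocks once, while the scattered inserts miss the cache on most rows. In a real server each of those misses can mean a disk read.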
Another issue: transaction integrity dictates that each transaction incurs some extra I/O. Are your 100K inserts 100K transactions? 1 transaction? Best is to batch them in groups of 100 to 1000 rows per transaction.
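A minimal sketch of the batching idea, assuming the rows are already in a list. The helper name and the row contents are hypothetical, and the database calls are shown only as a comment because they depend on the client library in use.

```python
def batches(rows, batch_size=1000):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical data: 100K (id, value) rows.
rows = [(i, f"value-{i}") for i in range(100_000)]

transactions = 0
for batch in batches(rows, 1000):
    # With a real connection, each batch would become one transaction, e.g.
    # a multi-row INSERT followed by a single COMMIT.
    transactions += 1

print(transactions, "transactions instead of", len(rows))
```

With 1000 rows per batch, the 100K inserts pay the per-transaction cost 100 times rather than 100,000 times.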
I hope those principles let you figure out your situation. If not, please provide CREATE TABLE for each of the options you are thinking about. Then we can discuss your details. Also provide SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; and how much RAM you have.

Mysql/mariadb innodb: does row size affect complex query performance?

I have an InnoDB table with millions of rows (statistics events) on my MariaDB 10 server. For historical reasons each row has a long char(44) user-id field (used as a non-unique key) along with 30 other int/varchar fields (row size is about 240 bytes). My system does cohort analysis, funnels, event segmentation and other common statistics, so some queries are very complex with many joins. Now I have an opportunity to add a 4-byte int field and use it as the user-id and as the main non-unique key for all queries. But I need to keep the old symbolic char(44) user-id in this table because of implementation details: some data sources are not mine and send events only with symbolic user-ids.
So the question is: will keeping or removing this char(44) field affect, in general, the performance of complex queries? It will just stay like the other char fields, and it will not be used as a key in queries anymore. I'd prefer not to split the table because a lot of code depends on its structure.
Thanks!
Tested Aria, and found that it is ~1.5x slower than InnoDB for my purposes, even on simple joins. InnoDB with the "redundant" row format works even faster. So no, Aria is not a compromise; it is even slower than MyISAM. I suppose InnoDB is XtraDB in MariaDB 10, which would explain the speed.
I also did some testing on a self-join query and found that keeping or removing the char(44) field has no effect on query performance if we're not using this field.
And moving from char(44) key to int makes queries 2x faster!
Switching to a shorter integer key will help query performance a little bit. The indexing overhead of fixed length character columns isn't hideous.
Stuffing more RAM and/or some SSD disks into your database server will most likely cost less than refactoring your program, as you have mentioned.
What will really help your query performance is the creation of appropriate compound covering indexes. If you have queries that can be satisfied just from such an index, things will get faster.
For example, if you do a lot of
SELECT integer_user_id
FROM table
WHERE character_user_id = 'constant'
then a compound index on (character_user_id, integer_user_id) will make this query very fast, because the query can be satisfied from the index alone.
Be careful when you add lots of indexes: there's a penalty to pay upon INSERT or UPDATE in tables with many indexes.

How can I speed-up the table reconstruction in MySQL when altering the schema?

I have a relatively big MySQL InnoDB table (compressed), and I sometimes need to alter its schema (increasing column size or adding a field).
It takes around 1 hour for a 500 MB table with millions of rows, but the server doesn't seem to be very busy (CPU ~5%, not much RAM used, and 2.5 MB/s of I/O).
The table is not used in production so there are no concurrent requests at the same time. There is only a primary index (on the first 5 columns) and one foreign key constraint.
Do you have any suggestion on how to speed-up the table alteration process?
Changing storage engine (to newer generation engines like TokuDB) seems the way to go, until InnoDB is "fixed".
It would be helpful to know the exact table and primary key/index definitions and, of lesser importance, the row count to the nearest million, although I would guess that since the table is only 500 MB it's probably less than 20 million rows. Also, what is your approach to changing the table: are you creating a new schema and inserting into it, or using an ALTER TABLE, etc.?
I've had success in this area before with approaches like:
- changing the index key composition, adding a unique key
- dropping the indexes first, then changing the table, then adding the indexes back. Sometimes independent indexes can be a real performance killer if they are affected by the change.
- optimizing the table structure to remove unneeded or oversized columns
- separating out data (normally columns, but you can vertically partition in some circumstances) that won't change from the core structure that might change, so you only churn a smaller part of your table

Mysql: Unique Index = Performance Characteristics for large datasets?

What are the performance characteristics of unique indexes in MySQL, and of indexes in general (like the primary key index)?
Given that I will insert or update a record in my database: will the speed of updating the record (= building/updating the indexes) be different if the table has 10 thousand records compared to 100 million records? Or, said differently, does the index-building time after changing one row depend on the total index size?
Does this also apply to any other indexes in MySQL, like the primary key index?
Thank you very much
Tom
Most indexes in MySQL are really the same internally -- they're B-tree data structures. As such, updating a B-tree index is an O(log n) operation. So it does cost more as the number of entries in the index increases, but not badly.
In general, the benefit you get from an index far outweighs the cost of updating it.
A typical MySQL implementation of an index is as a set of sorted values (not sure if any storage engine uses different strategies, but I believe this holds for the popular ones) -- therefore, updating the index inevitably takes longer as it grows. However, the slow-down need not be all that bad -- locating a key in a sorted index of N keys is O(log N), and it's possible (though not trivial) to make the update O(1) (at least in the amortized sense) after the finding. So, if you square the number of records as in your example, and you pick a storage engine with highly optimized implementation, you could reasonably hope for the index update to take only twice as long on the big table as it did on the small table.
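The O(log N) argument can be checked with a little arithmetic. This sketch assumes a fanout of 100 keys per B-tree node, a round number chosen for illustration; the real fanout depends on key size and block size.

```python
def btree_levels(n_keys, fanout=100):
    """Smallest number of levels whose capacity (fanout ** levels) holds n_keys.

    fanout=100 is an assumed, illustrative value.
    """
    levels = 1
    capacity = fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


small = btree_levels(10_000)       # 10 thousand records
large = btree_levels(100_000_000)  # 100 million records
print(small, "levels vs", large, "levels")
```

Going from 10 thousand to 100 million rows squares the row count but, under this assumption, only doubles the number of levels a lookup has to descend.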
Note that if new primary key values are always larger than the previous ones (e.g. an auto-increment integer field), your index will not need to be rebuilt.