InnoDB row level locking performance - how many rows?

I just read a lot of stuff about MyISAM and InnoDB as I have to decide which type to use.
Row-level locking support was always mentioned for InnoDB. Of course, this only makes sense beyond a certain number of rows.
Roughly how many would that be?
EDIT: Apparently I worded my question poorly. I know what table locking and row locking mean, but I wondered when the difference actually matters.
If I have just 100 rows inserted per day, table locking is obviously more than enough, but for a case of, let's say, 100 rows per SECOND, I think InnoDB would be the better choice.
My question: does row locking also make sense for 10 rows per second, or 5 rows per second? When does this choice significantly affect performance?

It's not entirely clear what you're asking. Locking ensures that only one user attempts to modify a given row at any given time. Row-level locking means only the one row they're modifying is locked. The usual alternatives are to either lock the entire table for the duration of the modification, or else to lock some subset of the table. Row-level locking simply reduces that subset of the rows to the smallest number that still ensures integrity.
The idea is to allow one user to modify one thing without preventing other users from modifying other things. It's worth noting, however, that in some cases this can be something of a false positive, so to speak. A few databases support row-level locking, but make a row-level lock considerably more expensive than locking a larger part of the table -- enough more expensive that it can be counterproductive.
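To make the difference concrete, here is a minimal two-session sketch (the accounts table and its columns are purely illustrative, and id is assumed to be the primary key). With InnoDB's row-level locking the two sessions proceed in parallel; with a table-level lock the second UPDATE would have to wait for the first transaction to finish:

    -- session 1
    START TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- locks only the row with id = 1

    -- session 2, running at the same time
    START TRANSACTION;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- locks only id = 2, does not block

    -- each session commits independently
    COMMIT;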
Edit: Your edit to the original post helps, but not really a lot. First of all, the sizes of rows and levels of hardware involved have a huge effect (inserting an 8-byte row onto a dozen striped 15K SAS hard drives is just a tad faster than inserting a one megabyte row onto a single consumer class hard drive).
Second, it's largely about the number of simultaneous users, so the pattern of insertion makes a big difference. 1000 rows inserted at 3 AM probably won't be noticed at all. 1000 rows inserted evenly throughout the day means a bit more (but probably only a bit). 1000 rows inserted as a batch right when 100 other users need data immediately might get somebody fired (especially if one of those 100 is the owner of the company).

MyISAM tables support concurrent inserts (aka no table lock for inserts). So if you meet the criteria, there's no problem:
http://dev.mysql.com/doc/refman/5.0/en/concurrent-inserts.html
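For reference, the behaviour described on that page is controlled by the concurrent_insert system variable (0 = never, 1 = only when the data file has no holes left by deletes, which is the default, 2 = always). A quick way to check and, if appropriate, loosen it:

    SHOW VARIABLES LIKE 'concurrent_insert';
    -- allow concurrent inserts even when the table has holes in the data file
    SET GLOBAL concurrent_insert = 2;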
So, like most things, the answer is "it depends". There is no bright line test. Only you can make the determination; we know nothing about your application/hardware/usage statistics/etc. and, by definition, can't know more about it than you do.

Related

MySQL and a table with 100+ million rows

I have a few tables with more than 100 million rows.
I get about 20-40 million new rows each month.
At this moment everything seems fine:
- all inserts are fast
- all selects are fast (they are using indexes and don't use complex aggregations)
However, I am worried about two things I've read somewhere:
- When a table has a few hundred million rows, inserts might become slow, because it might take a while to re-balance the indexes (binary trees)
- If an index doesn't fit into memory, it might take a while to read it from different parts of the disk
Any comments would be highly appreciated, as would any suggestions on how I can avoid the problem or fix/mitigate it if/when it happens.
(I know we should start sharding at some point.)
Thank you in advance.
Today is the day you should think about sharding or partitioning, because if you have 100MM rows today and you're gaining ~30MM per month, you're going to double the size of that table in three months, and possibly double it again before the year is out.
At some point you'll hit an event horizon where your database is too big to migrate. Either you don't have enough working space left on your disk to switch to an alternate schema, or you don't have enough down-time to perform the migration before it needs to be operational again. Then you're stuck with it forever as it gets slower and slower.
The performance of write activity on a table is largely a function of how difficult the indices are to maintain. The more data you index, the more punishing writes can be. The type of index is also relevant; some are more compact than others. If your data is lightly indexed you can usually get away with having more records before things start to get cripplingly slow, but that degradation factor is highly dependent on your system configuration, your hardware, and your IO capacity.
Remember, InnoDB, the engine you should be using, has a lot of tuning parameters and many people leave it set to the really terrible defaults. Have a look at the memory allocated to it and be sure you're doing that properly.
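As a rough illustration of that tuning (the numbers below are assumptions for a dedicated database server with, say, 48 GB of RAM; adjust to your hardware):

    -- see how much memory the buffer pool currently has (value is in bytes)
    SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

    -- in MySQL 5.7+ the pool can be resized online; on older versions set
    -- innodb_buffer_pool_size in my.cnf and restart. Aim for roughly 70% of
    -- RAM on a dedicated database server, e.g. ~34 GB on a 48 GB box.
    SET GLOBAL innodb_buffer_pool_size = 34 * 1024 * 1024 * 1024;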
If you have any way of partitioning this data, like by month, by customer, or by some other factor that is not going to change based on business logic (that is, the data is intrinsically not related), you will have many simple options. If not, you'll have to make some hard decisions.
The one thing you want to be doing now is simulating what your table's performance is like with 1G rows in it. Create a sufficiently large, suitably varied amount of test data, then see how well it performs under load. You may find it's not an issue, in which case, no worries for another few years. If not, start panicking today and working towards a solution before your data becomes too big to split.
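A hedged sketch of that kind of load test (table and column names are invented; shape it like your real table). The repeated INSERT ... SELECT roughly doubles the row count on each pass, which is a quick way to grow a test table:

    CREATE TABLE perf_test (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      customer_id INT NOT NULL,
      created_at  DATETIME NOT NULL,
      payload     VARCHAR(255) NOT NULL,
      KEY idx_customer_created (customer_id, created_at)
    ) ENGINE=InnoDB;

    -- seed a few rows
    INSERT INTO perf_test (customer_id, created_at, payload)
    VALUES (1, NOW(), 'seed'), (2, NOW(), 'seed'), (3, NOW(), 'seed');

    -- repeat until the table is big enough; each pass doubles it with random values
    INSERT INTO perf_test (customer_id, created_at, payload)
    SELECT FLOOR(RAND() * 100000),
           NOW() - INTERVAL FLOOR(RAND() * 365) DAY,
           REPEAT('x', 200)
    FROM perf_test;

    -- then time your real queries against it
    SELECT COUNT(*) FROM perf_test WHERE customer_id = 42;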
Database performance generally degrades in a fairly linear fashion, and then at some point it falls off a cliff. You need to know where this cliff is so you know how much time you have before you hit it. The sharp degradation in performance usually comes when your indexes can't fit in memory and when your disk buffers are stretched too thin to be useful.
I will attempt to address the points being made by the OP and the other responders. The Question only touches the surface; this Answer follows suit. We can dig deeper in more focused Questions.
A trillion rows gets dicey. 100M is not necessarily problematic.
PARTITIONing is not a performance panacea. The main case where it can be useful is when you need to purge "old" data. (DROP PARTITION is a lot faster than DELETEing a zillion rows; see the partitioning sketch at the end of this answer.)
INSERTs with an AUTO_INCREMENT PRIMARY KEY will 'never' slow down. The same applies to any temporal key and/or small set of "hot spots". For example, PRIMARY KEY(stock_id, date) is limited to as many hot spots as you have stocks.
INSERTs with a UUID PRIMARY KEY will get slower and slower. But this applies to any "random" key.
Secondary indexes suffer the same issues as the PK, only later. This is because it depends on the size of the BTree. (The data's BTree, ordered by the PK, is usually bigger than each secondary key's.)
Whether an index (including the PK) "fits in memory" matters only if the inserts are 'random' (as with a UUID).
For Data Warehouse applications, it is usually advisable to provide Summary Tables instead of extra indexes on the 'Fact' table. This yields "report" queries that may be as much as 10 times as fast.
Blindly using AUTO_INCREMENT may be less than optimal.
The BTree for the data or index of a million-row table will be about 3 levels deep. For a trillion rows, 6 levels. This "number of levels" has some impact on performance.
Binary trees are not used; instead BTrees (actually B+Trees) are used by InnoDB.
InnoDB mostly keeps its BTrees balanced without much effort. Don't worry about it. (And don't use OPTIMIZE TABLE.)
All activity is done on 16KB blocks (of data or index) and done in RAM (in the buffer_pool). Neither a table nor an index is "loaded into RAM", at least not explicitly as a whole unit.
Replication is useful for read scaling. (And readily available in MySQL.)
Sharding is useful for write scaling. (This is a DIY task.)
As a Rule of Thumb, keep half of your disk free for various admin purposes on huge tables.
Before a table gets into the multi-GB size range, it is wise to re-think the datatypes and normalization.
The main tunable in InnoDB (these days) is innodb_buffer_pool_size, which should (for starters) be about 70% of available RAM.
ROW_FORMAT=COMPRESSED is often not worth using.
YouTube, Facebook, Google, etc, are 'on beyond' anything discussed in this Q&A. They use thousands of servers, custom software, etc.
If you want to discuss your specific application, let's see some details. Different apps need different techniques.
My blogs, which provide more details on many of the above topics: http://mysql.rjweb.org
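Finally, to illustrate the DROP PARTITION point from the list above (table and column names are invented for the example), a RANGE partition by month lets you drop a whole month's data as a near-instant metadata operation instead of DELETEing it row by row:

    CREATE TABLE events (
      id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      created_at DATETIME NOT NULL,
      payload    VARCHAR(255) NOT NULL,
      PRIMARY KEY (id, created_at)  -- the partitioning column must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
      PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
      PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- purge January in one statement, far faster than a huge DELETE
    ALTER TABLE events DROP PARTITION p2015_01;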

MySQL query speed or rows read

Sorry for all the text. The most important stuff is in the last 3 paragraphs :D
Recently we had a MySQL problem on one of our client servers. Out of the blue, the CPU usage of the mysql process started skyrocketing. This led us to finding and optimizing bad queries, and here is the problem.
I was thinking that optimization means speeding up queries (the total time needed for a query to execute). But after optimizing several queries for speed, my colleague started complaining that some queries read too many rows, even all rows from a table (as shown by EXPLAIN).
After rewriting a query I noticed that if I want a query to read fewer rows, query speed suffers; if the query is tuned for speed, more rows are read.
And that didn't make sense to me: fewer rows read, but longer execution time.
And that made me wonder what should be done. Of course it would be perfect to have a fast query that reads the fewest rows. But since that doesn't seem possible for me, I'm searching for some answers. Which approach should I take: speed, or fewer rows read? What are the pros and cons when a query is fast but reads more rows, and when fewer rows are read but speed suffers? What happens to the server in the different cases?
After googling, all I could find were articles and discussions about how to improve speed, but none covered the different cases I mentioned above.
I'm looking forward to seeing even personal choices, of course with some reasoning.
Links that could point me in the right direction are welcome too.
I think your problem depends on how you are limiting the number of rows read. If you read fewer rows by adding more WHERE conditions that MySQL needs to evaluate, then yes, performance will take a hit.
I would look at indexing some of the columns that make your search more complex. Simple data types are faster to look up than complex ones. Check whether you are searching against indexed columns.
Without more data, I can give you some hints:
Be sure your tables are properly indexed. Create the appropriate indexes for each of your tables. Also drop the indexes that are not needed.
Decide the best approach for each query. For example, if you use GROUP BY only to deduplicate rows, you are wasting resources; it is better to use SELECT DISTINCT (on an indexed field).
"Divide and conquer". Can you split your process in two, three or more intermediate steps? If the answer is "yes", then: Can you create temporary tables for some of these steps? I've split proceses using temp tables, and they are very useful to speed things up.
The count of rows read reported by EXPLAIN is an estimate anyway -- don't take it as a literal value. Notice that if you run EXPLAIN on the same query multiple times, the number of rows read changes each time. This estimate can even be totally inaccurate, as there have been bugs in EXPLAIN from time to time.
Another way to measure query performance is SHOW SESSION STATUS LIKE 'Handler%' as you test the query. This will tell you accurate counts of how many times the SQL layer made requests for individual rows to the storage engine layer. For examples, see my presentation, SQL Query Patterns, Optimized.
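A minimal example of that technique (the query shown is just a placeholder for the one you are testing). The Handler counters are cumulative per session, so reset them first:

    FLUSH STATUS;  -- zero the session counters

    SELECT COUNT(*) FROM orders WHERE customer_id = 42;  -- your query under test

    SHOW SESSION STATUS LIKE 'Handler%';
    -- Handler_read_next, Handler_read_rnd_next, etc. are exact counts of row
    -- requests made to the storage engine, unlike the estimate shown by EXPLAIN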
There's also an issue of whether the rows requested were already in the buffer pool (I'm assuming you use InnoDB), or did the query have to read them from disk, incurring I/O operations. A small number of rows read from disk can be orders of magnitude slower than a large number of rows read from RAM. This doesn't necessarily account for your case, but it points out that such a scenario can occur, and "rows read" doesn't tell you if the query caused I/O or not. There might even be multiple I/O operations for a single row, because of InnoDB's multi-versioning.
Insight into the difference between logical row request vs. physical I/O reads is harder to get. In Percona Server, enhancements to the slow query log include the count of InnoDB I/O operations per query.

MySQL optimization for simple records - what is best?

I am developing a system that will eventually have millions of users. Each user of the system may have access to different 'tabs' in the system. I am tracking this with a table called usertabs. There are two ways to handle this.
Way 1: A single row for each user containing userid and tab1-tab10 as int columns.
The advantage of this system is that the query to get a single row by userid is very fast, while the disadvantage is that the 'empty' columns take up space. Another disadvantage is that when I need to add a new tab, I would have to re-org the entire table, which could be tedious if there are millions of records. But this wouldn't happen very often.
Way 2: A single row contains userid and tabid and that is all. There would be up to 10 rows per user.
The advantage of this system is easy sharding or other mechanisms for optimized storage, and no wasted space. Rows only exist when necessary. The disadvantage is that up to 10 rows must be read every time I access a record. If these rows are scattered, they may be slower to access, or maybe faster, depending on how they were stored?
My programmer side is leaning towards Way 1 while my big data side is leaning towards Way 2.
Which would you choose? Why?
Premature optimization, and all that...
Option 1 may seem "easier", but you've already identified the major downside - extensibility is a huge pain.
I also really doubt that it would be faster than option 2 - databases are pretty much designed specifically to find related bits of data, and finding 10 records rather than 1 record is almost certainly not going to make a difference you can measure.
"Scattered" records don't really matter, the database uses indices to be able to retrieve data really quickly, regardless of their physical location.
This does, of course, depend on using indices for foreign keys, as @Barmar comments.
If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
They don't have to be scattered if you use clustering correctly.
InnoDB tables are always clustered, and if your child table's PK [1] looks similar to {user_id, tab_id} [2], this will automatically store tabs belonging to the same user physically close together, minimizing I/O when querying for "tabs of the given user".
OTOH, if your child PK is: {tab_id, user_id}, this will store users connected to the same tab physically close together, making queries such as: "give me all users connected to given tab" very fast.
Unfortunately MySQL doesn't support leading-edge index compression (a la Oracle), so you'll still pay the storage (and cache) price for repeating all these user_ids (or tab_ids in the second case) in the child table, but despite that, I'd still go for solution (2) for flexibility and (probably) ease of querying.
[1] Which InnoDB automatically uses as the clustering key.
[2] I.e. the user's PK is at the leading edge of the child table's PK.
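A minimal sketch of that second option with the clustering described above (the column types are assumptions):

    CREATE TABLE usertabs (
      user_id INT UNSIGNED NOT NULL,
      tab_id  TINYINT UNSIGNED NOT NULL,
      PRIMARY KEY (user_id, tab_id),      -- clusters all of a user's tabs together
      KEY idx_tab_user (tab_id, user_id)  -- optional: "all users of a given tab"
    ) ENGINE=InnoDB;

    -- "tabs of the given user" reads one small, contiguous slice of the clustered index
    SELECT tab_id FROM usertabs WHERE user_id = 12345;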

MySQL Table Locks

I was asked to write some PHP scripts against a MySQL DB to show some data, when I noticed the strange design they had.
They want to perform a study that would require collecting up to 2000 records per user, and they are automatically creating a new table for each user that registers. It's a pilot study at this stage, so they have around 30 tables, but they should have 3000 users for the real study.
I wanted to suggest gathering all of them in a single table, but since there might be around 1500 INSERTs per minute to that database during the study period, I wanted to ask this question here first. Will that cause table locks in MySQL?
So, is it one table with 1500 INSERTs per minute and a maximum size of 6,000,000 records, or 3000 tables with 30 INSERTs per minute and a maximum size of 2000 records each? I would like to suggest the first option, but I want to be sure that it will not cause any issues.
I read that InnoDB has row-level locks. So, will that give better performance combined with the one-table option?
This is a huge loaded question. In my experience performance is not really measured accurately by table size alone. It comes down to design. Do you have the primary keys and indexes in place? Is it over indexed? That being said, I have also found that almost always one trip to the DB is faster than dozens. How big is the single table (columns)? What kind of data are you saving (larger than 4000K?). It might be that you need to create some prototypes to see what performs best for you. The most I can recommend is that you carefully judge the size of the data you are collecting and allocate accordingly, create indexes (but not too many, don't over index), and test.
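For what it's worth, a hedged sketch of what the single-table option could look like (the column names and types are guesses about the study data). With InnoDB, 1500 single-row INSERTs per minute is only about 25 per second, which is a very light write load, and each insert locks only the row it adds:

    CREATE TABLE study_records (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      user_id     INT UNSIGNED NOT NULL,
      recorded_at DATETIME NOT NULL,
      value       VARCHAR(255) NOT NULL,
      KEY idx_user_time (user_id, recorded_at)  -- keeps per-user lookups fast
    ) ENGINE=InnoDB;

    INSERT INTO study_records (user_id, recorded_at, value)
    VALUES (42, NOW(), 'sample reading');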

Query JOIN or not? (optimization)

I was wondering whether using 2 tables is better than using 1 single table.
Scenario:
I have a simple user table and a simple user_details table. I can JOIN the tables and select records from both.
But I was wondering whether to merge the 2 tables into 1 single table.
What if I have 2 million user records in both tables?
In terms of speed and execution time, is it better to have a single table when selecting records?
You should easily be able to make either scenario perform well with proper indexing. Two million rows is not that many for any modern RDBMS.
However, one table is a better design if rows in the two tables represent the same logical entity. If the user table has a 1:1 relationship with the user_detail table, you should (probably) combine them.
Edit: A few other answers have mentioned de-normalizing--this assumes the relationship between the tables is 1:n (I read your question to mean the relationship was 1:1). If the relationship is indeed 1:n, you absolutely want to keep them as two tables.
Joins themselves are not inherently bad; RDBMSs are designed to perform joins very efficiently, even with millions or hundreds of millions of records. Normalize first before you start to de-normalize, especially if you're new to DB design. You may ultimately end up incurring more overhead maintaining a de-normalized database than you would by using the appropriate joins.
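As a hedged sketch of the two-table variant (column names are invented), this is the indexing that makes such a 1:1 join essentially a pair of primary-key lookups:

    CREATE TABLE user (
      id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE user_details (
      user_id INT UNSIGNED NOT NULL PRIMARY KEY,  -- 1:1, the PK doubles as the FK
      bio     TEXT,
      CONSTRAINT fk_user_details_user FOREIGN KEY (user_id) REFERENCES user (id)
    ) ENGINE=InnoDB;

    -- both sides are keyed on the id, so the join costs two index lookups
    SELECT u.name, d.bio
    FROM user AS u
    JOIN user_details AS d ON d.user_id = u.id
    WHERE u.id = 12345;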
As to your specific question, it's very difficult to advise because we don't really know what's in the tables. I'll throw out some scenarios, and if one matches yours, then great, otherwise, please give us more details.
If there is, and will always be, a one-to-one relationship between user and user_details, then user_details likely contains attributes of the same entity and you can consider combining them.
If the relationship is 1-to-1 and the user_details contains LOTS of data for each user that you don't routinely need when querying, it may be faster to keep that in a separate table. I've seen this often as an optimization to reduce the cost of table scans.
If the relationship is 1-to-many, I'd strongly advise against combining them; you'll soon wish you hadn't (as will those who come after you).
If the schema of user_details varies: I've seen this too, where there is a core table and an additional attribute table with a variable schema. If this is the case, proceed with caution.
To denormalize or not to denormalize, that is the question...
There is no simple, one-size-fits all response to this question. It is a case by case decision.
In this instance, it appears that there is exactly one user_detail record per record in the user table (or possibly either 1 or 0 detail records per user record), so shy of subtle caching concerns, there is really little to no penalty for "denormalizing". (Indeed, in the 1:1 cardinality case, this would effectively be a normalization.)
The difficulty in giving a "definitive" recommendation depends on many factors. In particular (format: I provide a list of questions/parameters to consider and general considerations relevant to these):
what is the frequency of UPDATEs / DELETEs / INSERTs?
what is the ratio of reads (SELECTs) vs. writes (UPDATEs, DELETEs, INSERTs)?
Do the SELECTs usually get all the rows from all the tables, or do we only get a few rows and, more often than not, only select from one table at a given time?
If there are relatively few writes compared with reads, it would be possible to create many indexes, some covering the most common queries, and hence logically re-create, in a more flexible fashion, a sort of two-table (indeed multi-table) setup (see the covering-index sketch below). The downside of too many covering indices is of course that they occupy more disk space (not a big issue these days), but also that they can impede the cache to some extent. Too many indices may also put an undue burden on write operations...
what is the size of a user record? what is the size of a user_detail record?
what is the typical filtering done by a given query? Do the most common queries return only a few rows, or do they yield several thousand records (or more), most of the time?
If either record's average size is "unusually" long, say above 400 bytes, a multi-table approach may be appropriate. After all, and somewhat depending on the type of filtering done by the queries, JOIN operations are typically done very efficiently by MySQL, so there is little penalty in keeping separate tables.
is the cardinality effectively 1:1 or 1:[0,1]?
If that isn't the case, i.e. if we have user records with more than one user_details record, then given the relatively small number of records (2 million) (yes, 2M is small, not tiny, but small, in modern DBMS contexts), denormalization would probably be a bad idea. (A possible exception: cases where we query, several dozen times per second, the same 4 or 5 fields, some from the user table, some from the user_detail table.)
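As an example of the covering-index idea mentioned above (the columns are invented), an index that contains every column a frequent query touches lets MySQL answer it from the index alone, which EXPLAIN reports as "Using index":

    -- frequent query:
    --   SELECT last_login, status FROM user WHERE country = 'DE';
    -- a covering index for it:
    CREATE INDEX idx_country_login_status ON user (country, last_login, status);

    -- EXPLAIN should now show "Using index" in the Extra column
    EXPLAIN SELECT last_login, status FROM user WHERE country = 'DE';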
Bottom lines:
2 Million records is relatively small ==> favor selecting a schema that is driven by the semantics of the records/sub-records rather than addressing, prematurely, performance concerns. If there already are noticeable performance bottlenecks, the issue is probably neither caused by, nor likely to be greatly helped by, schema changes.
if 1:1 or 1:[0-1] cardinality, re-uniting the data in a single table is probably a neutral choice, performance-wise.
if 1:many cardinality, denormalization ideas are probably premature (again given the "small" database size)
read about SQL optimization, the pros and cons of indexes of various types, and ways of limiting the size of the data while still allowing the same fields/semantics to be recorded.
establish baselines, monitor the performance frequently.
Denormalization will generally use up more space while affording better query performance.
Be careful though - cache also matters and having more data effectively "shrinks" your cache! This may or may not wipe-out the theoretical performance benefit of merging two tables into one. As always, benchmark with representative data.
Of course, the more denormalized your data model is, the harder it will be to enforce data consistency. Performance does not matter if data is incorrect!
So, the answer to your question is: "it depends" ;)
The current trend is to denormalize (i.e. put them in the same table). It usually gives better performance, but it is easier to make the data inconsistent (through a programming mistake, that is).
Plan: determine your workload type.
Benchmark: see if the performance gain is worth the risk.