I'm currently taking the course "Performance Evaluation" at university, and we're now doing an assignment where we are testing the CPU usage on a PHP and MySQL-database server. We use httperf to create custom traffic, and vmstat to track the server load. We are running 3000 connections to the PHP-server, for both INSERT and DELETE (run separately).
The numbers show that the DELETE operation is a lot more CPU intensive than INSERT, and I'm just wondering why.
I initially thought INSERT required more CPU usage, as indexes would need to be recreated, data needed to be written to disk, etc. But obviously I'm wrong, and I'm wondering if anyone can tell me the technical reason for this.
At least with InnoDB (and I hope they have you on this), you have more operations even with no foreign keys. An insert is roughly this:
Insert row
Mark in binary log buffer
Mark commit
Deletions do the following:
Mark row removed (taking the same hit as an insertion -- page is rewritten)
Mark in binary log buffer
Mark committed
Actually go remove the row (taking the same hit as an insertion -- page is rewritten)
Purge thread tracks deletions in binary log buffer too.
So you've got twice the work going on to delete rather than insert. A delete requires those two writes because the row must be marked as removed for all versions going forward, but it can only be physically removed once no transactions remain that can still see it. Because InnoDB only writes full blocks to disk, the modification penalty for a block is constant.
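To make the "two writes per delete" point concrete, here is a toy Python model. It is only a sketch of the idea described above, not InnoDB's real data structures: the class `ToyTable`, its `page_writes` counter and the snapshot ids are all invented for illustration.

```python
class ToyTable:
    """Toy model of delete-marking plus a later purge (not real InnoDB)."""

    def __init__(self):
        self.rows = {}                 # key -> {"value": ..., "deleted_at": txn id or None}
        self.page_writes = 0           # simulated page rewrites
        self.next_txn = 1              # monotonically increasing id
        self.active_snapshots = set()  # snapshot ids of readers still open

    def begin_read(self):
        """Open a reader and return its snapshot id."""
        snap = self.next_txn
        self.next_txn += 1
        self.active_snapshots.add(snap)
        return snap

    def end_read(self, snap):
        self.active_snapshots.discard(snap)

    def insert(self, key, value):
        self.rows[key] = {"value": value, "deleted_at": None}
        self.page_writes += 1          # one write: the row goes into its page
        self.next_txn += 1

    def delete(self, key):
        self.rows[key]["deleted_at"] = self.next_txn
        self.page_writes += 1          # first write: the row is delete-marked
        self.next_txn += 1

    def purge(self):
        """Physically remove delete-marked rows that no open reader can still see."""
        oldest = min(self.active_snapshots, default=self.next_txn)
        doomed = [k for k, r in self.rows.items()
                  if r["deleted_at"] is not None and r["deleted_at"] < oldest]
        for key in doomed:
            del self.rows[key]
            self.page_writes += 1      # second write: the row is really removed
```

In this model an insert costs one simulated page write, while a delete costs one write when it is marked and another when the purge finally runs, and the purge has to wait until every reader that started before the delete has finished.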
DELETE also requires data to be written to disk, plus recalculation of indexes, and in addition, a set of logical comparisons to find the record(s) you are trying to delete in the first place.
Delete requires more logic than you think; how much so depends on the structure of the schema.
In almost all cases, when deleting a record, the server must check for any dependencies upon that record as a foreign key reference. That, in a nutshell, is a query of the system tables looking for table definitions with a foreign key ref to this table, then a select of each of those tables for records referencing the record to be deleted. Right there you've increased the computational time by a couple orders of magnitude, regardless of whether the server does cascading deletes or just throws back an error.
Self-balancing internal data structures would also have to be reorganized, and indexes would have to be updated to remove any now-empty branches of the index trees, but these would have counterparts in the Insert operations.
I'm trying to figure out how multiple indexes are actually affecting insertion performance for MySQL InnoDB tables.
Is it possible to get information about index update times using performance_schema?
It seems like there are no instruments for stages that may reflect such information.
Even if there is something in performance_schema, it would be incomplete.
Non-UNIQUE secondary indexes are handled thus:
An INSERT starts.
Any UNIQUE indexes (including the PRIMARY KEY) are immediately checked for "dup key".
Other index changes are put into the "change buffer".
The INSERT returns to the client.
The Change Buffer is a portion of the buffer_pool (default: 25%) where such index modifications are held. Eventually, they will be batched up for updating the actual blocks of the index's BTree.
In a good situation, many index updates will be combined into very few read-modify-write steps to update a block. In a poor case, each index update requires a separate read and write.
The I/O for the change buffer is done 'in the background' as is the eventual write of any changes to data blocks. These cannot be realistically monitored in any way -- especially if there are different clients with different queries contributing to the same index or data blocks being updated.
Oh, meanwhile, any index lookups need to look both in the on-disk (or cached in buffer_pool) blocks and the change buffer. This makes an index lookup faster or slower, depending on various things unrelated to the operation in hand.
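The "good situation" versus "poor case" above can be sketched with a toy Python model. This is not how InnoDB is implemented; `ENTRIES_PER_PAGE`, `page_of` and the two counting functions are hypothetical names, and the model simply assumes one read-modify-write per touched page.

```python
from collections import defaultdict

ENTRIES_PER_PAGE = 100  # hypothetical number of index entries per leaf page


def page_of(key):
    """Map an integer index key to a toy leaf-page number."""
    return key // ENTRIES_PER_PAGE


def unbuffered_io(keys):
    """Every index change does its own read-modify-write of a page."""
    return len(keys)


def buffered_io(keys):
    """Changes are held back, grouped by page, and merged once per page."""
    pending = defaultdict(list)
    for key in keys:
        pending[page_of(key)].append(key)
    return len(pending)  # one read-modify-write per distinct page
```

With 1000 keys that fall into neighbouring pages, the buffered count drops to 10 page updates; with 1000 keys that each land on a different page, buffering saves nothing, which is the poor case described above.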
I have a system which has a complex primary key for interfacing with external systems, and a fast, small opaque primary key for internal use. For example: the external key might be a compound value - something like (given name (varchar), family name (varchar), zip code (char)) and the internal key would be an integer ("customer ID").
When I receive an incoming request with the external key, I need to look up the internal key - and here's the tricky part - allocate a new internal key if I don't already have one for the given external ID.
Obviously if I have only one client talking to the database at a time, this is fine. SELECT customer_id FROM customers WHERE given_name = 'foo' AND ..., then INSERT INTO customers VALUES (...) if I don't find a value. But, if there are potentially many requests coming in from external systems concurrently, and many may arrive for a previously unheard-of customer all at once, there is a race condition where multiple clients may try to INSERT the new row.
If I were modifying an existing row, that would be easy; simply SELECT FOR UPDATE first, to acquire the appropriate row-level lock, before doing an UPDATE. But in this case, I don't have a row that I can lock, because the row doesn't exist yet!
I've come up with several solutions so far, but each of them has some pretty significant issues:
Catch the error on INSERT, re-try the entire transaction from the top. This is a problem if the transaction involves a dozen customers, especially if the incoming data is potentially talking about the same customers in a different order each time. It's possible to get stuck in mutually recursive deadlock loops, where the conflict occurs on a different customer each time. You can mitigate this with an exponential wait time between re-try attempts, but this is a slow and expensive way to deal with conflicts. Also, this complicates the application code quite a bit as everything needs to be restartable.
Use savepoints. Start a savepoint before the SELECT, catch the error on INSERT, and then roll back to the savepoint and SELECT again. Savepoints aren't completely portable, and their semantics and capabilities differ slightly and subtly between databases; the biggest difference I've noticed is that, sometimes they seem to nest and sometimes they don't, so it would be nice if I could avoid them. This is only a vague impression though - is it inaccurate? Are savepoints standardized, or at least practically consistent? Also, savepoints make it difficult to do things in parallel on the same transaction, because you might not be able to tell exactly how much work you'll be rolling back, although I realize I might just need to live with that.
Acquire some global lock, like a table-level lock using a LOCK statement (available in Oracle, MySQL and Postgres). This obviously slows down these operations and results in a lot of lock contention, so I'd prefer to avoid it.
Acquire a more fine-grained, but database-specific lock. I'm only familiar with Postgres's way of doing this, which is very definitely not supported in other databases (the functions even start with "pg_") so again it's a portability issue. Also, postgres's way of doing this would require me to convert the key into a pair of integers somehow, which it may not neatly fit into. Is there a nicer way to acquire locks for hypothetical objects?
It seems to me that this has got to be a common concurrency problem with databases but I haven't managed to find a lot of resources on it; possibly just because I don't know the canonical phrasing. Is it possible to do this with some simple extra bit of syntax, in any of the tagged databases?
I'm not clear on why you can't use INSERT IGNORE, which will run without error and you can check if an insert occurred (modified records). If the insert "fails", then you know the key already exists and you can do a SELECT. You could do the INSERT first, then the SELECT.
Alternatively, if you are using MySQL, use InnoDB which supports transactions. That would make it easier to rollback.
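A minimal sketch of the insert-first, select-second idea, assuming a unique constraint on the external key. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE. The helper names `make_demo_db` and `get_or_create_customer` are made up for the example, and the column names follow the question.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with a UNIQUE constraint on the external key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id INTEGER PRIMARY KEY,"
        " given_name TEXT NOT NULL,"
        " family_name TEXT NOT NULL,"
        " zip_code TEXT NOT NULL,"
        " UNIQUE (given_name, family_name, zip_code))"
    )
    return conn


def get_or_create_customer(conn, given_name, family_name, zip_code):
    """Return the internal customer_id for an external key, creating it if needed."""
    # Thanks to the UNIQUE constraint, this insert is a no-op if the key exists.
    conn.execute(
        "INSERT OR IGNORE INTO customers (given_name, family_name, zip_code) "
        "VALUES (?, ?, ?)",
        (given_name, family_name, zip_code),
    )
    # Whether or not the insert happened, the row exists now, so just read it.
    row = conn.execute(
        "SELECT customer_id FROM customers "
        "WHERE given_name = ? AND family_name = ? AND zip_code = ?",
        (given_name, family_name, zip_code),
    ).fetchone()
    return row[0]
```

Calling `get_or_create_customer` twice with the same external key returns the same internal id both times, and the caller never has to catch a duplicate-key error.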
Perform each customer's "lookup or maybe create" operations in autocommit mode, prior to and outside of the main, multi-customer transaction.
WRT generating an opaque primary key, there are a number of options, e.g. use a GUID or (at least with Oracle) a sequence. WRT ensuring the external key is unique, apply a unique constraint on the column(s). If the insert fails because the key exists, reattempt the fetch. You can also use an INSERT with WHERE NOT EXISTS or WHERE NOT IN. Use a stored procedure to reduce the round trips and improve performance.
I have a table with a large amount of data. The data needs to be updated frequently: delete old data and add new data. I have two options:
Whenever there is a deletion event, I delete the entry immediately.
I mark the entries as deleted and use a cron job to delete them at off-peak times.
Is there any efficiency difference between the two options? Or is there a better solution?
Both delete and update can have triggers, this may affect performance (check if that's your case).
Updating a row is usually faster than deleting (due to indexes etc.)
However, in deleting a single row in an operation, the performance impact shouldn't be that big. If your measurements show that the database spends significant time deleting rows, then your mark-and-sweep approach might help. The key word here is probably measured - unless the single deletes are significantly slower than updates, I wouldn't bother.
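For reference, the mark-and-sweep approach can be sketched as follows. This uses Python's sqlite3 module purely as a runnable stand-in for MySQL; the table `entries`, the flag column `is_deleted`, and the function names are assumptions made for the example.

```python
import sqlite3


def make_entries_db(n):
    """Create an in-memory table with n live rows (ids 1..n)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        " id INTEGER PRIMARY KEY,"
        " is_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO entries (id) VALUES (?)",
                     [(i,) for i in range(1, n + 1)])
    conn.commit()
    return conn


def mark_deleted(conn, entry_id):
    """Request path: flag the row instead of deleting it."""
    conn.execute("UPDATE entries SET is_deleted = 1 WHERE id = ?", (entry_id,))
    conn.commit()


def sweep(conn, batch_size=1000):
    """Off-peak job: physically delete flagged rows in small batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM entries WHERE id IN ("
            " SELECT id FROM entries WHERE is_deleted = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()              # short transactions keep locks brief
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Readers then need `WHERE is_deleted = 0` in their queries, which is part of the cost of this approach and another reason to measure before adopting it.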
You should use low_priority_updates - http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_low_priority_updates. This will give higher priority to your selects than insert/delete/update operations. Note that it only affects storage engines that use table-level locking (such as MyISAM, MEMORY and MERGE). I used this in production and got a pretty decent speed improvement. The only problem I see with it is that you will lose more data in case of a crashing server.
With MySQL, deletes are simply marked for deletion internally, and when the CPU is (nearly) idle, MySQL then updates the indexes.
Still, if this is a problem, and you are deleting many rows, consider using DELETE QUICK. This applies to MyISAM tables, not InnoDB: it tells the storage engine not to merge index leaves during the delete, so the freed index entries are left in place and can be reused.
To recover the unused index space, simply OPTIMIZE TABLE nightly.
In this case, there's no need to implement the functionality in your application that MySQL will do internally.
One portion of my site requires a bulk insert, and it takes around 40 minutes for InnoDB to load that file into the database. I have been digging around the web and found a few things:
innodb_autoinc_lock_mode=2 (it won't generate consecutive keys)
UNIQUE_CHECKS=0; (disable unique key checks)
FOREIGN_KEY_CHECKS=0 (disable foreign key checks)
--log_bin=OFF turn off binary log used for replication
Problem
I want to set the first 3 options for just one session, i.e. during the bulk insert. The first option does not work: MySQL says unknown system variable 'innodb_autoinc_lock_mode'. I am using MySQL 5.0.4.
The last option, I would like to turn it off but I am wondering what if I need replication later will it just start working if I turn it on again?
Suggestions
Any other suggestions on how to improve bulk inserts/updates for the InnoDB engine? Or please comment on my findings.
Thanks
Assuming you are loading the data in a single or few transactions, most of the time is likely to be spent building indexes (depending on the schema of the table).
Do not do a large number of small inserts with autocommit enabled, that will destroy performance with syncs for each commit.
If your table is bigger than (or nearly as big as) the InnoDB buffer pool, you are in trouble; a table which can't fit in RAM with its indexes cannot be inserted into efficiently, as it will have to do READS to insert. This is so that existing index blocks can be updated.
Remember that disc writes are ok (they are mostly sequential, and you have a battery-backed raid controller, right?), but reads are slow and need to be avoided.
In summary
Do the insert in a small number of big-ish transactions, say 10k-100k rows or so each. Don't make the transactions too big or you'll exhaust the logs.
Get enough ram that your table fits in memory; set the innodb buffer pool appropriately (You are running x86_64, right?)
Don't worry about the operation taking a long time, as due to MVCC, your app will be able to operate on the previous versions of the rows assuming it's only reading.
Don't make any of the optimisations listed above, they're probably a waste of time (don't take my word for it - benchmark the operation on a test system in your lab with/without those).
Turning unique checks off is actively dangerous as you'll end up with broken data.
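The "small number of big-ish transactions" advice can be sketched like this. Python's sqlite3 module is used only as a runnable stand-in for MySQL; the table `items`, its columns and the helper names are hypothetical, and the batch size is just the lower end of the range suggested above.

```python
import sqlite3
from itertools import islice


def make_items_db():
    """Create an in-memory target table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val INTEGER)")
    return conn


def insert_in_batches(conn, rows, batch_size=10_000):
    """Insert rows, committing once per batch instead of once per row.

    Returns the number of commits performed.
    """
    it = iter(rows)
    commits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return commits
        conn.executemany("INSERT INTO items (id, val) VALUES (?, ?)", batch)
        conn.commit()              # one sync per batch, not one per row
        commits += 1
```

Because `rows` can be any iterable, including a generator that reads the input file, only one batch is held in memory at a time.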
To answer the last part of your question, no it won't just start working again; if the inserts are not replicated but subsequent updates are, the result will not be a pretty sight. Disabling foreign and unique keys should be OK, provided you re-enable them afterwards, and deal with any constraint violations.
How often do you have to do this? Can you load smaller datasets more frequently?
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on simple blank table (4 integer columns with 1 primary key). Setup as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely-decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding it is when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, then use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row-by-row?
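A rough sketch of that staging-table flow, using Python's sqlite3 module as a runnable stand-in rather than SQL Server or MySQL. The table and column names (`prod`, `staging`, `deduped`, `hash_id`, `val`) are invented for the example. It also makes one assumption beyond the description above: duplicates are collapsed by summing rather than deleted, to mirror the val=val+whatever update in the question.

```python
import sqlite3


def make_prod_db():
    """Create an in-memory 'production' table keyed by hash_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prod (hash_id INTEGER PRIMARY KEY, val INTEGER NOT NULL)"
    )
    return conn


def load_via_staging(conn, raw_rows):
    """Bulk-load raw (hash_id, amount) rows through an unconstrained staging table."""
    conn.execute("CREATE TEMP TABLE staging (hash_id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", raw_rows)
    # Collapse duplicates in staging, away from the production table.
    conn.execute(
        "CREATE TEMP TABLE deduped AS "
        "SELECT hash_id, SUM(amount) AS amount FROM staging GROUP BY hash_id"
    )
    # Update the rows that already exist in production...
    conn.execute(
        "UPDATE prod SET val = val + "
        "(SELECT amount FROM deduped WHERE deduped.hash_id = prod.hash_id) "
        "WHERE hash_id IN (SELECT hash_id FROM deduped)"
    )
    # ...then insert the ones that do not (the order matters: inserting first
    # would cause the new rows to be updated as well and counted twice).
    conn.execute(
        "INSERT INTO prod (hash_id, val) "
        "SELECT hash_id, amount FROM deduped "
        "WHERE hash_id NOT IN (SELECT hash_id FROM prod)"
    )
    conn.execute("DROP TABLE staging")
    conn.execute("DROP TABLE deduped")
    conn.commit()
```

The production table is only touched by the final UPDATE and INSERT, which is where the "less than 15 minutes" of production impact in the answer above would come from.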