MySQL - pt-online-schema-change effect on db performance

I want to use pt-online-schema-change to change the schema of a big table (~100M records). Does this tool affect the performance of MySQL while it's running?

By design, the tool will have no significant effect on performance. First, let's review what the tool does:
attach triggers to the current table, to copy all updates, deletes and inserts to the new table.
copy existing data over in chunks, partitioned by the key
The first part is going to double all writes and there's no way around this. The second part is a batch operation that is going to potentially lock the current table and use up a lot of IO.
Fortunately, the second part is split into chunks and pt-online-schema-change is quite clever about how big the chunks are and how long it waits between chunks:
it checks slave replication lag between chunks, and pauses if the lag is too great. It is able to discover slaves recursively.
it checks load (typically measured by number of running threads) and pauses if there are too many queries running (which implies lock contention or high CPU/IO usage). it can even abort if the load is extremely high.
it configures InnoDB lock settings such that it is most likely to be the victim of any lock contention, so production queries will run smoothly.
by default, the chunk size is dynamically changed to keep its runtime consistent, using a weighted average of previous chunk runtimes.
chunks that are too big (e.g. due to a huge number of rows with the same key) will be skipped over.
Due to this, it is likely that your server will only be slightly affected by the copy. But of course, there is no guarantee and if possible, you should run the tool on a staging version of the database. In the event of issues, you can safely abort the tool with no loss of data.
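To make the mechanism above more concrete, here is a much-simplified sketch, in plain SQL, of the trigger-plus-chunked-copy approach the tool uses. The table and column names are made up for illustration, and the real tool also handles chunk sizing, throttling, foreign keys and cleanup for you:

    -- Hypothetical original table.
    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        total       DECIMAL(10,2) NOT NULL
    ) ENGINE=InnoDB;

    -- 1. Create a copy of the table with the desired schema change.
    CREATE TABLE _orders_new LIKE orders;
    ALTER TABLE _orders_new ADD COLUMN note VARCHAR(255) NULL;

    -- 2. Triggers duplicate ongoing writes onto the copy while the backfill runs
    --    (the tool creates AFTER INSERT/UPDATE/DELETE triggers; only one shown here).
    CREATE TRIGGER orders_osc_ins AFTER INSERT ON orders FOR EACH ROW
        REPLACE INTO _orders_new (id, customer_id, total)
        VALUES (NEW.id, NEW.customer_id, NEW.total);

    -- 3. Existing rows are copied over in primary-key chunks, for example:
    INSERT IGNORE INTO _orders_new (id, customer_id, total)
    SELECT id, customer_id, total
    FROM orders
    WHERE id BETWEEN 1 AND 10000;    -- next chunk: 10001..20000, and so on

    -- 4. When the backfill is done, the tables are swapped atomically:
    RENAME TABLE orders TO _orders_old, _orders_new TO orders;

The chunked copies in step 3 are where the throttling described above (replication lag, thread count, dynamic chunk size) comes into play.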

The online schema change is pretty intense: while it runs, all inserts/updates/deletes to the original table are doubled (the tool adds triggers so these actions are duplicated on the new copy), and, piece by piece, every row of the original table is copied over.
If you're concerned about what effect this may have on your database, check out the doc: http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html
There are several ways you can have the tool throttle itself, including checking thread count, checking replication lag, and modifying how large each chunk is and how long each chunk should take.

It will read and write stuff to disk, consume memory and use CPU, so yes, it will affect performance while running. How could it be otherwise?


Does MySQL stall the whole cluster during DDL statements?

Recently, I read that a Galera-based MySQL cluster uses a concept called total order isolation (https://galeracluster.com/library/documentation/schema-upgrades.html#toi) for DDLs by default, which stalls writes on the whole cluster until the statement is committed on all the nodes.
How does MySQL handle DDL in native asynchronous replication?
Does it stall writes for the other schemas as well?
Native Replication sticks the DDL into the replication stream. When the command pops up in the Slave, it executes the DDL before moving on to other queries in the replication stream.
Caveat: The above statement assumes the old flavor of replication, without multi-master setups or multiple replication threads. Regardless of this caveat, the table being modified is blocked on the Slave just as it was on the Master.
Galera's TOI goes to some extra effort to make sure all the nodes are in sync, even accounting for the DDL versus ordinary writes. Hence the name TOI, "Total Order Isolation".
Galera's RSU is, in many cases, a viable alternative. It is no more invasive than a crash of each node, one at a time (hence "Rolling"). Assuming connections can fail over to different nodes, RSU avoids blocking the rest of the cluster.
Still, you should make a conscious choice between RSU and TOI; there are use cases for dictating one versus the other.
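If you do go the RSU route, the switch is just a session variable that you flip on each node in turn; a minimal sketch (hypothetical table name) of what gets run per node:

    -- Run this on one node at a time (hence "Rolling"); while in RSU mode the
    -- DDL is applied locally only and is NOT replicated to the other nodes.
    SET SESSION wsrep_OSU_method = 'RSU';

    ALTER TABLE orders ADD COLUMN note VARCHAR(255) NULL;

    SET SESSION wsrep_OSU_method = 'TOI';   -- back to the default

    -- Repeat on every other node before the application starts relying on
    -- the new column.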
In a distributed system (multiple nodes, multiple clients, etc), pushing code gets tricky. I like to take this approach, even though it leads to perhaps 3 times as many pushes:
Push application code that discovers whether the database change has been pushed. Have the code work with either the old schema or the new one (see the sketch after this list). Do this "push" in a "rolling" manner.
Push the new schema (e.g. CREATE/ALTER TABLE).
Clean up the code. (Again, "roll" it out to the many clients.)
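One simple way to implement the first step is to probe information_schema when the application starts (or before each deploy) and branch on the result; the schema, table and column names below are made up for illustration:

    -- Returns 1 once the new column exists, 0 before the schema push.
    SELECT COUNT(*) AS has_note_column
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = 'shop'
      AND TABLE_NAME   = 'orders'
      AND COLUMN_NAME  = 'note';

The application code can then take the old-schema path while this returns 0 and switch to the new column once it returns 1.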

Why is the import process very slow in Solr 5.3.x?

I'm using Solr 5.3.1's DataImportHandler to import IMDB data that I have loaded into MySQL.
However, it takes a couple of seconds, sometimes even minutes, to get one document processed. My table contains 10M+ rows, so this is going to take forever. I have materialized all the data, and it only takes a few minutes for MySQL to process all the rows.
What could have caused this poor performance?
@yangrui
Unfortunately, there is no single answer to your question of why indexing is slow. 24G is a lot of heap, but depending on the actual size of your index it may or may not be enough.
Modifying the commit policy should also help in case you are committing too frequently. Solr does a lot of its magic of making documents available for searches when a 'commit' / 'autocommit' happens. However, when a commit does happen, it is a resource-hungry operation.
One other thing that is not obvious is the actual unallocated RAM available on the server. By unallocated I mean additional RAM on the server apart from the RAM that is associated with the JVM as Heap.
I suggest going through this documentation https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
I suspect that you may not have enough RAM on your machine.
Hope this helps.

Very frequent couchbase document updates

I'm new to Couchbase and was wondering if very frequent updates to a single document (possibly every second) will cause all updates to pass through the disk write queue, or only the last update made to the document?
In other words, does Couchbase optimize disk writes by only writing the document to disk once, even if it is updated multiple times between writes?
Based on the docs, http://docs.couchbase.com/admin/admin/Monitoring/monitor-diskqueue.html, it sounds like all updates are processed. If anyone can confirm this, I'd be grateful.
thanks
Updates are held in a disk queue before being written to disk. If a write to a document occurs and a previous write is still in the disk queue, then the two writes will be coalesced, and only the more recent version will actually be written to disk.
Exactly how fast the disk queue drains will depend on the storage subsystem, so whether writes to the same key get coalesced will depend on how quick the writes come in compared to the storage subsystem speed / node load.
Jako, you should worry more about updates happening in the millisecond time frame, or more than one update happening in the same millisecond. The disk write isn't the problem; Couchbase solves that intelligently itself. The real issue is that you will run into concurrency issues when you operate in the millisecond time frame.
I ran into them fairly easily when I tested my application and at first couldn't understand why Node.js (in my case) would sometimes write data to Couchbase and sometimes not, usually failing for the first record.
More problems arose when I first checked whether a document with a specific key existed and, if it did not, tried to write it to Couchbase, only to find out that in the meantime an earlier callback had finished and there was now indeed a document for that key.
In those cases you have to use the CAS value and program the update iteratively, so that your app keeps fetching the current document for that key and retries the update until it succeeds. Keep this in mind, especially when running tests where updates to the same document are being made!

Is it a good idea to wrap a data migration into a single transaction scope?

I'm doing a data migration at the moment of a subset of data from one database into another.
I'm writing a .net application that is going to communicate with our in house ORM which will drag data from the source database to the target database.
I was wondering: is it feasible, or even a good idea, to put the entire process into a transaction scope and then, if there are no problems, commit it?
I'd say I'd be moving possibly about 1 GB of data across.
Performance is not a problem, but is there a limit on how much modified or new data can be inside a transaction scope?
There's no limit other than the physical size of the log file (note that the size required will be much more than the size of the migrated data). Also bear in mind that if there is an error and you roll back the transaction, that rollback may take a very, very long time.
If the original database is relatively small (< 10 gigs) then I would just make a backup and run the migration non-logged without a transaction.
If there are any issues just restore from back-up.
(I am assuming that you can take the database offline for this - doing migrations when live is a whole other ball of wax...)
If you need to do it while live then doing it in small batches within a transaction is the only way to go.
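As an illustration of the small-batches approach, here is a sketch assuming SQL Server (the thread mentions SSIS, so that seems likely) and hypothetical table/column names, with both databases reachable from the same instance. Each batch commits on its own, so an interruption loses at most one batch and the loop can simply be restarted:

    DECLARE @batch INT = 10000;
    DECLARE @rows  INT = 1;

    WHILE @rows > 0
    BEGIN
        BEGIN TRANSACTION;

        -- Copy the next batch of rows that are not in the target yet.
        INSERT INTO TargetDb.dbo.Customers (CustomerId, Name, CreatedAt)
        SELECT TOP (@batch) s.CustomerId, s.Name, s.CreatedAt
        FROM SourceDb.dbo.Customers AS s
        WHERE NOT EXISTS (SELECT 1
                          FROM TargetDb.dbo.Customers AS t
                          WHERE t.CustomerId = s.CustomerId)
        ORDER BY s.CustomerId;

        SET @rows = @@ROWCOUNT;

        COMMIT TRANSACTION;
    END;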
I assume you are copying data between different servers.
In answer to your question, there is no limit as such. However there are limiting factors which will affect whether this is a good idea. The primary one is locking and lock contention. I.e.:
If the server is in use for other queries, your long-running transaction will probably lock other users out.
Whereas, if the server is not in use, you don't need a transaction.
Other suggestions:
Consider writing the code so that it is incremental and interruptible, i.e. it does a bit at a time and will carry on from wherever it left off. This will involve lots of small transactions.
Consider loading the data into a temporary or staging table within the target database, then use a transaction when updating from that source, using a stored procedure or SQL batch (see the sketch after this list). You should not have too much trouble putting that into a transaction because, being on the same server, it should be much, much quicker.
Also consider SSIS as an option. Actually, I know nothing about SSIS, but it is supposed to be good at this kind of stuff.
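To sketch the staging-table suggestion (again assuming SQL Server and hypothetical table/column names, with the target table dbo.Customers already in place): the bulk load into the staging table happens outside any big transaction, and only the final set-based merge runs inside one, which is quick because it is entirely server-local:

    CREATE TABLE dbo.Customers_Staging (
        CustomerId INT           NOT NULL PRIMARY KEY,
        Name       NVARCHAR(200) NOT NULL,
        CreatedAt  DATETIME2     NOT NULL
    );

    -- (bulk load Customers_Staging here with bcp, BULK INSERT, SSIS or the .NET app)

    BEGIN TRANSACTION;

    -- Update rows that already exist in the target...
    UPDATE t
    SET    t.Name      = s.Name,
           t.CreatedAt = s.CreatedAt
    FROM   dbo.Customers AS t
    JOIN   dbo.Customers_Staging AS s ON s.CustomerId = t.CustomerId;

    -- ...and insert the ones that don't.
    INSERT INTO dbo.Customers (CustomerId, Name, CreatedAt)
    SELECT s.CustomerId, s.Name, s.CreatedAt
    FROM   dbo.Customers_Staging AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customers AS t
                       WHERE t.CustomerId = s.CustomerId);

    COMMIT TRANSACTION;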

Best storage engine for constantly changing data

I currently have an application that is using 130 MySQL tables, all with the MyISAM storage engine. Every table has multiple queries every second, including select/insert/update/delete queries, so the data and the indexes are constantly changing.
The problem I am facing is that the hard drive is unable to cope, with wait times of 6+ seconds for I/O access given the number of reads/writes being done by MySQL.
I was thinking of changing to just one table and making it memory-based. I've never used a MEMORY table for something with so many queries though, so I am wondering if anyone can give me any feedback on whether it would be the right thing to do?
One possibility is that there may be other issues causing performance problems - 6 seconds seems excessive for CRUD operations, even on a complex database. Bear in mind that (back in the day) ArsDigita could handle 30 hits per second on a two-way Sun Ultra 2 (IIRC) with fairly modest disk configuration. A modern low-mid range server with a sensible disk layout and appropriate tuning should be able to cope with quite a substantial workload.
Are you missing an index? - check the query plans of the slow queries for table scans where there shouldn't be any.
What is the disk layout on the server? - do you need to upgrade your hardware or fix some disk configuration issues (e.g. not enough disks, logs on the same volume as data).
As the other poster suggests, you might want to use InnoDB on the heavily written tables (see the sketch at the end of this answer).
Check the setup for memory usage on the database server. You may want to configure more cache.
Edit: Database logs should live on quiet disks of their own. They use a sequential access pattern with many small sequential writes. Where they share disks with a random-access workload like data files, the random disk access creates a big system performance bottleneck on the logs. Note that this is write traffic that needs to be completed (i.e. written to physical disk), so caching does not help with this.
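To make the index and storage-engine suggestions concrete, a quick sketch against a hypothetical table:

    -- 1. Check a slow query's plan for full table scans.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    -- If the plan shows type=ALL (a full scan), add the missing index:
    ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);

    -- 2. Convert the heavily written tables to InnoDB for row-level locking
    --    instead of MyISAM's table locks (this rebuilds the table, so test on
    --    a copy and schedule it appropriately):
    ALTER TABLE orders ENGINE=InnoDB;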
I've now changed to a MEMORY table and everything is much better. In fact I now have extra spare resources on the server allowing for further expansion of operations.
Is there a specific reason you aren't using InnoDB? It may perform better due to caching and a different concurrency model. It will likely require more tuning, but can give much better results.
should-you-move-from-myisam-to-innodb
I think that your database structure is very wrong and needs to be optimised; this has nothing to do with the storage engine.