Copy some data from a database and keep referential integrity - mysql

The requirement is to extract some data from an active database (159 tables at the moment) into another database, such that the copied data has full referential integrity even though the source data is in flux (it is a live database). This is not about dumping the entire database (approaching 50GB); it is about extracting a set of rows we have identified into a separate database.
We currently create a new DB based upon our initial schema and subsequent DDL migrations and repeatables (views, stored procedures, etc.), and then
copy the appropriate rows. This normally takes more than 10 minutes, but less than 1 hour, depending upon the size of the set to be extracted.
Is there a way to tell MySQL that I want to ignore any transactions committed after I start running the extract (new rows added, rows deleted, or rows updated), while every other connection to the database carries on working as normal, as if I wasn't making any requests?
What I don't want to have happen is I copy data from table 1 and by the time I get to table 159, table 1 has changed and a row in table 159 refers to that new row in table 1.

Use mysqldump --single-transaction. This starts a repeatable-read transaction before it starts dumping data, so any concurrent transactions that happen while you are dumping data don't affect the data dumped by your transaction.
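For example (host, user, and database/table names here are placeholders; mysqldump also accepts a --where option if you only need matching rows):
mysqldump --single-transaction --host=db-host --user=extract_user -p mydb table1 table159 > extract.sql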
Re your updated question:
You can do your own custom queries in a transaction.
Start a transaction in repeatable-read mode before you begin running queries for your extraction. You can run many queries against many tables, and all the data you extract will reflect exactly what was committed as of the moment you started that transaction.
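A minimal sketch of that pattern (table names and WHERE conditions are placeholders):
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- every query below sees the data exactly as it was committed when the snapshot was taken
SELECT * FROM table_1   WHERE id IN (1, 2, 3);
SELECT * FROM table_159 WHERE parent_id IN (1, 2, 3);
COMMIT;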
You might like to read https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html

Related

Table Renaming in an Explicit Transaction

I am extracting a subset of data from a backend system to load into a SQL table for querying by a number of local systems. I do not expect the dataset to ever be very large - no more than a few thousand records. The extract will run every two minutes on a SQL2008 server. The local systems are in use 24 x 7.
In my prototype, I extract the data into a staging table, then drop the live table and rename the staging table to become the live table in an explicit transaction.
-- build the staging table from the backend extract
SELECT fieldlist
INTO Temp_MyTable_Staging
FROM FOOBAR;

-- swap the staging table in as the live table
BEGIN TRANSACTION;
IF (OBJECT_ID('dbo.MyTable') IS NOT NULL)
    DROP TABLE dbo.MyTable;
EXECUTE sp_rename N'dbo.Temp_MyTable_Staging', N'MyTable';
COMMIT;
I have found lots of posts on the theory of transactions and locks, but none that explain what actually happens if a scheduled job tries to query the table in the few milliseconds while the drop/rename executes. Does the scheduled job just wait a few moments, or does it terminate?
Conversely, what happens if the rename starts while a scheduled job is selecting from the live table? Does the rename transaction fail to get a lock and therefore terminate?

Handling bulk insert of huge data

I have some data in csv files. The volume of the data is huge (around 65GB). I want to insert them all in a database so that later they can be queried.
The csv file itself is pretty simple, it has only 5 columns. So basically all the data will be inserted into a single table.
Now I have tried to insert this data into a MySQL database, but it is taking a very long time: I have spent almost 6 hours inserting just 1.3GB of that data (my processor is a Core i5 at 2.9 GHz, with 4GB of DDR3 RAM).
This loading needs to finish quickly: all of the data should be inserted within 4-5 days.
Which database will show the best performance in this case, provided that a reasonable query speed is acceptable on the data?
Also, are there any other steps or practices that I should follow?
You probably don't even need to import it. You can create a table with ENGINE=CSV.
mysql> create table mycsv(id int not null) engine=csv;
Query OK, 0 rows affected (0.02 sec)
Then go into your data directory, remove mycsv.CSV, and move/copy/symlink your CSV file into place as mycsv.CSV. Go back to MySQL, run FLUSH TABLES;, and you're good to go. (Note: this may not work with \r\n line endings, so you may need to convert them to \n first.)
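A slightly fuller sketch matching the question's five-column file (column names and types are guesses; note the CSV engine allows neither NULL columns nor indexes):
CREATE TABLE mycsv (
  col1 INT NOT NULL,
  col2 INT NOT NULL,
  col3 VARCHAR(64) NOT NULL,
  col4 VARCHAR(64) NOT NULL,
  col5 DECIMAL(10,2) NOT NULL
) ENGINE=CSV;
-- after replacing mycsv.CSV in the data directory with the real file:
FLUSH TABLES;
SELECT COUNT(*) FROM mycsv;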
If you are using InnoDB, the problem is that it has to keep an undo log entry for every row inserted, and that takes a lot of resources and a long time. It is better to load in smaller batches so that most of the undo log tracking can stay in memory. The undo log exists so the load can be rolled back if you Ctrl-C it in the middle; once a batch has been committed, InnoDB no longer needs to track it. If you do it all in one transaction, it has to keep all of those undo log entries, probably spilling to disk -- and that's a killer.
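If the data does need to land in InnoDB, a sketch of that batched approach (file paths and the table name are placeholders; the file must be readable by the server and allowed by secure_file_priv):
-- assumes the 65GB file has been split into smaller pieces beforehand
SET autocommit = 0;
SET unique_checks = 0;        -- optional speed-up for the bulk load; re-enable afterwards
SET foreign_key_checks = 0;
LOAD DATA INFILE '/data/chunks/part_0001.csv'
  INTO TABLE mytable
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';
COMMIT;                       -- each commit lets InnoDB discard that chunk's undo records
-- repeat LOAD DATA INFILE + COMMIT for each remaining chunk
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;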
I prefer to use MyISAM for data if I know I won't need row-level locking, for example when I want to run one long program to analyze the data. The table is locked, but I only need one program running on it. Plus, you can always use MERGE tables: they take MyISAM tables and group them together into one table. I like doing this for log files, where each table is a month of data and a MERGE table covers the year. The MERGE table doesn't copy the data; it just points to each of the underlying MyISAM tables.
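A sketch of that MERGE-table pattern (table and column names are made up):
CREATE TABLE log_2014_01 (
  logged_at DATETIME NOT NULL,
  message   VARCHAR(255) NOT NULL
) ENGINE=MyISAM;

CREATE TABLE log_2014_02 LIKE log_2014_01;

-- the MERGE table holds no data itself; it just points at the monthly MyISAM tables
CREATE TABLE log_2014 (
  logged_at DATETIME NOT NULL,
  message   VARCHAR(255) NOT NULL
) ENGINE=MERGE UNION=(log_2014_01, log_2014_02) INSERT_METHOD=LAST;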

MySQL INSERT SELECT on large static table

I need to copy the content of one table to another. So I started using:
INSERT new_table SELECT * FROM old_table
However, I am getting the following error now:
1297, "Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDBCLUSTER"
I think I have an understanding why this occurs: My table is huge, and MySQL tries to take a snapshot in time (lock everything and make one large transaction out of it).
However, my data is fairly static and there is no other concurrent session that would modify the data. How can I tell MySQL to copy one row at a time, or in smaller chunks, without locking the whole thing?
Edit note: I already know that I can just read the whole table row-by-row into memory/file/dump and write back. I am interested to know if there is an easy way (maybe setting isolation level?). Note that the engine is InnoDB.
Data migration is one of the few instances where a CURSOR can make sense: as you say, it lets you keep the number of locks per transaction sane.
Use a cursor in conjunction with a transaction, and commit after every row, or after every N rows (e.g. using a counter with modulo).
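MySQL cursors only live inside stored programs, so an alternative that achieves the same batching (not part of the original answer) is to break the copy into primary-key ranges; id here is an assumed auto-increment key:
INSERT INTO new_table SELECT * FROM old_table WHERE id >= 1     AND id < 10001;
INSERT INTO new_table SELECT * FROM old_table WHERE id >= 10001 AND id < 20001;
-- continue in 10,000-row ranges until MAX(id) is covered; with autocommit on, each
-- statement is its own small transaction, keeping the per-transaction operation
-- count well below MaxNoOfConcurrentOperations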
Select the data from InnoDB into an outfile and LOAD DATA INFILE it into the cluster.
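A sketch of that approach (paths and delimiters are placeholders; secure_file_priv must allow the path):
-- on the InnoDB source:
SELECT * FROM old_table
  INTO OUTFILE '/tmp/old_table.csv'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

-- connected to the NDB cluster target:
LOAD DATA INFILE '/tmp/old_table.csv'
  INTO TABLE new_table
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';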

Why is a mysqldump with --single-transaction more consistent than one without?

I have gone through the manual, and it mentions that the dump adds a BEGIN statement before it starts dumping data. Can someone explain this in a more understandable manner?
Here is what I read:
"This option issues a BEGIN SQL statement before dumping data from the server. It is useful only with transactional tables such as InnoDB and BDB, because then it dumps the consistent state of the database at the time when BEGIN was issued without blocking any applications."
Can someone elaborate on this?
Since the dump runs in one transaction, you get a consistent view of all the tables in the database. This is probably best explained by a counterexample. Say you dump a database with two tables, Orders and OrderLines:
1. You start the dump without a single transaction.
2. Another process inserts a row into the Orders table.
3. Another process inserts a row into the OrderLines table.
4. The dump processes the OrderLines table.
5. Another process deletes the Orders and OrderLines records.
6. The dump processes the Orders table.
In this example, your dump would contain the new OrderLines row but not the matching Orders row. The data would be in an inconsistent state and would fail on restore if there were a foreign key between OrderLines and Orders.
If you had done it in a single transaction, the dump would have neither the order nor the lines (but it would be consistent), since both were inserted and then deleted after the transaction began.
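For concreteness, the kind of constraint that makes such a restore fail (column names are assumptions):
ALTER TABLE OrderLines
  ADD CONSTRAINT fk_orderlines_order
  FOREIGN KEY (order_id) REFERENCES Orders (id);
-- restoring an OrderLines row whose order_id has no matching Orders row violates this constraint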
I used to run into problems where mysqldump without the --single-transaction parameter would consistently fail because data changed during the dump. As far as I can tell, when you run it within a single transaction, changes that occur during the dump can't cause a problem. Essentially, --single-transaction takes a snapshot of the database at that moment and dumps that, rather than dumping data that could be changing while the utility runs.
This can be important for backups because it means you get all the data, exactly as it is at one point in time.
So, for example, imagine a simple blog database where a typical bit of activity might be:
1. Create a new user
2. Create a new post by that user
3. Delete the user, which also deletes the post
Now when you back up your database, the backup may dump the tables in this order:
1. Posts
2. Users
What happens if someone deletes a user whose posts are required, just after your backup finishes #1 (Posts) but before it reaches #2 (Users)?
When you restore your data, you'll find that you have a post, but its user doesn't exist in the backup.
Putting a transaction around the whole thing means that the updates, inserts, and deletes that happen on the database during the backup aren't seen by the backup.

Testing for concurrency and/or transactional integrity in a web application with JMeter

I'm rather new to working with multiple threads in a database (most of my career has been spent on the frontend).
Today I tried testing a simple PHP app I wrote to store values in a MySQL database, using MyISAM tables and emulating transactions with table locking.
I just wrote a blog post on the procedure here: Testing With JMeter
From my results, my simple PHP app appears to keep the transactional integrity intact (the data in my CSV files is the same as the data I re-extracted from the database):
(Screenshots: the CSV files, and a query of the data for both users after the JMeter test run.)
Am I right in my assumption that the transactional data integrity is intact?
How do you test for concurrency?
Why not use InnoDB and get the same effect without manual table locks?
Also, what are you protecting against? Consider two users (Bill and Steve):
1. Bill loads record 1234.
2. Steve loads record 1234.
3. Steve changes record 1234 and submits.
4. Bill waits a bit, then updates the stale record 1234 and submits. Bill's changes clobber Steve's.
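Neither MyISAM table locks nor InnoDB row locks prevent this lost update by themselves. One common remedy (not part of the original answer) is an optimistic version check; the table and column names below are hypothetical:
-- both users read the record along with its version
SELECT id, body, version FROM records WHERE id = 1234;

-- each update only succeeds if nobody else changed the row in the meantime
UPDATE records
   SET body = 'new text', version = version + 1
 WHERE id = 1234 AND version = 7;    -- 7 = the version that was read
-- if ROW_COUNT() = 0, someone else modified the row; reload and retry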
Explicit table locking doesn't offer any higher data integrity than MyISAM's native table locking; MyISAM already locks the table files when required to prevent data corruption.
In fact, the reason to use InnoDB over MyISAM is that it does row locking instead of table locking, and it also supports transactions. Multiple updates to different records won't block each other, and a complex update to multiple records will block other writers to those rows only until its transaction completes.
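A sketch of what that row-level locking looks like in practice (table and column names are made up):
START TRANSACTION;
-- locks the matching row (assuming sku is indexed), not the whole table
SELECT quantity FROM inventory WHERE sku = 'ABC-123' FOR UPDATE;
UPDATE inventory SET quantity = quantity - 1 WHERE sku = 'ABC-123';
COMMIT;
-- a second session updating a different sku proceeds immediately;
-- one touching 'ABC-123' waits until this COMMIT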
You need to consider how likely it is that two updates to the same record will happen at the same time in your application. If it is likely, remember that table/row locking doesn't prevent the second update; it only postpones it until the first update completes.
EDIT
From what I remember, MyISAM has a special behavior for inserts. It doesn't need to lock the table at all for an insert as it's just appending to the end of the table. That may not be true for tables with unique indexes or non-autoincrement primary keys.