I'm trying to identify which transaction inserts data into a table on a MySQL 5.1 server using the InnoDB engine. Unfortunately I don't have access to the source code, so I have to guess what's happening by looking at the data in the DB.
It seems to me that a dataset that is supposed to be written to the DB in a single transaction is instead written in two separate transactions. I would like to test whether this assumption is true.
To do that, my idea was to add a column to my table, say TransactionID, and then have a trigger copy the transaction ID into that column.
But I've found that it doesn't seem possible to read the InnoDB transaction ID in MySQL 5.1.
Do you know of any other options for identifying the transaction involved in a data insertion?
I am using 2 separate processes via multiprocessing in my application. Both have access to a MySQL database via sqlalchemy core (not the ORM). One process reads data from various sources and writes them to the database. The other process just reads the data from the database.
I have a query that gets the latest record from a table and displays its id. However, it always displays the first id, created when I started the program, rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually I get correct results, but SQL alchemy is always giving me stale results.
Since you can see the changes your writer process is making with another MySQL tool, your writer process is indeed committing the data (at least it does if you are using InnoDB).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tool you are using probably has autocommit turned on, so a new transaction is implicitly started after each query.
To see the changes in SQLAlchemy, do as zzzeek suggests and change your monitoring/reader process so that it begins a new transaction for each check.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute(select([table]).where(table.c.id == 123).execution_options(autocommit=True))
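For completeness, a self-contained sketch of that technique, assuming SQLAlchemy 1.x Core and a hypothetical records table; only the execution_options(autocommit=True) call is the point, since it makes SQLAlchemy issue a COMMIT after the statement and end the implicit DBAPI transaction:

from sqlalchemy import create_engine, select, MetaData, Table, Column, Integer, String

# Hypothetical engine URL and table definition.
engine = create_engine("mysql://user:pass@host/db")
metadata = MetaData()
table = Table("records", metadata,
              Column("id", Integer, primary_key=True),
              Column("name", String(50)))

with engine.connect() as conn:
    result = conn.execute(
        select([table])
        .where(table.c.id == 123)
        .execution_options(autocommit=True)
    )
    print(result.fetchone())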
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running. For one process to see the data written by the other process, two things need to happen: 1. the transaction that created the new data must be committed, and 2. the current transaction, assuming it has already read some of that data, must be rolled back, or committed and started again. See "The InnoDB Transaction Model and Locking".
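A rough sketch of that restart-the-transaction approach, again with SQLAlchemy 1.x Core and hypothetical names; each poll begins and ends its own transaction, so every iteration gets a fresh InnoDB snapshot:

import time
from sqlalchemy import create_engine, select, MetaData, Table, Column, Integer

engine = create_engine("mysql://user:pass@host/db")  # hypothetical DSN
metadata = MetaData()
table = Table("records", metadata, Column("id", Integer, primary_key=True))

with engine.connect() as conn:
    while True:
        trans = conn.begin()                 # fresh transaction, fresh snapshot
        row = conn.execute(
            select([table]).order_by(table.c.id.desc()).limit(1)
        ).fetchone()
        trans.commit()                       # end it so the snapshot is released
        print(row)
        time.sleep(5)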
I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.
In my MySQL database, I have a table which contains an identical schema to the output of the query described above.
What I would like to develop is an hourly cron job that will run the query against the PostgreSQL database, then insert the results into the MySQL database. In any given hour, I don't expect to ever see more than 10,000 new rows (and that's a stretch) that would need to be transferred.
Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.
The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.
To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.
The closest thing I have to an approach is to use some scripting language to run the query, convert the results into an internal data structure, then write them back out to MySQL.
In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or - any documentation that I should look at which might be useful for doing this kind of conversion? I've found plenty of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (hourly run) seems less common and/or less documented.
Open to any suggestions.
Use the same database system on both ends and use replication
If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.
If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.
Sync using an external script
There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below) you'll probably have to use an external script (or psql) from a cron job to initiate replication, unless you're going to use PgAgent to do it.
Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.
If the rows to be copied are too big to comfortably fit in memory, you can use a cursor and read them with FETCH.
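For example, with psycopg2 a named (i.e. server-side) cursor does the DECLARE/FETCH round trips behind the scenes; a minimal sketch, with a hypothetical queue_table:

import psycopg2

pg = psycopg2.connect("dbname=collector")  # hypothetical DSN
cur = pg.cursor(name="sync_cursor")        # named => server-side cursor
cur.itersize = 1000                        # rows per FETCH round trip
cur.execute("SELECT * FROM queue_table")
for row in cur:
    pass  # push each row (or accumulate a batch) to MySQL here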
I'd do the work in this order (a rough Python sketch follows the list):
Connect to PostgreSQL
Connect to MySQL
Begin a PostgreSQL transaction
Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
Insert them into MySQL
DELETE any rows from the queue table in PostgreSQL if you haven't already.
COMMIT the MySQL transaction.
If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
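A rough sketch of that sequence, assuming the psycopg2 and MySQLdb drivers and hypothetical names (queue_table on the PostgreSQL side, mirrored_table of the same shape on the MySQL side):

import psycopg2
import MySQLdb

pg = psycopg2.connect("dbname=collector")        # hypothetical DSN
my = MySQLdb.connect(host="rds-host", db="app")  # hypothetical DSN

try:
    pg_cur = pg.cursor()
    my_cur = my.cursor()
    # Read and dequeue in one statement.
    pg_cur.execute("DELETE FROM queue_table RETURNING *")
    rows = pg_cur.fetchall()
    if rows:
        placeholders = ", ".join(["%s"] * len(rows[0]))
        my_cur.executemany(
            "INSERT INTO mirrored_table VALUES (" + placeholders + ")", rows
        )
    my.commit()   # commit MySQL first...
    pg.commit()   # ...then PostgreSQL; a MySQL failure leaves the queue intact
except Exception:
    my.rollback()
    pg.rollback()  # the DELETEd rows go back on the queue for the next run
    raise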
The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:
PREPARE TRANSACTION in PostgreSQL
COMMIT in MySQL
then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL depending on the outcome of the MySQL commit.
This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
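If you did need that, a hedged sketch using psycopg2's built-in two-phase-commit API, reusing the connections from the sketch above (the transaction id is arbitrary, and the PostgreSQL server must have max_prepared_transactions > 0):

xid = pg.xid(0, "mysql-sync", "batch-1")
pg.tpc_begin(xid)
# ... run the DELETE ... RETURNING / INSERT work here ...
pg.tpc_prepare()          # PREPARE TRANSACTION on the PostgreSQL side
try:
    my.commit()           # commit the MySQL side
    pg.tpc_commit()       # COMMIT PREPARED
except Exception:
    pg.tpc_rollback()     # ROLLBACK PREPARED
    raise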
BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.
Use DBI-Link in PostgreSQL
Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that uses DBI-Link via PL/PerlU to do the job.
When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and push the data from the queue table into it.
Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.
Use a trigger to queue changes and DBI-Link to sync
If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
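A hedged sketch of that queue-and-sync idea, driven from Python via psycopg2 (all table names are hypothetical; VALUES (NEW.*) copies the whole row):

import psycopg2

pg = psycopg2.connect("dbname=collector")
cur = pg.cursor()
cur.execute("""
    CREATE OR REPLACE FUNCTION enqueue_row() RETURNS trigger AS $body$
    BEGIN
        INSERT INTO queue_table VALUES (NEW.*);
        RETURN NEW;
    END;
    $body$ LANGUAGE plpgsql;

    CREATE TRIGGER queue_inserts AFTER INSERT ON main_table
    FOR EACH ROW EXECUTE PROCEDURE enqueue_row();
""")
pg.commit()

# Sync side: lock, read, clear, all in one transaction.
cur.execute("LOCK TABLE queue_table IN EXCLUSIVE MODE")
cur.execute("SELECT * FROM queue_table")
rows = cur.fetchall()
# ... push rows to MySQL via DBI-Link or a client connection ...
cur.execute("DELETE FROM queue_table")
pg.commit()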
Add MySQL to PostgreSQL with a foreign data wrapper
Alternatively, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data.
WITH moved_rows AS (
DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
In short you have two scenarios:
1) Make destination pull the data from source into its own structure
2) Make source push out the data from its structure to destination
I'd rather try the second: look around and find a way to create a PostgreSQL trigger, some special "virtual" table, or maybe a PL/pgSQL function. Then, instead of an external script, you'll be able to execute the procedure with a simple query from cron, or possibly from inside Postgres, which has some options for scheduling operations.
I'd choose the second scenario because Postgres is much more flexible, and when manipulating data in special, DIY ways you simply have more possibilities.
An external script probably isn't a good solution: for example, you'd need to treat binary data with special care, or convert dates from DATE to VARCHAR and then back to DATE again. Inside an external script, various text-stored data will probably be just strings, and you'll need to quote them too.
I have a SQL Server 2008 database with SET ALLOW_SNAPSHOT_ISOLATION ON and a Person table with columns ID (primary key), and SSN (unique non-clustered index).
One of the rows in the table is ID = 1, SSN = '777-77-7777'.
On one connection, this happens:
set transaction isolation level snapshot
begin transaction snapshot
while (1 = 1) select * from person where SSN = '777-77-7777'
Then on another connection:
update person set SSN = '555-55-5555' where ID = 1
As expected, the first connection continues to show the SSN as '777-77-7777' even after the second connection finishes execution. The execution plan for the first connection shows a 'clustered index seek' on SSN, but how can the first connection continue to use the index if the index key has been updated by the other connection?
Does SQL server do anything special to keep multiple versions of the indexes to accommodate for this?
I am trying to understand the performance characteristics of Snapshot Isolation level, and so want to confirm that SQL Server is smart enough to use existing indexes even when retrieving stale data from the row's previous versions.
As far as I can tell (using DBCC IND and DBCC PAGE as described here and looking at sys.dm_tran_version_store) when updating the index key in a database with snapshot isolation enabled the following happens.
The original row is copied into the version store.
The original row is marked as a ghost and its version pointer is updated to point at the copy in the version store.
A new row is inserted for the new key value.
At some later point the ghost cleanup process runs and removes the row.
The only difference in your scenario seems to be that the ghost cleanup process does not remove the row until it is no longer required by an outstanding snapshot isolation transaction. That is, the B-tree contains rows for both the old and new key values until they are no longer required, which allows an index seek on the old value to keep working as before.
With snapshot isolation, SQL Server puts "snapshots" of the data being modified into tempdb, and other connections read from there. So your first connection here is reading its values, and the relevant index entries involved, from a snapshot copy in tempdb.
I have a VFP-based application with a directory full of DBFs. I use ODBC in .NET to connect and perform transactions on this database. I want to mirror this data to MySQL running on my webhost.
Notes:
This will be a one-way mirror only. VFP to mySQL
Only inserts and updates must be supported. Deletes don't matter
Not all tables are required. In fact, I would prefer to use a defined SELECT statement to mirror only pseudo-views of the necessary data
I do not have the luxury of a "timemodified" stamp on any VFP records.
I don't have a ton of data records (maybe a few thousand total), nor do I have a ton of concurrent users on the MySQL side, but I still want to be as efficient as possible.
Proposed Strategy for Inserts (doesn't seem that bad; a rough sketch follows the list...):
Build a temp table in MySQL and insert all the primary keys of the VFP table/view I want to mirror
Run "SELECT primaryKey FROM tempTable WHERE primaryKey NOT IN (SELECT primaryKey FROM mirroredTable)" on the MySQL side to identify missing records
Generate and run the necessary INSERT sql for those records
Blow away the temp table
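Roughly, in Python, a sketch of what I mean (the driver choices, DSNs, and column names here are just assumptions):

import pyodbc
import MySQLdb

# VFP via the Visual FoxPro ODBC driver; MySQL via MySQLdb.
vfp = pyodbc.connect(
    r"DRIVER={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=C:\data"
)
my = MySQLdb.connect(host="webhost", db="mirror")
cur = my.cursor()

# 1. Temp table holding every key currently in VFP.
cur.execute("CREATE TEMPORARY TABLE temp_keys (pk INT PRIMARY KEY)")
keys = [(row.pk,) for row in vfp.execute("SELECT pk FROM source_view")]
cur.executemany("INSERT INTO temp_keys VALUES (%s)", keys)

# 2. Keys present in VFP but missing from the mirror.
cur.execute(
    "SELECT pk FROM temp_keys "
    "WHERE pk NOT IN (SELECT pk FROM mirroredTable)"
)
missing = [r[0] for r in cur.fetchall()]

# 3. ...fetch those rows from VFP and INSERT them into mirroredTable...
# 4. The temp table vanishes when the connection closes.
my.commit()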
Proposed Strategy for Updates (seems really heavyweight, and dropping the live MySQL table probably breaks open queries; see the swap sketch after the list):
Build temp table in mySQL and insert ALL records from VFP table/view I want to mirror
Drop existing mySQL table
Alter tempTable name to new table name
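Continuing the sketch above, one way to avoid the drop entirely: MySQL's multi-table RENAME TABLE is atomic, so readers never see the table missing:

# Build the replacement alongside the live table, then swap atomically.
cur.execute("CREATE TABLE mirrored_new LIKE mirroredTable")
# ... bulk-insert all rows from the VFP view into mirrored_new ...
cur.execute(
    "RENAME TABLE mirroredTable TO mirrored_old, "
    "mirrored_new TO mirroredTable"
)
cur.execute("DROP TABLE mirrored_old")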
These are just the first strategies that come to mind, I'm sure there are more effective ways of doing it (especially the update side).
I'm looking for some alternate strategies here. Any brilliant ideas?
It sounds like you're going for something small, but you might try glancing at some replication design patterns. Microsoft has documented some data replication patterns here and that is a good starting point. My suggestion is to check out the simple Move Copy of Data pattern.
Are your VFP tables in a VFP database (DBC)? If so, you should be able to use triggers on that database to set up the information about what data needs to updated in MySQL.
I'm rather new to working with multiple threads in a database (most of my career has been spent on the frontend).
Today I tried testing a simple PHP app I wrote to store values in a MySQL DB, using MyISAM tables and emulating transactions with table locking.
I just wrote a blog post on the procedure here:
Testing With JMeter
From my results, my simple PHP app appears to keep transactional integrity intact (the data in my CSV files is the same as the data I re-extracted from the database):
CSV Files:
Query Of Data for Both Users After JMeter Test Run:
Am I right in my assumption that the transactional data integrity is intact?
How do you test for concurrency?
Why not use InnoDB and get the same effect without manual table locks?
Also, what are you protecting against? Consider two users (Bill and Steve):
Bill loads record 1234
Steve loads record 1234
Steve changes record 1234 and submits
Bill waits a bit, then updates the stale record 1234 and submits. His changes clobber Steve's. (A sketch of one guard against this follows.)
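One common guard is optimistic locking with a version column. A hedged sketch, assuming MySQLdb and a hypothetical records table that carries such a column:

import MySQLdb

db = MySQLdb.connect(host="localhost", db="app")
cur = db.cursor()

# Read the row along with its current version.
cur.execute("SELECT data, version FROM records WHERE id = %s", (1234,))
data, version = cur.fetchone()

new_data = "Bill's edit"  # ... user edits the value ...

# The UPDATE only matches if nobody bumped the version in the meantime.
cur.execute(
    "UPDATE records SET data = %s, version = version + 1 "
    "WHERE id = %s AND version = %s",
    (new_data, 1234, version),
)
if cur.rowcount == 0:
    db.rollback()  # someone else won the race: reload and retry
else:
    db.commit()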
Manual table locking doesn't offer any higher data integrity than MyISAM's native table locking. MyISAM natively locks the table when required to prevent data corruption.
In fact, the reason to use InnoDB over MyISAM is that it does row locking instead of table locking, and it also supports transactions. Multiple updates to different records won't block each other, and complex updates to multiple records block others only until the transaction is complete.
You need to consider the chance that two updates to the same record will happen at the same time in your application. If it's likely, remember that table/row locking doesn't prevent the second update; it only postpones it until the first update completes.
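With InnoDB you can also take the pessimistic route for short transactions, where the read and the write happen together: SELECT ... FOR UPDATE makes the second writer wait and then read the committed value. A hedged sketch with hypothetical names:

import MySQLdb

db = MySQLdb.connect(host="localhost", db="app")  # autocommit off by default
cur = db.cursor()

# Locks the row: a second connection's SELECT ... FOR UPDATE on id 1234
# blocks here until this transaction commits or rolls back.
cur.execute("SELECT data FROM records WHERE id = %s FOR UPDATE", (1234,))
(data,) = cur.fetchone()
cur.execute("UPDATE records SET data = %s WHERE id = %s", (data + "!", 1234))
db.commit()  # releases the row lock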
EDIT
From what I remember, MyISAM has a special behavior for inserts. It doesn't need to lock the table at all for an insert as it's just appending to the end of the table. That may not be true for tables with unique indexes or non-autoincrement primary keys.