Can MySqlBulkLoader be used with a transaction? I don't see a way to explicitly attach a transaction to an instance of the loader. Is there another way?
As stated here by a member of the MySQL documentation team:
It's not atomic. The records loaded prior to the error will be in the table.
The workaround is to import the data into a dedicated staging table and then execute INSERT INTO ... SELECT ..., which is an atomic operation. On huge data sets this is a potential problem because of the long transaction.
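A minimal SQL sketch of that workaround, assuming a hypothetical orders table with id and amount columns (the staging and target table names are placeholders; in practice you would point MySqlBulkLoader at the staging table rather than issuing LOAD DATA by hand):

-- Stage the file into a throwaway copy of the target table.
CREATE TABLE staging_orders LIKE orders;

-- This is roughly what MySqlBulkLoader issues under the hood.
LOAD DATA LOCAL INFILE '/tmp/orders.csv'
INTO TABLE staging_orders
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- Copy the staged rows in a single transaction so the target table
-- sees either all of them or none of them.
START TRANSACTION;
INSERT INTO orders (id, amount)
SELECT id, amount FROM staging_orders;
COMMIT;

DROP TABLE staging_orders;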
The MySQL manual indicates that MySqlBulkLoader is a wrapper around LOAD DATA INFILE. While looking at the LOAD DATA INFILE documentation I noticed this paragraph:
If you specify IGNORE, input rows that duplicate an existing row on a unique key value are skipped. If you do not specify either option, the behavior depends on whether the LOCAL keyword is specified. Without LOCAL, an error occurs when a duplicate key value is found, and the rest of the text file is ignored. With LOCAL, the default behavior is the same as if IGNORE is specified; this is because the server has no way to stop transmission of the file in the middle of the operation.
I found no discussion of transactions, but the above paragraph would indicate that transactions are not possible.
A workaround would be to import the data into a staging table and then use a separate stored procedure to process the data into the desired table inside a transaction.
So, in answer to the question: no, there is no direct way to attach a transaction to MySqlBulkLoader; the staging-table approach is the practical workaround.
I'm connecting the database of a legacy application to another database by using a CDC tool (Zendesk's Maxwell) to read data from the database, and another program I'm writing that writes data originating elsewhere into the database.
The problem is that when the foreign data is written into the legacy database, the CDC tool will pick it up. I want to prevent this, since that foreign data already exists in the other system.
I thought about adding a column to all the tables in the legacy database, called _origin for example, and putting the origin of each row in that column. The issue with that is that the legacy application does a lot of updates, so the CDC tool would miss real updates, since the _origin column wouldn't change.
Is there a way to somehow write metadata into the MySQL binlog to indicate the origin of this specific transaction? I would have to figure out how to read it using the CDC tool or modify a CDC tool to read such metadata, but I want to see if it's even possible.
Or, is there a better way to do this?
No, there's no way to write your own custom metadata into the binlog. Just the data itself, and certain session variables.
One solution could be when you read data from the CDC to insert into your destination database, use REPLACE instead of INSERT. The syntax is the same, but it overwrites instead of appends if the row already exists.
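A minimal sketch of that idea, using a hypothetical legacy_customers table (table and column names are placeholders):

-- REPLACE uses the same syntax as INSERT, but if a row with the same
-- primary or unique key already exists, the old row is removed and the
-- new one is written in its place, so re-applying existing data is harmless.
REPLACE INTO legacy_customers (id, name)
VALUES (42, 'Acme Corp');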
I have an SSIS package that imports a flat CSV file; there are approximately 200,000 records in the file. I've given the destination table a primary key on the account number. There shouldn't be any duplicates in the source data (it's application-controlled, outside of my influence).
However, there is one duplicate row in the CSV, and when I add the primary key it redirects 7k rows... these aren't duplicate rows; it just appears to redirect a load of them for no reason.
If I manually remove the single duplicate row it works perfectly. There is nothing special about the data or the files; it should just import the data and redirect the error row.
This behavior is due to the OLE DB Destination being used in fast load mode.
In fast load mode, the OLE DB Destination issues an INSERT BULK command and inserts rows in batches. If one row within a batch violates a table constraint, the whole batch fails and is redirected to the error output. This accounts for the behavior that looks strange at first glance: more than one row gets rejected.
What you can do about it depends on your goal and limitations:
If you simply want to filter out the duplicates, switch the OLE DB Destination to regular insert mode, at the cost of a significant performance decrease. This is the simplest way.
If the performance drop is not an option and you need to keep it simple, use a Sort component in the data flow and tick its flag to remove rows with duplicate sort values. Caveat: you have no control over which row will be discarded.
If you need to implement business rules on which data should pass, then you have to add some scoring column and use it to filter rows. See Todd McDermid's article on this.
I'm receiving a MySQL dump file (.sql) daily from an external server, which I don't have any control over. I created a local database to store all the data in the .sql file, and I hope to set up a script to automatically update my local database daily. The .sql file I receive each day contains old data that is already in the local database. How can I avoid duplicating such old data and insert only new data into the local MySQL server? Thank you very much!
You can use a third-party database compare tool, such as those from Red Gate, with two databases: your current one (the "master") and one loaded from the new dump. You can then run the compare tool between the two versions and apply only the changes between them to your master.
Use unique constraints on the fields that you want to be unique.
Also, as Danny Beckett mentioned, to avoid errors in the output (which I would prefer to redirect into a file for later analysis, to check that I haven't missed anything in the process), you can use the INSERT IGNORE construct instead of INSERT.
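A minimal sketch of that combination, assuming a hypothetical accounts table keyed on account_number (all names are placeholders):

-- Enforce uniqueness once on the column that must not repeat.
ALTER TABLE accounts ADD UNIQUE KEY uq_account_number (account_number);

-- Rows that would violate the unique key are silently skipped
-- and reported as warnings rather than aborting the statement.
INSERT IGNORE INTO accounts (account_number, name)
VALUES (1001, 'Alice'), (1002, 'Bob');

SHOW WARNINGS;  -- lists any rows that were skipped as duplicates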
You can use a unique constraint combined with INSERT IGNORE.
The second option is to first insert the data into a temp table and then insert only the difference, as sketched below.
With the second option you can also apply some restriction so that you don't have to search for duplicates across all of the records already stored in the database.
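A minimal sketch of the temp-table approach, again with hypothetical accounts / account_number names:

-- Load the whole daily dump into a temporary copy of the table.
CREATE TEMPORARY TABLE tmp_import LIKE accounts;
-- ... import the daily data into tmp_import here ...

-- Insert only the rows that are not already present in the real table.
-- Optionally restrict the comparison (e.g. to recent rows) to keep the join small.
INSERT INTO accounts (account_number, name)
SELECT t.account_number, t.name
FROM tmp_import t
LEFT JOIN accounts a ON a.account_number = t.account_number
WHERE a.account_number IS NULL;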
You need to create a primary key in your table. It should be a unique combination of column values. Using the INSERT query with IGNORE will avoid adding duplicates in this table.
see http://dev.mysql.com/doc/refman/5.5/en/insert.html
If this is a plain vanilla mysqldump file, then normally it includes DROP TABLE IF EXISTS ... statements and CREATE TABLE statements, so the tables are recreated when the data is imported. So duplicate data should not be a problem, unless I'm missing something.
I'm working on implementing and designing my first database and have a lot of columns with names and addresses and the like.
It seems logical to place a CHECK constraint on these columns so that the DB only accepts values from an alphanumeric range (disallowing any special characters).
I am using MySQL, which, as far as I can tell, doesn't support user-defined types. Is there an easy way to do this?
It seems worthwhile to prevent bad data from entering the DB, but should this complex checking be offloaded to the application instead?
You can't do it with a CHECK constraint if you're using MySQL (the question is tagged with mysql, so I presume this is the case): MySQL doesn't support CHECK constraints (at least prior to 8.0.16, which added enforcement). They are allowed in the syntax (to be compatible with DDL from other databases), but are otherwise ignored.
You could add a trigger to the table that fires on insert and update and checks the data for compliance, but if you find a problem there's no way (prior to MySQL 5.5's SIGNAL) to raise an exception from a MySQL stored routine.
I have used the workaround of hitting a table that doesn't exist, but has a name that conveys the meaning you want, e.g.
update invalid_characters set col1 = 1;
and hope that the person reading the "table invalid_characters does not exist" message gets the idea.
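A minimal sketch of that workaround, assuming a hypothetical people table with a name column (the invalid_characters table deliberately does not exist):

DELIMITER //
CREATE TRIGGER people_check_name BEFORE INSERT ON people
FOR EACH ROW
BEGIN
  -- Reject anything outside a plain alphanumeric range.
  IF NEW.name REGEXP '[^A-Za-z0-9 ]' THEN
    -- Fails at runtime with "Table 'invalid_characters' doesn't exist",
    -- which aborts the insert and hints at the real problem.
    UPDATE invalid_characters SET col1 = 1;
  END IF;
END//
DELIMITER ;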
There are several settings that allow you to change how MySQL handles certain situations, but those aren't enough for your case.
I would stick with data validation on the application side, but if you need validation on the database side, you have two options:
CREATE PROCEDURE that validates and inserts the data, and either does nothing or raises an error by calling SIGNAL
CREATE TRIGGER ... BEFORE INSERT that validates the data and stops the insert, as suggested in this Stack Overflow answer and sketched below
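A minimal sketch of the trigger option on MySQL 5.5+, again using a hypothetical people table and name column:

DELIMITER //
CREATE TRIGGER people_validate_name BEFORE INSERT ON people
FOR EACH ROW
BEGIN
  -- Allow only alphanumeric characters and spaces; reject everything else.
  IF NEW.name REGEXP '[^A-Za-z0-9 ]' THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'name contains invalid characters';
  END IF;
END//
DELIMITER ;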
I have a large database which I am trying to update via perl. The information to be added comes from a csv file which I do not control (but which is trusted—it comes from a different part of our company). For each record in the file, I need to either add it (if it does not exist) or do nothing (if it exists). Adding a record consists of the usual INSERT INTO, but before that can run for a particular entry a specific UPDATE must be run.
Let's say for the sake of concreteness that the file has 10,000 entries, but 90% of them are already in the database. What is the most efficient way to import the records? I can see a few obvious approaches:
Pull all records of this type from the database, then check each of the entries from the file for membership. Downside: lots of data transfer, possibly enough to time the server out.
Read in the entries from the file and send a query for just those records with an RLIKE 'foo|bar|baz|...' query (or a stuff = 'foo' || stuff = 'bar' || ... query, but that seems even worse). Downside: huge query, probably enough to choke the server.
Read in the file, send a query for each entry, then add it if appropriate. Downside: tens of thousands of queries, very slow.
Apart from the UPDATE requirement, this seems like a fairly standard issue that presumably has a standard solution. If there is, it can probably be adapted to my case with appropriate use of tests on the auto_increment primary key.
The standard solution is to use INSERT IGNORE which won't raise an error if the insertion would fail because of a constraint. This isn't much use to you as it doesn't give you a chance to do the UPDATE before you know the INSERT is going to work. If you can do the update afterwards, however, this is ideal: just INSERT IGNORE each record and then do the UPDATE if it succeeded.
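A minimal sketch of that flow for a single record, with hypothetical items / related_table names; the Perl code would simply test the affected-row count the driver returns for the INSERT:

-- Insert the record; duplicates on the unique key are skipped silently.
INSERT IGNORE INTO items (item_key, description)
VALUES ('foo', 'first record');

-- ROW_COUNT() is 1 if a new row was added, 0 if the insert was ignored.
SET @inserted = ROW_COUNT();

-- Run the follow-up UPDATE only when the record was actually new.
UPDATE related_table
SET counter = counter + 1
WHERE item_key = 'foo' AND @inserted = 1;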
If a record already exists that means a record with a matching unique key is already in the database, so I don't understand the RLIKE proposal which is bound to be slow.
I would use Perl to grep the CSV file, running SELECT COUNT(*) FROM table WHERE key = ? for each record and removing anything where the result is non-zero.
Then just do your UPDATE and INSERT for everything left in the filtered CSV data.
There is no need to worry about the server timing out if you keep flushing data while iterating over the list.