Efficiently bulk import data in SSIS with occasional PK duplicates? - sql-server-2008

I'm regularly loading a flat file of about 100k records into a table after some transformations. The table has a PK on two columns. On the whole the data does not contain duplicate PK values, but occasionally there are duplicates.
I naively didn't understand why SSIS was rejecting all my records when only some of them violated the PK constraint. I believe the problem is that during a bulk load, if even one row violates the PK constraint, every row in that batch gets rejected.
If I set the FastLoadMaxInsertCommitSize property of the OLE DB Destination to 1, it fixes the problem, but then it runs like a dog because it commits after every single row.
In MySQL, the bulk load facility lets you ignore PK errors and skip those rows without sacrificing performance. Does anyone know of a way to achieve this in SQL Server?
Any help much appreciated.

It sounds like you may be looking for IGNORE_DUP_KEY?
Using the IGNORE_DUP_KEY Option to Handle Duplicate Values
When you create or modify a unique index or constraint, you can set the IGNORE_DUP_KEY option ON or OFF. This option specifies the error response to duplicate key values in a multiple-row INSERT statement after the index has been created. When IGNORE_DUP_KEY is set to OFF (the default), the SQL Server Database Engine rejects all rows in the statement when one or more rows contain duplicate key values. When set to ON, only the rows that contain duplicate key values are rejected; the nonduplicate key values are added.
For example, if a single statement inserts 20 rows into a table with a unique index, and 10 of those rows contain duplicate key values, by default all 20 rows are rejected. However, if the index option IGNORE_DUP_KEY is set to ON, only the 10 duplicate key values will be rejected; the other 10 nonduplicate key values will be inserted into the table.
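As a minimal T-SQL sketch of how that can be set up for a two-column PK (the table and column names here are made up, not from the question):

-- A two-column PK created WITH (IGNORE_DUP_KEY = ON): a multi-row insert now
-- drops only the duplicate rows instead of failing outright.
CREATE TABLE dbo.ImportTarget (
    KeyPart1 INT NOT NULL,
    KeyPart2 INT NOT NULL,
    Payload  VARCHAR(100) NULL,
    CONSTRAINT PK_ImportTarget
        PRIMARY KEY (KeyPart1, KeyPart2)
        WITH (IGNORE_DUP_KEY = ON)
);

-- The second (1, 1) row is silently skipped; the other two rows go in.
-- SQL Server reports "Duplicate key was ignored." as a warning, not an error.
INSERT INTO dbo.ImportTarget (KeyPart1, KeyPart2, Payload)
VALUES (1, 1, 'first'),
       (1, 1, 'duplicate'),
       (1, 2, 'second');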

You can up FastLoadMaxInsertCommitSize to, say, 5k; this will speed up your inserts greatly. Then set the Error Output of the OLE DB Destination to redirect rows: any 5k batch that contains an error row gets sent, via the error output, to another destination. (This next bit is from memory!) If you set that second destination up not to use fast load, it will insert the good rows one at a time, and you can pass its error output on to an error table or something like a Row Count task.
You can play with the FastLoadMaxInsertCommitSize figure until you find something that works well for you.
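If you go the redirect route you also need somewhere for the bad rows to land. A rough sketch of such an error table (the names are made up); ErrorCode and ErrorColumn correspond to the two extra columns SSIS appends to an error output:

CREATE TABLE dbo.ImportTarget_Errors (
    KeyPart1    INT          NULL,
    KeyPart2    INT          NULL,
    Payload     VARCHAR(100) NULL,
    ErrorCode   INT          NULL,  -- supplied by the SSIS error output
    ErrorColumn INT          NULL,  -- supplied by the SSIS error output
    LoadedAt    DATETIME     NOT NULL DEFAULT GETDATE()
);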

Related

Adding a UNIQUE key to a large existing MySQL table which is receiving INSERTs/DELETEs

I have a very large table (dozens of millions of rows) and a UNIQUE index needs to be added to a column on that table. I know for a fact that the table does contain duplicated values on that key, which I need to clean up (by deleting rows/resetting the value of the column to something unique that I can automatically generate). A plus is that the rows which are already duplicated do not get modified anymore.
What would be the right approach to perform a change like this, given that I will be probably using the Percona pt-osc tool and there are continuous deletes/inserts on the table? My plan was:
Add code that ensures no duplicate IDs get inserted anymore. I will probably need to add a separate table for this temporarily, since I want the database to enforce this for me and not the application: insert into the "shadow table" (which has a unique index) in a transaction together with my main table, and roll back any insert that tries to insert a duplicate value.
Backfill the table by zapping all invalid column values within the primary key range below $current_pkey_value.
Then add the index and use pt-osc to switch over the table.
Is there anything I am missing?
Since we use pt-online-schema-change, synchronisation from the existing table to the temp table is performed by triggers. The tool actually has a special option for this, --no-check-unique-key-change, which does exactly what we need: it agrees to perform the ALTER TABLE and sets up the triggers in such a way that, if a conflict occurs, INSERT .. IGNORE is applied and the first row to have used the now-unique value wins during synchronisation. For us this is a good trade-off, because all the duplicates we have seen resulted from data races, not from actual conflicts in the value-generation process.
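As a rough MySQL illustration of that "first row wins" behaviour (the table and column names are made up, and this is not the exact SQL pt-osc generates):

-- Copying rows with INSERT IGNORE into a table that now has the unique index:
-- when two source rows share the same dedup_col value, the first one copied is
-- kept and later duplicates are silently dropped.
CREATE TABLE t_new LIKE t_old;
ALTER TABLE t_new ADD UNIQUE INDEX ux_dedup (dedup_col);

INSERT IGNORE INTO t_new
SELECT * FROM t_old;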

SQL Query Not Adding New Entries With INSERT IGNORE INTO

So I have a script that gets data about 100 items at a time and inserts them into a MySQL database with a command like this:
INSERT IGNORE INTO beer(name, type, alcohol_by_volume, description, image_url) VALUES('Bourbon Barrel Porter', 2, '9.1', '', '')
I ran the script once, and it populated the DB with 100 entries. However, I ran the script again with the same SQL syntax, gathering all new data (i.e., no duplicates), but the database is not reflecting any new entries -- it is the same 100 entries I inserted on the first iteration of the script.
I logged the queries, and I can confirm that the queries were making requests with the new data, so it's not a problem in the script not gathering new data.
The name field is a unique field, but no other fields are. Am I missing something?
If you use the IGNORE keyword, errors that occur while executing the INSERT statement are treated as warnings instead. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row still is not inserted, but no error is issued.
If there is no primary key, there can't be a duplicate key to ignore. You should always set a primary key, so please do that - and if you want additional columns that shouldn't contain duplicates, set them as "unique".
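For the beer table above, that might look something like the following sketch (the column types are guesses; only the statement shape matters):

-- With an auto-increment PRIMARY KEY and a UNIQUE key on name, INSERT IGNORE
-- skips rows whose name already exists instead of raising a duplicate-key error.
CREATE TABLE beer (
    id                INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name              VARCHAR(255) NOT NULL,
    type              INT NOT NULL,
    alcohol_by_volume DECIMAL(4,1) NULL,
    description       TEXT NULL,
    image_url         VARCHAR(500) NULL,
    UNIQUE KEY ux_beer_name (name)
);

INSERT IGNORE INTO beer(name, type, alcohol_by_volume, description, image_url)
VALUES('Bourbon Barrel Porter', 2, '9.1', '', '');  -- inserted the first time

INSERT IGNORE INTO beer(name, type, alcohol_by_volume, description, image_url)
VALUES('Bourbon Barrel Porter', 2, '9.1', '', '');  -- skipped: warning, not an error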

REPLACE INTO, does it re-use the PRIMARY KEY?

The REPLACE INTO statement in MySQL works in such a way that it deletes the row and inserts a new one. In my table the primary key (id) is auto-incremented, so I was expecting it to delete the row and then insert a new one with an id at the tail of the table.
However, it does the unexpected and inserts it with the same id! Is this the expected behaviour, or am I missing something here? (I am not setting the id when calling the REPLACE INTO statement)
This is expected behaviour if you have another UNIQUE index in your table - which you must have, otherwise it would have added a new row as you expected. See the documentation:
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 13.2.5, “INSERT Syntax”.
https://dev.mysql.com/doc/refman/5.5/en/replace.html
This also really makes a lot of sense, because how else would MySQL find the row to replace? It could only scan the whole table, and that would be time consuming. I created an SQL Fiddle to demonstrate this; please have a look here.
That is expected behavior. Technically, in cases where ALL unique keys (not just primary key) on the data to be replaced/inserted are a match to an existing row, MySQL actually deletes your existing row and inserts a new row with the replacement data, using the same values for all the unique keys. So, if you look to see the number of affected rows on such a query you will get 2 affected rows for each replacement and only one for the straight inserts.
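A small sketch of that (the names are made up); the conflict here is on the secondary UNIQUE key rather than the primary key:

-- REPLACE matching on the unique sku: the old row is deleted and a new row is
-- inserted, so MySQL reports 2 affected rows, versus 1 for a plain insert.
CREATE TABLE gadgets (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    sku VARCHAR(50) NOT NULL,
    qty INT NOT NULL,
    UNIQUE KEY ux_gadgets_sku (sku)
);

INSERT INTO gadgets (sku, qty) VALUES ('ABC-1', 5);   -- 1 row affected

REPLACE INTO gadgets (sku, qty) VALUES ('ABC-1', 7);  -- 2 rows affected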

Preventing duplicate rows based on a column (MySQL)?

I'm building a system that updates its local database from other APIs frequently. I have Python scripts set up as cron jobs, and they do the job almost fine.
However, the one flaw is that the scripts take ages to run. When they are run for the first time the process is quick, but after that it takes nearly 20 minutes to go through a list of 200k+ items received from the third-party API.
The problem is that the script first gets all the rows from the database and adds their must-be-unique column value to a list. Then, when going through the API results, it checks whether the current item's must-be-unique value exists in the list. This gets really heavy, as the list has over 200k values in it.
Is there a way to check in an INSERT-query that, based on a single column, there is no duplicate? If there is, simply not add the new row.
Any help will be appreciated =)
If you add a UNIQUE key to the column(s) that have to contain UNIQUE values, MySQL will complain when you insert a row that violates this constraint.
You then have three options, sketched below:
INSERT IGNORE will try to insert and, in case of violation, do nothing.
INSERT ... ON DUPLICATE KEY UPDATE will try to insert and, in case of violation, update the existing row with the new values.
REPLACE will try to insert and, in case of violation, DELETE the offending existing row and INSERT the new one.
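A quick sketch of the three, assuming a table with a UNIQUE key on the must-be-unique column (all names here are made up):

CREATE TABLE items (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    code VARCHAR(100) NOT NULL,      -- the must-be-unique column
    qty  INT NOT NULL DEFAULT 0,
    UNIQUE KEY ux_items_code (code)
);

-- 1. Skip the row entirely if code already exists.
INSERT IGNORE INTO items (code, qty) VALUES ('abc', 10);

-- 2. Insert, or update the existing row when code already exists.
INSERT INTO items (code, qty) VALUES ('abc', 20)
ON DUPLICATE KEY UPDATE qty = VALUES(qty);

-- 3. Insert, first deleting any existing row with the same code.
REPLACE INTO items (code, qty) VALUES ('abc', 30);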

MS Access "duplicate values" error, but I don't know why

I have a VB6 program that adds a column to an MS Access database thusly:
alter table x add column y long constraint z unique
The program goes through a number of databases without error; however, on the one that I am looking at now, it gives me "The changes you requested to the table were not successful because they would create duplicate values in the index, primary key, or relationship..."
In case it makes a difference, I add values to the column by forming a recordset of the primary key and new column values, then going through each record to add a value to this column. I do recordSet.updateBatch when I'm all done.
If I remove the constraint, it completes normally; I have put all 1600 values into a spreadsheet, sorted by the values I've added, and used a formula to check for duplicates. There aren't any. All rows get a new value, none of the new values are the same as any other new value.
Are there other reasons why I might get this error? I really don't want to remove the constraint, but I don't know how to get past this.
Since you're certain you're not attempting to insert rows which violate the table's index constraints, perhaps you have a corrupted index. See whether Compact & Repair cures the problem. But first make a backup of the database.
You could also recreate the table in a new database and test it there.
You can find further information about corruption in Tony Toews' Corrupt Microsoft Access MDBs FAQ.