Improve upsert method - SSIS

I have a database table of 100,000 rows that is imported from a CSV file each week using an SSIS package.
The import usually updates existing rows, but sometimes it adds new ones.
I see a few exceptions involving the staging table while the rows are being updated, and I don't know why. How should I update the destination table from the staging table?
This is the MERGE code:
MERGE INTO [PWCGFA_BG].[dbo].[Bank_Guarantees] WITH (HOLDLOCK) AS bg
USING [PWCGFA_BG].[dbo].[stagingBG] AS stgbg
ON bg.IATA_CODE = stgbg.IATA_CODE
WHEN MATCHED THEN
UPDATE SET
bg.LEGAL_NAME=stgbg.LEGAL_NAME,
bg.TRADING_NAME=stgbg.TRADING_NAME,
bg.COUNTRY=stgbg.COUNTRY,
bg.CURRENCY=stgbg.CURRENCY,
bg.LANGUAGE=stgbg.LANGUAGE,
bg.STATUS=stgbg.STATUS,
bg.BANK_NAME=stgbg.BANK_NAME,
bg.BANK_GUARANTEE_AMOUNT=stgbg.BANK_GUARANTEE_AMOUNT,
bg.BANK_GUARANTEE_CURRENCY=stgbg.BANK_GUARANTEE_CURRENCY,
bg.BANK_GUARANTEE_EXPIRY_DATE=stgbg.BANK_GUARANTEE_EXPIRY_DATE,
bg.ACCREDITATION_DATE=stgbg.ACCREDITATION_DATE,
bg.CLASS_PAX_OR_CGO=stgbg.CLASS_PAX_OR_CGO,
bg.LOCATION_TYPE=stgbg.LOCATION_TYPE,
bg.XREF=stgbg.XREF,
bg.IRRS=stgbg.IRRS,
bg.TAX_CODE=stgbg.TAX_CODE,
bg.COUNTRY_CODE=stgbg.COUNTRY_CODE,
bg.CITY=stgbg.CITY,
bg.DEF=stgbg.DEF,
bg.OWN_SHARE_CHANGE=stgbg.OWN_SHARE_CHANGE
WHEN NOT MATCHED BY TARGET THEN
INSERT (IATA_CODE,LEGAL_NAME,TRADING_NAME,COUNTRY,CURRENCY,LANGUAGE,STATUS,BANK_NAME,BANK_GUARANTEE_AMOUNT,BANK_GUARANTEE_CURRENCY,BANK_GUARANTEE_EXPIRY_DATE,ACCREDITATION_DATE,CLASS_PAX_OR_CGO,LOCATION_TYPE,XREF,IRRS,TAX_CODE,COUNTRY_CODE,CITY,DEF,OWN_SHARE_CHANGE)
VALUES (stgbg.IATA_CODE,stgbg.LEGAL_NAME,stgbg.TRADING_NAME,stgbg.COUNTRY,stgbg.CURRENCY,stgbg.LANGUAGE,stgbg.STATUS,stgbg.BANK_NAME,stgbg.BANK_GUARANTEE_AMOUNT,stgbg.BANK_GUARANTEE_CURRENCY,stgbg.BANK_GUARANTEE_EXPIRY_DATE,stgbg.ACCREDITATION_DATE,stgbg.CLASS_PAX_OR_CGO,stgbg.LOCATION_TYPE,stgbg.XREF,stgbg.IRRS,stgbg.TAX_CODE,stgbg.COUNTRY_CODE,stgbg.CITY,stgbg.DEF,stgbg.OWN_SHARE_CHANGE)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;

If your source (staging) and destination tables are on the same server, you can run the MERGE statement from an Execute SQL Task. A single set-based MERGE is faster and far more effective than a Lookup, which processes rows one by one.
But if the destination is on a different server, you have the following options:
Use a Lookup to find the matching rows and update them with an OLE DB Command (an UPDATE statement).
Use a Merge Join (with a LEFT OUTER JOIN) to identify the new/matching records, then use a Conditional Split to INSERT or UPDATE them. This works the same way as the Lookup but is faster.
Create a temporary table in the destination database, dump the data from staging into that table, and then use the MERGE statement there; this is faster than using a Lookup (see the sketch below).
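For the last option, a minimal sketch of the pattern, using a hypothetical work table and a reduced column list for brevity (substitute your own schema):

-- run on the destination server, e.g. from an Execute SQL Task
-- 1. create a work table and load the weekly extract into it
--    (populate it with a Data Flow Task or BULK INSERT)
CREATE TABLE dbo.tmpBG (IATA_CODE varchar(20) PRIMARY KEY, LEGAL_NAME nvarchar(200));

-- 2. apply the changes set-based
MERGE INTO dbo.Bank_Guarantees WITH (HOLDLOCK) AS bg
USING dbo.tmpBG AS t
    ON bg.IATA_CODE = t.IATA_CODE
WHEN MATCHED THEN
    UPDATE SET bg.LEGAL_NAME = t.LEGAL_NAME
WHEN NOT MATCHED BY TARGET THEN
    INSERT (IATA_CODE, LEGAL_NAME) VALUES (t.IATA_CODE, t.LEGAL_NAME)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

-- 3. clean up
DROP TABLE dbo.tmpBG;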

Related

Pentaho Kettle (Spoon) - Delete Records From Different Tables

I'm trying to delete records in my target table based on whether records exist in the source table. I tried using a 'Delete' step, but I noticed that this step is based on a conditional clause.
My condition is quite simple "if the record/row does NOT exist in table A [source], delete the record/row from table B [destination]".
I also read about using a 'Merge Rows (diff)' step, but it seems to check/compare the entire set of tables for differences.
The table has several million records and many hundreds of columns, on a MySQL server; I need to do this in the most efficient way.
I'm reading table A with a Table input step and this SQL command:
SELECT id, user, password, attribute, op FROM viewuserradiusunisulma
Any help would be appreciated.
[screenshot: Pentaho transformation with a Delete step]
If your source and target tables are in the same database, you can use a SQL query to delete all records in tableB that don't have a corresponding entry in tableA:
delete from tableB where not exists (select id from tableA where tableA.id = tableB.id)
If the source and destination tables are not in the same database, you have to go through all rows in tableB and check whether each record exists in tableA. If your source tableA has a limited number of rows, loading the key values into memory and then performing a stream lookup instead of a database lookup would be much faster. I'd probably try that even with a higher number of rows, because of the significant performance impact.
Note: I hope I haven't messed up the SQL syntax; I'm thinking almost exclusively in ABAP at the moment and that messes with my memory a bit. So please test this on a backup before firing away.
I found the solution. In this case, I check the records, then report, update and insert the new data:
[screenshot: transformation]

Duplicate row detected during DML action - Snowflake - Talend

I want to load data into Snowflake with Talend. I used tSnowflakeOutput with the Upsert option because I want to insert the data if it does not exist in Snowflake, or update the rows if it does. I use the primary key to identify the rows that already exist.
When I run my job, I have the following error:
Duplicate row detected during DML action
I am aware that the problem is due to a row that already exists in Snowflake; I want to update that row, but all I get is this error.
Do you have an idea why?
Please help :)
The Talend connector is probably using Snowflake's MERGE operation internally. As mentioned by @mike-walton, the error is reported because MERGE does not accept duplicates in the source data. Since this is an insert-or-update-if-exists operation, when multiple source rows join to one target record the system cannot decide which source row to use.
From the docs:
When a merge joins a row in the target table against multiple rows in the source, the following join conditions produce nondeterministic results (i.e. the system is unable to determine the source value to use to update or delete the target row):
A target row is selected to be updated with multiple values (e.g. WHEN MATCHED ... THEN UPDATE)
Solution 1
One option, as mentioned in the documentation, is to set the ERROR_ON_NONDETERMINISTIC_MERGE parameter to FALSE. The merge will then just pick an arbitrary source row to update from.
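For example (a session-level sketch; the parameter can also be set at the account level):

alter session set error_on_nondeterministic_merge = false;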
Solution 2
Another option is to make the merge deterministic by using a MERGE query of the following form. This essentially de-duplicates the source table and lets you pick one of the duplicates as the preferred row for the update.
merge into target_table t
using (
    select *
    from source_table
    qualify
        row_number() over (
            partition by the_join_key
            order by some_ordering_column asc
        ) = 1
) s
on s.the_join_key = t.the_join_key
when matched then update set
    ...
when not matched then insert
    ...
;
Doing the same thing in Talend may just require a de-duplication step upstream in the ETL mapping.

SSIS synchronize two tables using lookup

I want to synchronize two tables, src and dest (source DB >> src table, destination DB >> dest table), using SSIS, so that any insert, update, or delete operation on src is applied to dest.
How can I achieve this using a Lookup transformation?
Thanks in advance.
Load table Dest into the lookup cache and look up the rows coming from table Src. Set the Lookup to redirect non-matching rows. The non-matching rows (present in table Src but not in table Dest) go to an OLE DB Destination that inserts them into table Dest.
For the matching records, write them to a physical or temp table and use an Execute SQL Task after the data flow to update those records in table Dest (see the sketch below).
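A minimal sketch of that update, assuming a hypothetical work table dbo.tmpMatched holding the matched rows and a hypothetical key column ID:

-- runs in an Execute SQL Task after the data flow
UPDATE d
SET d.Col1 = m.Col1,
    d.Col2 = m.Col2
FROM dbo.Dest AS d
JOIN dbo.tmpMatched AS m
    ON m.ID = d.ID;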
To speed up the process, try using a Cache Transform to populate the lookup cache.
You can also achieve the same thing with MERGE by following this article: Synchronize two tables using SSIS.

Compare two MySQL tables and remove rows that no longer exist

I have created a system using PHP/MySQL that downloads a large XML dataset, parses it and then inserts the parsed data into a MySQL database every week.
This system is made up of two databases with the same structure. One is a production database and one is a temporary database where the data is parsed and inserted into first.
When the data has been inserted into the temporary database, I perform a merge by inserting/replacing the data into the production database. All of this works so far, but I then realised that data removed from a new dataset will be left to linger in the production database.
I need to check whether each row in the production database is still present in the new data: if it is, leave it; if it isn't, delete the row from the production database so that stale rows aren't left to linger.
For arguments sake, let's say the two databases are called database_temporary and database_production.
How can I go about doing this?
If you are using SQL to merge, a simple SQL statement can do the delete as well:
delete from database_production.table
where pk not in (select pk from database_temporary.table)
Notes:
This assumes that a row can be uniquely identified, whether by a single column, multiple columns, or another mechanism.
If your dataset is large, a NOT EXISTS may perform better than NOT IN. See What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL? and NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server.
An example using NOT EXISTS:
delete from database_production.table p
where not exists (select 1 from database_temporary.table t where t.pk = p.pk)
Performance Notes:
As pointed out by @mgonzalez in the comments on the question, you may want to use a timestamp column (something like last_modified) for the comparing/merging in general, so that you compare only changed rows. This does not apply to the delete specifically: you cannot use a timestamp for the delete because the row to delete no longer exists in the new data.
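A minimal sketch of that idea, assuming a hypothetical last_modified column on both tables and a variable holding the time of the previous run (table names are placeholders, as above):

-- only re-merge rows that changed since the last run
set @last_merge_run = '2020-01-01 00:00:00';  -- placeholder value

replace into database_production.table
select *
from database_temporary.table t
where t.last_modified >= @last_merge_run;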

How to delete extra records from my destination table while pulling the data from a SQL database?

I have to pull data from a SQL Server database table into my DB2 table. If records already exist, UPDATE them; for new records, INSERT them; and for extra records in the destination table, DELETE those extra records. The destination table looks exactly like the source table. For INSERT/UPDATE I am fine; how do I do the DELETE from the destination table?
DB2 has a MERGE statement. It allows you to write a single SQL statement that does an INSERT, UPDATE, and DELETE based on conditions you define. It is a very clean way of doing this.
So add an Execute SQL Task element to your SSIS package and put the DB2 MERGE statement in that task.
See this link (there are examples at the bottom): http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=%2Fcom.ibm.db2.udb.admin.doc%2Fdoc%2Fr0010873.htm
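A minimal sketch of the pattern, assuming the source rows have already been staged into a DB2 table (all table and column names here are hypothetical). Note that DB2's MERGE only deletes rows it matches, so the extra destination rows are removed with a separate DELETE:

MERGE INTO dest d
USING (SELECT id, col1 FROM src_staging) s
    ON d.id = s.id
WHEN MATCHED THEN
    UPDATE SET d.col1 = s.col1
WHEN NOT MATCHED THEN
    INSERT (id, col1) VALUES (s.id, s.col1);

-- remove destination rows that no longer exist in the source
DELETE FROM dest d
WHERE NOT EXISTS (SELECT 1 FROM src_staging s WHERE s.id = d.id);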
If all you want is a copy of the source table, then avoid the complexity and delete the target entirely first; then everything is just an insert.