I want to load data into Snowflake with Talend. I used tSnowflakeOutput with the Upsert option because I want to insert data if it does not exist in Snowflake, or update rows if it does. I used the primary key to identify the rows that already exist.
When I run my job, I have the following error:
Duplicate row detected during DML action
I am aware that the problem is due to a row that already exists in Snowflake; I want to update that row, but all I get is this error.
Do you have an idea why?
Please help :)
The Talend connector might be internally using Snowflake's MERGE operation. As mentioned by @mike-walton, the error is reported because MERGE does not accept duplicates in the source data. Since it is an insert-or-update-if-exists operation, if multiple source rows join to a target record, the system is not able to decide which source row to use for the operation.
From the docs
When a merge joins a row in the target table against multiple rows in the source, the following join conditions produce nondeterministic results (i.e. the system is unable to determine the source value to use to update or delete the target row)
A target row is selected to be updated with multiple values (e.g. WHEN MATCHED ... THEN UPDATE)
Solution 1
One option, as mentioned in the documentation, is to set the ERROR_ON_NONDETERMINISTIC_MERGE session parameter to FALSE. Snowflake will then just pick an arbitrary source row to update from instead of raising the error.
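For example, this can be set for the current session before the load runs:
-- Suppress the nondeterministic-merge error; Snowflake then picks
-- an arbitrary source row when several match the same target row.
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE = FALSE;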
Solution 2
Another option is to make it deterministic by using a MERGE query of the following form. This essentially does a de-duplication on the source table and lets you pick one of the duplicates as the preferred one for the update.
merge into target_table t
using (
    -- de-duplicate the source: keep exactly one row per join key
    select *
    from source_table
    qualify
        row_number() over (
            partition by the_join_key
            order by some_ordering_column asc
        ) = 1
) s
on s.the_join_key = t.the_join_key
when matched then update set
    ...
when not matched then insert
    ...
;
Doing the same thing in Talend may just require a dedup operation upstream in the ETL mapping.
Related
I'm trying to delete records in my target table based on whether records exist in the source table. I tried using a 'Delete' step, but I noticed that this step is based on a conditional clause.
My condition is quite simple "if the record/row does NOT exist in table A [source], delete the record/row from table B [destination]".
I also read about using a 'Merge Rows (diff)' step, but it seems to check/compare the entire set of tables for differences.
The table has several million records with many hundreds of columns on a MySQL server, so I need to do this in the most efficient way.
I'm querying table A with the Table input step and this SQL command:
SELECT id, user, password, attribute, op FROM viewuserradiusunisulma
Any help would be appreciated.
[screenshots: Pentaho transformation and the Delete step]
If your source and target tables are in the same database, you can use a SQL query to delete all records in tableB that don't have a corresponding entry in tableA:
delete from tableB where not exists (select id from tableA where tableA.id = tableB.id);
If the source and destination tables are not in the same database, you would have to go through all rows in tableB and check whether each record exists in tableA. If your source tableA has a limited number of rows, loading the key values into memory and then performing a stream lookup instead of a database lookup would be much faster. I'd probably try that even with a higher number of rows because of the significant performance impact.
Note: I hope I haven't messed up the SQL syntax; I'm thinking almost exclusively in ABAP at the moment and that messes with my memory a bit. So please test this on some backup before firing away.
I found the solution. In this case, I check the records, then report, update and insert the new data.
[screenshot: transformation]
I have a database table of 100,000 rows, imported from a CSV file each week using an SSIS package.
Usually the import updates rows, but sometimes it adds new ones.
I see a few exceptions from the staging table during the row updates, and I don't know why. Also, how do I update the destination table from staging?
This is the MERGE code:
MERGE INTO [PWCGFA_BG].[dbo].[Bank_Guarantees] WITH (HOLDLOCK) AS bg
USING [PWCGFA_BG].[dbo].[stagingBG] AS stgbg
ON bg.IATA_CODE = stgbg.IATA_CODE
WHEN MATCHED THEN
UPDATE SET
bg.LEGAL_NAME=stgbg.LEGAL_NAME,
bg.TRADING_NAME=stgbg.TRADING_NAME,
bg.COUNTRY=stgbg.COUNTRY,
bg.CURRENCY=stgbg.CURRENCY,
bg.LANGUAGE=stgbg.LANGUAGE,
bg.STATUS=stgbg.STATUS,
bg.BANK_NAME=stgbg.BANK_NAME,
bg.BANK_GUARANTEE_AMOUNT=stgbg.BANK_GUARANTEE_AMOUNT,
bg.BANK_GUARANTEE_CURRENCY=stgbg.BANK_GUARANTEE_CURRENCY,
bg.BANK_GUARANTEE_EXPIRY_DATE=stgbg.BANK_GUARANTEE_EXPIRY_DATE,
bg.ACCREDITATION_DATE=stgbg.ACCREDITATION_DATE,
bg.CLASS_PAX_OR_CGO=stgbg.CLASS_PAX_OR_CGO,
bg.LOCATION_TYPE=stgbg.LOCATION_TYPE,
bg.XREF=stgbg.XREF,
bg.IRRS=stgbg.IRRS,
bg.TAX_CODE=stgbg.TAX_CODE,
bg.COUNTRY_CODE=stgbg.COUNTRY_CODE,
bg.CITY=stgbg.CITY,
bg.DEF=stgbg.DEF,
bg.OWN_SHARE_CHANGE=stgbg.OWN_SHARE_CHANGE
WHEN NOT MATCHED BY TARGET THEN
INSERT (IATA_CODE,LEGAL_NAME,TRADING_NAME,COUNTRY,CURRENCY,LANGUAGE,STATUS,BANK_NAME,BANK_GUARANTEE_AMOUNT,BANK_GUARANTEE_CURRENCY,BANK_GUARANTEE_EXPIRY_DATE,ACCREDITATION_DATE,CLASS_PAX_OR_CGO,LOCATION_TYPE,XREF,IRRS,TAX_CODE,COUNTRY_CODE,CITY,DEF,OWN_SHARE_CHANGE)
VALUES (stgbg.IATA_CODE,stgbg.LEGAL_NAME,stgbg.TRADING_NAME,stgbg.COUNTRY,stgbg.CURRENCY,stgbg.LANGUAGE,stgbg.STATUS,stgbg.BANK_NAME,stgbg.BANK_GUARANTEE_AMOUNT,stgbg.BANK_GUARANTEE_CURRENCY,stgbg.BANK_GUARANTEE_EXPIRY_DATE,stgbg.ACCREDITATION_DATE,stgbg.CLASS_PAX_OR_CGO,stgbg.LOCATION_TYPE,stgbg.XREF,stgbg.IRRS,stgbg.TAX_CODE,stgbg.COUNTRY_CODE,stgbg.CITY,stgbg.DEF,stgbg.OWN_SHARE_CHANGE)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
If your source (staging) and destination tables are on the same server, you can use a MERGE statement in an Execute SQL Task, which is much faster and more effective than a Lookup, which operates row by row.
But if the destination is on a different server, you have the following options:
Use a Lookup to update the matching rows with an OLE DB Command (UPDATE statement).
Use a Merge Join (with a LEFT OUTER JOIN) to identify the new and matching records, and then use a Conditional Split to INSERT or UPDATE records. This works the same way as the Lookup but is faster.
Create a temporary table in the destination database, dump the data from staging into that table, and then use the MERGE statement; this is faster than using a lookup. A rough sketch of this option follows.
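For instance (the staging copy name and the linked-server path are assumptions, not part of the original setup):
-- 1) Stage the incoming rows on the destination server; an SSIS data flow
--    or a linked-server copy both work.
SELECT * INTO dbo.stagingBG_copy
FROM [SourceServer].[PWCGFA_BG].[dbo].[stagingBG];
-- 2) Run the MERGE locally against the copy, as in the statement above
--    (column lists abbreviated here).
MERGE INTO dbo.Bank_Guarantees WITH (HOLDLOCK) AS bg
USING dbo.stagingBG_copy AS stgbg
ON bg.IATA_CODE = stgbg.IATA_CODE
WHEN MATCHED THEN
UPDATE SET bg.LEGAL_NAME = stgbg.LEGAL_NAME
WHEN NOT MATCHED BY TARGET THEN
INSERT (IATA_CODE, LEGAL_NAME) VALUES (stgbg.IATA_CODE, stgbg.LEGAL_NAME)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;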
I have created a system using PHP/MySQL that downloads a large XML dataset, parses it and then inserts the parsed data into a MySQL database every week.
This system is made up of two databases with the same structure. One is a production database and one is a temporary database where the data is parsed and inserted into first.
When the data has been inserted into the temporary database, I perform a merge by inserting/replacing the data in the production database. I have done all of the above so far. I then realised that data which has been removed in a new dataset will be left to linger in the production database.
I need to perform a check to see if the new data is still in the production database, if it is then leave it, if it isn't delete the row from the production database so that the rows aren't left to linger.
For argument's sake, let's say the two databases are called database_temporary and database_production.
How can I go about doing this?
If you are using SQL to merge, a simple SQL statement can do the delete as well:
delete from database_production.table
where pk not in (select pk from database_temporary.table)
Notes:
This assumes that a row can be uniquely identified. This may be based on a single column, multiple columns or another mechanism.
If your dataset is large, a not exists may perform better than not in. See What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL? and NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
An example with not exists:
delete p
from database_production.table p
where not exists (select 1 from database_temporary.table t where t.pk = p.pk);
Performance Notes:
As pointed out by @mgonzalez in the comments on the question, you may want to use a timestamp column (something like last_modified) for comparing/merging in general, so that you compare only changed rows. This does not apply to the delete specifically; you cannot use a timestamp for the delete because the row would no longer exist.
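For illustration, a sketch of that timestamp idea applied to the copy step (it assumes both tables carry a last_modified column, which the question does not mention):
-- Only copy rows changed since the last run; the cutoff value is hypothetical.
replace into database_production.table
select * from database_temporary.table
where last_modified >= '2015-01-01 00:00:00';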
I have a MySQL DB which is using strict mode, so I need to fill in all NOT NULL values when I insert a row. The API I'm creating uses just the ON DUPLICATE KEY UPDATE functionality to do both inserts and updates.
The client application complains if any NOT NULL attributes are missing on insert, which is expected.
Basic example (id is the primary key and there are two NOT NULL fields, aaa and xxx):
INSERT INTO tablename (aaa, xxx, id ) VALUES ( "value", "value", 1)
ON DUPLICATE KEY UPDATE aaa=VALUES(aaa), xxx=VALUES(xxx)
All good so far. Once the row is inserted, the system allows updates. Nevertheless, I get the following error when updating only one of the fields:
INSERT INTO tablename (aaa, id ) VALUES ( "newValue", 1)
ON DUPLICATE KEY UPDATE aaa=VALUES(aaa)
java.sql.SQLException: Field 'xxx' doesn't have a default value
This exception is a lie, as the row is already inserted and the xxx attribute has "value" as its value. I would expect the statement above to be equivalent to:
UPDATE tablename SET aaa="newValue" WHERE id=1
I would be glad if someone could shed some light on this issue.
Edit:
I can use the SQL query in phpMyAdmin successfully to update just one field, so I am afraid this is not a SQL problem but a driver problem with JDBC. That may have no solution, then.
@Marc B: Your insight is probably true and would indicate what I just described. That would mean there is a bug in JDBC, as it should not do that check when the insert is of the ON DUPLICATE type, since there may be a default value for the row after all. I can't provide real table data, but I believe everything explained above is quite clear.
@ruakh: It does not fail to insert, nor am I expecting delayed validation. One requirement I have is that both inserts and updates are done using the same query, as the servlet does not know whether the row exists or not. The Java API service only fails to update a row that has NOT NULL fields which were already filled when the insert was done. The exception is a lie because the field DOES have a value, as it was inserted before the update.
This is a typical case of DRY / SRP fail; in an attempt to not duplicate code you've created a function that violates the single responsibility principle.
The semantics of an INSERT statement are that you expect no conflicting rows; the ON DUPLICATE KEY UPDATE option is merely there to avoid handling the conflict in your own code with a separate query. This is quite different from an UPDATE statement, where you would expect at least one matching row to be present.
Imagine that MySQL only checked the columns when an INSERT doesn't conflict: if for some reason a row was just removed from the database, your code that expects to perform an update would have to deal with an exception it doesn't expect. Given the difference in statement behaviour, it is good practice to separate your insert and update logic.
Theory aside, MySQL puts together an execution plan when a query is run; in the case of an INSERT statement it has to assume that the insert might succeed, because that is the most optimal strategy: it avoids having to check indexes and so on, only to find out later that a column value is missing.
This is per design and not a bug in JDBC.
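A minimal sketch of that separation, reusing the example table from the question (checking the affected-row count is assumed to happen in the calling code):
-- Try the update first; the driver reports how many rows were affected.
UPDATE tablename SET aaa = 'newValue' WHERE id = 1;
-- If zero rows were affected, the row does not exist yet: fall back to a
-- full INSERT that supplies every NOT NULL column.
INSERT INTO tablename (aaa, xxx, id) VALUES ('newValue', 'value', 1);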
Here is a chunk of the SQL I'm using for a Perl-based web application. I have a number of requests, each request has a number of accessions, and each accession has a status. This chunk of code is there to update the table for every accession_analysis that shares all these fields, for each accession in a request.
UPDATE accession_analysis
SET analysis_id = ?,
    reference_id = ?,
    status = ?,
    extra_parameters = ?
WHERE analysis_id = ?
  AND reference_id = ?
  AND status = ?
  AND extra_parameters = ?
  AND accession_id IN (
    SELECT accession_id
    FROM accessions
    WHERE request_id = ?
  )
I have changed the tables so that there's a status table for accession_analysis, so when I update, I update both accession_analysis and accession_analysis_status, which has status, status_text and the id of the accession_analysis, which is a NOT NULL auto_increment column.
I have no strong idea about how to modify this code to allow this. My first pass grabbed all the accessions and looped through them, then filtered for all the fields, then updated. I didn't like that because I had many connections with short SQL commands, which I understood to be bad, but I can't help but think the only way to really do this is to go back to the loop in Perl holding two simpler SQL statements.
Is there a way to do this in SQL that, with my relative SQL inexperience, I'm just not seeing?
The answer depends on which DBMS you're using. The easiest way is to create a trigger on one table that provides the logic of updating the other table. (For any DB newbies -- a trigger is procedural code attached to a table at the DBMS (not application) layer that runs in response to an insert, update or delete on the table.). A similar, slightly less desirable method is to put the logic in a stored procedure and execute that instead of the update statement you're now using.
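A hedged sketch of the trigger approach in MySQL; the column names and the link between the two tables are assumptions based on the question, not the asker's actual schema:
-- Keep accession_analysis_status in step with accession_analysis.
DELIMITER //
CREATE TRIGGER accession_analysis_after_update
AFTER UPDATE ON accession_analysis
FOR EACH ROW
BEGIN
    UPDATE accession_analysis_status
    SET status = NEW.status
    WHERE accession_analysis_id = NEW.id;
END//
DELIMITER ;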
If the DBMS you're using doesn't support either of these mechanisms, then there isn't a good way to do what you're after while guaranteeing transactional integrity. However, if the problem you're solving can tolerate a timing difference in the two tables' updates (i.e. the data in one of the tables is only used at predetermined times, like reporting or some type of batched operation) you could write to one table (live) and create a separate process that runs when needed (later) to update the second table using data from the first table. The correctness of allowing data to be updated at different times becomes a large and immovable design assumption, however.
If this is mostly about connection speed, then one option you have is to write a stored procedure that handles the "double update or insert" transparently. See the manual for stored procedures:
http://dev.mysql.com/doc/refman/5.5/en/create-procedure.html
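A minimal sketch of such a procedure, with hypothetical parameter names and the same assumed schema as the trigger example above:
DELIMITER //
CREATE PROCEDURE update_analysis_and_status(
    IN p_analysis_id INT,
    IN p_status VARCHAR(64),
    IN p_status_text TEXT
)
BEGIN
    -- One UPDATE per table; a single statement cannot touch both.
    UPDATE accession_analysis
    SET status = p_status
    WHERE analysis_id = p_analysis_id;
    UPDATE accession_analysis_status
    SET status = p_status, status_text = p_status_text
    WHERE accession_analysis_id = p_analysis_id;
END//
DELIMITER ;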
Otherwise, you probably cannot do it in one statement; see the MySQL INSERT syntax:
http://dev.mysql.com/doc/refman/5.5/en/insert.html
The UPDATE syntax allows for multi-table updates (not in combination with INSERT, though):
http://dev.mysql.com/doc/refman/5.5/en/update.html
Each table needs its own INSERT / UPDATE in the query.
In fact, even if you create a view by JOINing multiple tables, when you INSERT into the view, you can only INSERT with fields belonging to one of the tables at a time.
The modifications made by the INSERT statement cannot affect more than one of the base tables referenced in the FROM clause of the view. For example, an INSERT into a multitable view must use a column_list that references only columns from one base table. For more information about updatable views, see CREATE VIEW.
Inserting data into multiple tables through an sql view (MySQL)
INSERT (SQL Server)
The same is true of UPDATE:
The modifications made by the UPDATE statement cannot affect more than one of the base tables referenced in the FROM clause of the view. For more information on updatable views, see CREATE VIEW.
However, you can have multiple INSERTs or UPDATEs per query or stored procedure.
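For completeness, the same pattern as plain statements, one UPDATE per table, wrapped in a transaction (the names reuse the assumed schema from the sketches above, and the id is hypothetical):
START TRANSACTION;
UPDATE accession_analysis
SET status = 'complete'
WHERE analysis_id = 42;
UPDATE accession_analysis_status
SET status = 'complete'
WHERE accession_analysis_id = 42;
COMMIT;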