I have a data flow from source tables to a destination table. To simplify the question, say there are two merge-joined source tables and one destination table. Each record is identified by a primary key.
The package runs every day. If a record is deleted from a source table, how can I tell which one was deleted so that I can delete it from the destination table?
(FYI: I already check whether a record exists in the destination table and, if so, update it, else insert it, but I don't know how to find deleted data.)
Another possible approach:
Assuming you receive all records from source, not just imports and updates:
Amend package to stamp records that have been inserted or updated using a unique id or run datetime
Following the package run, process the destination table where records weren't inserted or updated in the last package run. By a process of elimination, any records that weren't provided in the source file should be deleted.
Again, assuming that all records are sent, not just imports and updates. But then again, if you don't receive all records, it's going to be physically impossible to detect if a record has been deleted.
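The stamping idea above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the table `destination`, the column `last_run_id` and the function `load_and_prune` are made-up names, and it assumes every run delivers the full set of source rows.

```python
import sqlite3

def load_and_prune(conn, source_rows, run_id):
    """Update-else-insert every source row, stamp it with run_id,
    then delete destination rows that this run did not touch."""
    cur = conn.cursor()
    for pk, payload in source_rows:
        # Try the update first and stamp the row with the current run.
        cur.execute(
            "UPDATE destination SET payload = ?, last_run_id = ? WHERE id = ?",
            (payload, run_id, pk),
        )
        if cur.rowcount == 0:
            # No existing row was updated, so insert a new stamped row.
            cur.execute(
                "INSERT INTO destination (id, payload, last_run_id) VALUES (?, ?, ?)",
                (pk, payload, run_id),
            )
    # Rows still carrying an older stamp were absent from the source.
    cur.execute("DELETE FROM destination WHERE last_run_id <> ?", (run_id,))
    deleted = cur.rowcount
    conn.commit()
    return deleted

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE destination ("
    "id INTEGER PRIMARY KEY, payload TEXT, last_run_id INTEGER NOT NULL)"
)

# Run 1 delivers three rows; run 2 no longer contains id 2.
load_and_prune(conn, [(1, "a"), (2, "b"), (3, "c")], run_id=1)
removed = load_and_prune(conn, [(1, "a"), (3, "c2")], run_id=2)
remaining = conn.execute(
    "SELECT id, payload FROM destination ORDER BY id"
).fetchall()
```

A run datetime would work in place of the integer run id, as long as every row touched in one run gets the same value.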
The problem with comparing source to destination is that you have to compare every source row to the destination in every load, and as the number of rows increases that takes up more and more time.
As a result, the best way to handle this is probably on the source side. Two common approaches are a 'soft delete' where you set a flag column to mark the row as deleted; or a trigger that records the PK of the deleted row in a log table (or moves the entire row to an archive log table). Your ETL process then looks at the flags or the log/archive table to determine which rows were deleted since the last load.
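As a rough sketch of the trigger-and-log-table variant, here is a small self-contained example using Python's sqlite3 module. The names `source`, `source_deleted_log`, `trg_source_delete` and `destination` are invented for the illustration, and trigger syntax differs between database platforms, so treat it as the shape of the idea rather than something to paste into SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT);

-- Log table holding the PK of every deleted source row.
CREATE TABLE source_deleted_log (
    id INTEGER NOT NULL,
    deleted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The trigger records the PK whenever a source row is deleted.
CREATE TRIGGER trg_source_delete AFTER DELETE ON source
BEGIN
    INSERT INTO source_deleted_log (id) VALUES (OLD.id);
END;

CREATE TABLE destination (id INTEGER PRIMARY KEY, payload TEXT);
""")

conn.executemany(
    "INSERT INTO source VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
conn.execute("INSERT INTO destination SELECT id, payload FROM source")

# Someone deletes a row on the source side; the trigger logs it.
conn.execute("DELETE FROM source WHERE id = 2")

# ETL step: remove every logged PK from the destination.
conn.execute(
    "DELETE FROM destination WHERE id IN (SELECT id FROM source_deleted_log)"
)
conn.commit()

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM destination ORDER BY id")]
logged_ids = [r[0] for r in conn.execute("SELECT id FROM source_deleted_log")]
```

In a real load you would probably filter the log on `deleted_at` (or clear it after each run) so that only deletions since the last load are applied.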
Another possibility is that the source platform offers some built-in feature you can use to track deleted rows, e.g. CDC in SQL Server. But if you have no control at all over the source database (if it even is a database) then there may be no alternative to comparing the full data set.
One possible approach:
Prior to running package, delete the destination table records (using a stored procedure)
Just import all records in to destination table
Pros:
Your destination table will always mirror the incoming data, no need to check for deletions
Cons:
You won't have any historical information (if that is required)
I had the same problem, as in how to mark my old/archive records as being "deleted" because they no longer exist in the original data source.
Basically, I built two tables, where one is the main table containing all the records that came in from the original data source, and a temporary table I kept to store the original data source every time I ran my scripts.
MAIN TABLE
ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc
plus a STATUS column (1 for Active, 0 for Deleted)
TEMP TABLE same as the original, but without STATUS column
ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc
The key was to update the MAIN TABLE with STATUS = 0 if the ID in the MAIN table was no longer in the TEMP table, i.e. the source record had been deleted.
I did it like this:
UPDATE m
SET m.Status = 0
FROM tblMAIN AS m
LEFT JOIN tblTEMP AS t
ON t.ID = m.ID
WHERE t.ID IS NULL
I'm trying to delete records in my target table based on whether records exist in source table. I tried using a 'Delete' step, but I noticed that this step is based on a conditional clause.
My condition is quite simple "if the record/row does NOT exist in table A [source], delete the record/row from table B [destination]".
I also read about using a 'Merge Rows (diff)' step, but it seems to check/compare the entire set of tables for differences.
The table has several million records with many hundreds of columns in a MySQL server, I need to do this in the most efficient way.
I'm reading table A with a Table input step and this SQL command:
SELECT id, user, password, attribute, op FROM viewuserradiusunisulma
Any help would be appreciated.
If your source and target tables are in the same database, you can use a SQL query to delete all records in tableB that don't have a corresponding entry in tableA:
delete from tableB where not exists (select 1 from tableA where tableA.id = tableB.id)
If the source and destination tables are not in the same database, you would have to go through all rows in tableB and check whether each record exists in tableA. If your source tableA has a limited number of rows, loading the key values into memory and then performing a stream lookup instead of a database lookup would be much faster. I'd probably try that even with a higher number of rows because of the significant performance impact.
Note: I hope I haven't messed up the SQL syntax; I'm thinking almost exclusively in ABAP at the moment and that messes with my memory a bit. So please test this on a backup before firing away.
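Outside of Pentaho, the in-memory lookup for the two-database case can be sketched like this. It is a simplified illustration in Python with sqlite3, using two separate connections to stand in for the two databases; `find_orphaned_keys`, `delete_keys` and the batch size are assumptions of the example, not part of any tool.

```python
import sqlite3

def find_orphaned_keys(source_conn, dest_conn):
    """Return destination keys that no longer exist in the source."""
    # Load all source keys into memory once (the in-memory lookup).
    source_keys = {row[0] for row in source_conn.execute("SELECT id FROM tableA")}
    # Stream over the destination keys and keep those without a match.
    return [
        row[0]
        for row in dest_conn.execute("SELECT id FROM tableB ORDER BY id")
        if row[0] not in source_keys
    ]

def delete_keys(dest_conn, keys, batch_size=500):
    """Delete the given keys from tableB in batches."""
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        dest_conn.execute(
            f"DELETE FROM tableB WHERE id IN ({placeholders})", batch
        )
    dest_conn.commit()

# Two independent in-memory databases stand in for source and destination.
source_conn = sqlite3.connect(":memory:")
dest_conn = sqlite3.connect(":memory:")
source_conn.execute("CREATE TABLE tableA (id INTEGER PRIMARY KEY)")
dest_conn.execute("CREATE TABLE tableB (id INTEGER PRIMARY KEY)")
source_conn.executemany("INSERT INTO tableA VALUES (?)", [(1,), (2,), (4,)])
dest_conn.executemany("INSERT INTO tableB VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])

orphaned = find_orphaned_keys(source_conn, dest_conn)
delete_keys(dest_conn, orphaned)
remaining = [r[0] for r in dest_conn.execute("SELECT id FROM tableB ORDER BY id")]
```

Only the key column is read from either side, so the memory cost depends on the number of rows and the key size, not on the hundreds of other columns.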
I found the solution. In this case, I check the records, then report, update and enter the new data
We have the below requirement:
Currently, we get the data from source (another server, another team, another DB) into a temp DB (via batch jobs) and after we get data into our temp DB, we process the data, transform and update our primary DB with the difference (i.e. the records that changed or the newly added records).
Source->tempDB (daily recreated)->delta->primaryDB
Requirement:
- To delete the data in primary DB once it's deleted in source.
Ex: suppose a record with ID=1 is created in source, it comes to temp DB and eventually makes it to primary DB. When this record is deleted in source, it should get deleted in primary DB also.
Challenge:
How do we delete from primary DB when there is nothing to refer to in temp DB (since the record is already deleted in source, nothing comes into tempDB)?
Naive approach:
- We can clean up primary DB before every transform and load afresh. However, it takes a significant amount of time to clean up and populate primary DB every time.
You could create triggers on each table that fill a history table with deleted entries. Sync that over to your tempDB and use it to delete the corresponding rows in your primary DB.
You either want one "delete-history-table" per table or a combined history table that also includes the tablename which triggered the deletion.
You might want to look into SQL Compare or other tools for synching tables.
If you have access to tempDB and primeDB (same server or linked servers) at the same time, you could also try a
delete from primeDB.Tablename
where not exists (
    select 1
    from tempDB.Tablename
    where tempDB.Tablename.id = primeDB.Tablename.Id
)
which will perform awfully - ask your db designers.
In this scenario, if tempDB and the primary DB have no direct reference to each other, you can use event notifications at the database level to track changes.
Here is the link I found on this:
https://www.mssqltips.com/sqlservertip/2121/event-notifications-in-sql-server-for-tracking-changes/
I have created a system using PHP/MySQL that downloads a large XML dataset, parses it and then inserts the parsed data into a MySQL database every week.
This system is made up of two databases with the same structure. One is a production database and one is a temporary database where the data is parsed and inserted into first.
When the data has been inserted into the temporary database I perform a merge by inserting/replacing the data in the production database. I have done all of the above so far. I then realised, data that might have been removed in a new dataset will be left to linger in the production database.
I need to perform a check to see if the new data is still in the production database, if it is then leave it, if it isn't delete the row from the production database so that the rows aren't left to linger.
For argument's sake, let's say the two databases are called database_temporary and database_production.
How can I go about doing this?
If you are using SQL to merge, a simple SQL can do the delete as well:
delete from database_production.table
where pk not in (select pk from database_temporary.table)
Notes:
This assumes that each row can be uniquely identified. This may be based on a single column, multiple columns or another mechanism.
If your dataset is large, a not exists may perform better than not in. See What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL? and NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
An example not exists:
delete from database_production.table p
where not exists (select 1 from database_temporary.table t where t.pk = p.pk)
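Apart from performance, the two forms can also behave differently when the key column in the subquery may contain NULL: NOT IN then matches nothing, while NOT EXISTS still finds the missing rows. A small sketch in Python with sqlite3, using SELECT in place of DELETE so the effect is visible; the tables `production` and `temporary_load` are invented for the demonstration and the key is deliberately left nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production (pk INTEGER);
CREATE TABLE temporary_load (pk INTEGER);
INSERT INTO production VALUES (1), (2), (3);
-- A NULL key has slipped into the freshly loaded data.
INSERT INTO temporary_load VALUES (1), (NULL);
""")

# NOT IN: comparing against a set that contains NULL is never true,
# so no production row qualifies for deletion.
not_in_rows = conn.execute(
    "SELECT pk FROM production "
    "WHERE pk NOT IN (SELECT pk FROM temporary_load) ORDER BY pk"
).fetchall()

# NOT EXISTS: rows 2 and 3 have no match and are reported as expected.
not_exists_rows = conn.execute(
    "SELECT pk FROM production p "
    "WHERE NOT EXISTS (SELECT 1 FROM temporary_load t WHERE t.pk = p.pk) "
    "ORDER BY pk"
).fetchall()
```

If pk is a real primary key declared NOT NULL on both sides, this difference does not arise and the choice comes down to performance.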
Performance Notes:
As pointed out by @mgonzalez in the comments on the question, you may want to use a timestamp column (something like last modified) for comparing/merging in general so that you compare only changed rows. This does not apply to the delete specifically: you cannot use a timestamp for the delete because, well, the row would not exist.
I need a little advice concerning a MySQL operation:
There is a database A which contains several tables. With a query I selected a set of entries from this database in order to copy the results into a table in database B.
Now the table in database B contains the results of my query on database A.
For instance the query is:
SELECT names.name, ages.age FROM A.names names, A.ages ages WHERE ages.name = names.name;
And to copy these results into database B I would run:
INSERT INTO B.persons (SELECT names.name, ages.age FROM A.names names, A.ages ages WHERE ages.name = names.name);
Here's my question: When the data of database A has changed I want to run an "update" on the table of database B.
So, the easy and dirty approach would be: Truncate the table in database B, re-run the query on database A and copy the result back to database B.
But isn't there a smarter way so that only new result rows of that query will be copied and those entries in database B which are not in database A anymore get deleted?
In short: Is there a way to "augment" the table of database B with new entries and "prune" old entries out?
Thanks for your help
I would do two things:
1) Ensure you have a primary key that's either an integer or a unique combination of columns at a minimum in database B
2) Use logical deletes instead of physical deletes i.e. have a boolean deleted column
Point 2 ensures you never have to delete and lose data, you just update the flag and in your queries put where deleted = 0 or where deleted is null.
When combined with a primary key it means everything can be handled easily by an INSERT ... ON DUPLICATE KEY UPDATE, which will insert new rows and update existing ones - which means it can perform your 'deletes' at the same time too.
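One way to read the suggestion above, sketched in Python with sqlite3: flag every row as deleted, then upsert the current query result with the flag cleared, all in one transaction. SQLite spells the upsert as INSERT ... ON CONFLICT ... DO UPDATE (available from SQLite 3.24), which stands in here for MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The `persons` layout, the `deleted` column and `sync_with_soft_delete` are assumptions made for the example.

```python
import sqlite3

def sync_with_soft_delete(conn, rows):
    """Mark all rows deleted, then upsert current rows with deleted = 0."""
    with conn:  # commits on success, rolls back on error
        # Step 1: assume everything is gone until proven otherwise.
        conn.execute("UPDATE persons SET deleted = 1")
        # Step 2: insert new rows, update existing ones, clear the flag.
        conn.executemany(
            "INSERT INTO persons (name, age, deleted) VALUES (?, ?, 0) "
            "ON CONFLICT(name) DO UPDATE SET age = excluded.age, deleted = 0",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons ("
    "name TEXT PRIMARY KEY, age INTEGER, deleted INTEGER NOT NULL DEFAULT 0)"
)

# First load, then a second load where bob has vanished from the source.
sync_with_soft_delete(conn, [("ann", 30), ("bob", 41)])
sync_with_soft_delete(conn, [("ann", 31), ("cid", 25)])

result = conn.execute(
    "SELECT name, age, deleted FROM persons ORDER BY name"
).fetchall()
```

Rows that disappear from the source stay in the table with deleted = 1, so queries on the table need the where deleted = 0 filter mentioned above.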
What you describe sounds like you want to replicate the table. There is no simple quick fix for what you describe. You could of course write some application logic to do it but it would not be so efficient as it would have to compare each entry in each table and then delete or update accordingly.
One solution would be to set up a foreign-key index between A and B and cascade updates and deletes to B. But this would only partly solve the problem. It would drop rows in B if they were deleted in A and it would update a key column in B if it were updated in A. But it would not update the other columns. Note also that this would require your table type to be InnoDB.
Another would be to run inserts on B with A's values but use
INSERT ON DUPLICATE KEY UPDATE....
Again this would work fine for updates but not for Deletes.
You could try to set up actual MySQL replication, but this is perhaps beyond the scope of your problem and is more involved.
Finally you could set up the foreign key index as described above and write a trigger so that whenever an update is applied to A, the corresponding row in B is also updated. This seems like a plausible solution for you, though not the cleanest, I would admit.
It would seem that a small batch script, run periodically on whichever environment you're running on, to duplicate the table would be the best way to achieve what you are looking for.
I'm working on a database right now and I have a pretty specific problem that I'm trying to figure out:
I have a large master table in my database with all the info we're gathering. We're updating records in this master table based on Excel files returned to us by various team members across the company - all of the records have unique ID numbers so we know what fields in the master table to update. We are tracking who responds by updating the file name into the master table as well. I want to update this with the file name; however, if two sources give me the same data, I want to append the second file to the first file rather than replace it with an update.
The problem is, I need the query to "know" when to update and when to append. Is there some IF statement I can use - maybe Update when Null, Append when Not Null?
You can refer to an Excel sheet or range in a query:
INSERT INTO Table1 ( ADate )
SELECT SomeDate FROM [Excel 8.0;HDR=YES;DATABASE=Z:\Docs\Test.xls].[Sheet1$a1:a4]
WHERE SomeDate Is Not Null
This means that you can run queries based on the presence or absence of data in the Excel file.