Pentaho Kettle (Spoon) - Delete Records From Different Tables - MySQL

I'm trying to delete records in my target table based on whether records exist in source table. I tried using a 'Delete' step, but I noticed that this step is based on a conditional clause.
My condition is quite simple "if the record/row does NOT exist in table A [source], delete the record/row from table B [destination]".
I also read about using a 'Merge Rows (diff)' step, but it seems to check/compare the entire set of tables for differences.
The table has several million records and many hundreds of columns on a MySQL server, so I need to do this in the most efficient way.
I'm searching table A with a Table input step and this SQL command:
SELECT id, user, password, attribute, op FROM viewuserradiusunisulma
Any help would be appreciated.
[Screenshot: Pentaho transformation]

If your source and target tables are in the same database, you can use a SQL query to delete all records in tableB that don't have a corresponding entry in tableA:
DELETE FROM tableB WHERE NOT EXISTS (SELECT 1 FROM tableA WHERE tableA.id = tableB.id)
If the source and destination tables are not in the same database, you would have to go through all rows in tableB and check whether each record exists in tableA. If your source tableA has a limited number of rows, loading the key values into memory and then performing a stream lookup instead of a database lookup will be much faster. I'd probably try that even with a higher number of rows, because row-by-row database lookups have a significant performance impact.
Note: I hope I haven't messed up the SQL syntax; I'm thinking almost exclusively in ABAP at the moment and that messes with my memory a bit. So please test this on a backup before firing away.
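For the same-database case, an anti-join delete is another form worth benchmarking in MySQL. This is only a sketch, assuming id is indexed in both tables:
DELETE tableB
FROM tableB
LEFT JOIN tableA ON tableA.id = tableB.id
WHERE tableA.id IS NULL;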

I found the solution. In this case, I check the records, then report, update, and insert the new data.
[Screenshot: Transformation]

Transfer data from one table to another in the same database

Is this the right syntax:
INSERT INTO stock (Image)
SELECT Image,
FROM productimages
WHERE stock.Name_of_item = productimages.number;
SQL Server Management Studio's "Import Data" task (right-click on the DB name, then tasks) will do most of this for you. Run it from the database you want to copy the data into.
If the tables don't exist it will create them for you, but you'll probably have to recreate any indexes and such. If the tables do exist, it will append the new data by default but you can adjust that (edit mappings) so it will delete all existing data.
I use this all the time and it works fairly well.
INSERT INTO bar..tblFoobar( *fieldlist* )
SELECT *fieldlist* FROM foo..tblFoobar
This just moves the data. If you want to move the table definition (and other attributes such as permissions and indexes), you'll have to do something else.
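If you also need the destination table to be created from the source table's column definitions (data included, but without indexes, constraints, or permissions), a SELECT ... INTO sketch like this works in SQL Server, assuming tblFoobar does not already exist in bar:
SELECT *
INTO bar..tblFoobar
FROM foo..tblFoobar;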
The query logic you are trying does not appear to be correct (the query itself is buggy).
Assuming you have the correct query for the above logic, where you insert new rows into the table stock by selecting a column from the productimages table with a matching record on stock.Name_of_item = productimages.number, that logic will still add redundant data to the table.
You are perhaps looking to update instead of insert, something like:
update stock s
join productimages p on p.number = s.Name_of_item
set s.Image = p.Image

Create a view or new table for caching records

I'm experiencing huge performance problem in one legacy application.
There is a search form where user can search records with given value.
A result row contains 10 columns. A stored procedure (SP) then returns any row that contains the given value in any column.
This SP uses 8 tables, and some of them have about a million records. Every minute I get a new record. This SP does paging as well.
Execution of this SP sometimes takes around 40 seconds.
What I did was create a new table and put all the records into it, using the query from this SP but without the conditions.
When there is an insert or update in one of the source tables, I use a trigger to update this new "cache" table.
Now getting results from this new table takes only 1-3 seconds.
Does anyone have experience with something like this?
One of my colleagues said I had better use a view, but then I would be doing the JOINs every time.
What do you think? Is there another way?
Often times temporary tables can help you resolve performance issues. One approach might be to collect only the records that you need to consider into temporary tables and then create your final select statement from the temporary tables joined to any other tables that you're not filtering.
As an example, let's say one of the fields you are searching for is field1 in table1. Start by inserting into table #table1 only records that have the value of field1 you are looking for:
select PrimaryKeyTable1, Field1, Field2, Field3, etc...
into #table1
from table1
where Field1 = 'Whatever you are looking for'
This should be pretty fast even for big tables, especially if you have an index on Field1. You do this for every table with search fields, to collect all the records that match what you are searching for.
Then you also need to be sure to insert any records into your temporary tables that might have foreign key references to any of your other temporary tables. So let's say you also built a table #table2 with the above method that has a foreign key to table1 called PrimaryKeyTable1. You would insert those records like:
Insert into #table1
(PrimaryKeyTable1, Field1, Field2, Field3, etc...)
select table1.PrimaryKeyTable1, table1.Field1, table1.Field2, table1.Field3, etc...
from table1
join #table2
on table1.PrimaryKeyTable1 = #table2.PrimaryKeyTable1
where table1.PrimaryKeyTable1 not in
(Select PrimaryKeyTable1 from #table1)
Now #table1 will also contain any records that match a record in #table2 satisfying the search criteria. You do this for all your temporary tables that have relevant foreign keys. The order in which you do the inserts matters; don't pull foreign-key-referenced records from a temporary table until it has received its last insert.
Then you can simply do your final select statement, replacing the actual tables with the temporary tables you have built and eliminating all the filters that search your field data. Depending on the structure of your query there might be other optimizations, but that is the general idea.
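As a sketch of that final step, using the hypothetical tables and columns from above plus an unfiltered table called OtherTable:
select t1.PrimaryKeyTable1, t1.Field1, t2.Field2, o.SomeColumn
from #table1 t1
join #table2 t2
on t2.PrimaryKeyTable1 = t1.PrimaryKeyTable1
join OtherTable o
on o.PrimaryKeyTable1 = t1.PrimaryKeyTable1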
If you've already explored all of your indexing options and this still doesn't help, MS SQL Server has "Change Tracking" features that may be of use to you in building your cache table. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table, and lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change tracking only captures the primary keys of the tables and lets you query which fields might have been modified. Then you can join the tables on those keys to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise Edition.
Change Data Capture
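A rough sketch of what using Change Tracking looks like; the database, table, key column, and @last_sync_version names here are placeholders:
-- enable change tracking on the database and on a tracked table
ALTER DATABASE MyDatabase
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.table1
ENABLE CHANGE_TRACKING;

-- later: fetch the keys of rows changed since the version stored after the previous sync
SELECT ct.PrimaryKeyTable1, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.table1, @last_sync_version) AS ct;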
Your solution is a robust way of doing what is called an "indexed view" in Microsoft SQL Server or a "materialized view" in Oracle.
Basically you are correct: it's faster to query a single indexed table than a dozen that are updated constantly.
You should really try creating an indexed view (start here: https://technet.microsoft.com/en-us/library/dd171921(v=sql.100).aspx) and it will probably solve all your performance issues.
You can use a schema-bound view and create a clustered index on the view. It will store your view data physically, but after creating the schema-bound view you cannot alter the underlying tables.
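A minimal sketch of an indexed (schema-bound) view, using the hypothetical table and column names from the earlier answer:
CREATE VIEW dbo.vSearchCache
WITH SCHEMABINDING
AS
SELECT t1.PrimaryKeyTable1, t1.Field1, t2.Field2
FROM dbo.table1 AS t1
JOIN dbo.table2 AS t2
ON t2.PrimaryKeyTable1 = t1.PrimaryKeyTable1;
GO

-- the unique clustered index is what materializes the view's data on disk
CREATE UNIQUE CLUSTERED INDEX IX_vSearchCache
ON dbo.vSearchCache (PrimaryKeyTable1);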

Compare two MySQL tables and remove rows that no longer exist

I have created a system using PHP/MySQL that downloads a large XML dataset, parses it and then inserts the parsed data into a MySQL database every week.
This system is made up of two databases with the same structure. One is a production database and one is a temporary database where the data is parsed and inserted into first.
When the data has been inserted into the temporary database, I perform a merge by inserting/replacing the data in the production database. I have done all of the above so far. I then realised that data which has been removed from a new dataset will be left to linger in the production database.
I need to perform a check to see if the new data is still in the production database, if it is then leave it, if it isn't delete the row from the production database so that the rows aren't left to linger.
For arguments sake, let's say the two databases are called database_temporary and database_production.
How can I go about doing this?
If you are using SQL to merge, a simple SQL can do the delete as well:
delete from database_production.table
where pk not in (select pk from database_temporary.table)
Notes:
This assumes that a row can be uniquely identified. This may be based on a single column, multiple columns, or another mechanism.
If your dataset is large, a not exists may perform better than not in. See What's the difference between NOT EXISTS vs. NOT IN vs. LEFT JOIN WHERE IS NULL? and NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
An example not exists:
delete p from database_production.table p
where not exists (select 1 from database_temporary.table t where t.pk = p.pk)
Performance Notes:
As pointed out by @mgonzalez in the comments on the question, you may want to use a timestamp column (something like last_modified) for comparing/merging in general, so that you compare only changed rows. This does not apply to the delete specifically: you cannot use a timestamp for the delete because, well, the row would no longer exist.
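A minimal sketch of that timestamp-based merge, assuming the table has a last_modified column, pk is the primary key, and @last_sync holds the timestamp of the previous run:
INSERT INTO database_production.table (pk, col1, last_modified)
SELECT pk, col1, last_modified
FROM database_temporary.table
WHERE last_modified > @last_sync
ON DUPLICATE KEY UPDATE
col1 = VALUES(col1),
last_modified = VALUES(last_modified);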

Augment and Prune a MySQL table

I need a little advice concerning a MySQL operation:
There is a database A which contains several tables. With a query I selected a set of entries out of this database to copy the results into another table in database B.
Now the table in database B contains the results of my query on database A.
For instance the query is:
SELECT names.name, ages.age FROM A.names names, A.ages ages WHERE ages.name = names.name;
And to copy these results into database B I would run:
INSERT INTO B.persons (SELECT names.name, ages.age FROM A.names names, A.ages ages WHERE ages.name = names.name);
Here's my question: When the data of database A has changed I want to run an "update" on the table of database B.
So, the easy and dirty approach would be: Truncate the table in database B, re-run the query on database A and copy the result back to database B.
But isn't there a smarter way so that only new result rows of that query will be copied and those entries in database B which are not in database A anymore get deleted?
In short: Is there a way to "augment" the table of database B with new entries and "prune" old entries out?
Thanks for your help
I would do two things:
1) Ensure you have a primary key that's either an integer or a unique combination of columns at a minimum in database B
2) Use logical deletes instead of physical deletes i.e. have a boolean deleted column
Point 2 ensures you never have to delete and lose data; you just update the flag, and in your queries add where deleted = 0 or where deleted is null.
When combined with a primary key, it means everything can be handled easily by an INSERT ... ON DUPLICATE KEY UPDATE, which will insert new rows and update existing ones, meaning it can perform your 'deletes' at the same time too.
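A sketch of how that could look with the tables from the question, assuming B.persons has a primary key on name and a deleted flag column:
-- upsert current rows from A and clear the deleted flag
INSERT INTO B.persons (name, age, deleted)
SELECT names.name, ages.age, 0
FROM A.names names
JOIN A.ages ages ON ages.name = names.name
ON DUPLICATE KEY UPDATE age = VALUES(age), deleted = 0;

-- flag rows that no longer exist in A as logically deleted
UPDATE B.persons p
LEFT JOIN A.names n ON n.name = p.name
SET p.deleted = 1
WHERE n.name IS NULL;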
What you describe sounds like you want to replicate the table. There is no simple quick fix for what you describe. You could of course write some application logic to do it but it would not be so efficient as it would have to compare each entry in each table and then delete or update accordingly.
One solution would be to set up a foreign key between A and B and cascade updates and deletes to B. But this would only partly solve the problem. It would drop rows in B if they were deleted in A, and it would update a key column in B if it were updated in A, but it would not update the other columns. Note also that this would require your table type to be InnoDB.
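A sketch of that cascading foreign key, assuming both tables use InnoDB and A.names(name) is a primary or unique key:
ALTER TABLE B.persons
ADD CONSTRAINT fk_persons_names
FOREIGN KEY (name) REFERENCES A.names (name)
ON DELETE CASCADE
ON UPDATE CASCADE;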
Another would be to run inserts on B with A's values but use
INSERT ... ON DUPLICATE KEY UPDATE ...
Again, this would work fine for updates but not for deletes.
You could try to setup actual MySQL replication but this is perhaps beyond the scope of your problem and is more involved.
Finally, you could set up the foreign key index as described above and write a trigger so that whenever an update is applied to A, the corresponding row in B is also updated. This seems like a plausible solution for you, though not the cleanest one, I would admit.
It would seem that a small batch script, run periodically on whichever environment you are running, to duplicate the table would be the best way to achieve what you are looking for.

SSIS how to find deleted records

I have a data flow from source tables to a destination table. To simplify the question, I'll say there are two merge-joined source tables and one destination table. Also, there are primary keys helping me identify each record.
The package is running everyday, and if one record is deleted from source table, how could I know which one is deleted so that I could delete that in destination table?
(FYI: I've done a check to see if a record exists in the destination table and, if so, update, else insert; but I don't know how to find deleted data.)
Another possible approach:
Assuming you receive all records from the source, not just inserts and updates:
Amend the package to stamp records that have been inserted or updated, using a unique run id or run datetime
Following the package run, process the destination table rows that weren't inserted or updated in the last package run. By a process of elimination, any records that weren't provided in the source should be deleted (see the sketch below).
Again, this assumes that all records are sent, not just inserts and updates. But then again, if you don't receive all records, it's going to be physically impossible to detect whether a record has been deleted.
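A sketch of the post-run clean-up, assuming a hypothetical LastRunId column stamped by the package and a @CurrentRunId value for the run that just finished:
DELETE FROM dbo.DestinationTable
WHERE LastRunId <> @CurrentRunId;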
The problem with comparing source to destination is that you have to compare every source row to the destination in every load, and as the number of rows increases that takes up more and more time.
As a result, the best way to handle this is probably on the source side. Two common approaches are a 'soft delete', where you set a flag column to mark the row as deleted, or a trigger that records the PK of the deleted row in a log table (or moves the entire row to an archive table). Your ETL process then looks at the flags or the log/archive table to determine which rows were deleted since the last load.
Another possibility is that the source platform offers some built-in feature you can use to track deleted rows, e.g. CDC in SQL Server. But if you have no control at all over the source database (if it even is a database) then there may be no alternative to comparing the full data set.
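As a sketch of the trigger approach on a SQL Server source, with hypothetical SourceTable and DeletedRowsLog names and a primary key column called PK:
CREATE TABLE dbo.DeletedRowsLog (
PK int NOT NULL,
DeletedAt datetime2 NOT NULL DEFAULT SYSDATETIME()
);
GO

-- log the primary key of every deleted row so the ETL can replay the deletes
CREATE TRIGGER trg_SourceTable_Delete
ON dbo.SourceTable
AFTER DELETE
AS
BEGIN
INSERT INTO dbo.DeletedRowsLog (PK)
SELECT PK FROM deleted;
END;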
One possible approach:
Prior to running the package, delete the destination table's records (using a stored procedure)
Just import all records into the destination table
Pros:
Your destination table will always mirror the incoming data, no need to check for deletions
Cons:
You won't have any historical information (if that is required)
I had the same problem, as in how to mark my old/archive records as being "deleted" because they no longer exist in the original data source.
Basically, I built two tables: one is the main table containing all the records that came in from the original data source, and the other is a temporary table I keep to store the original data source every time I run my scripts.
MAIN TABLE
ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc
plus a STATUS column (1 for Active, 0 for Deleted)
TEMP TABLE same as the original, but without STATUS column
ID, NAME, SURNAME, DATE_MODIFIED, ORDERS_COUNT, etc
The key was to update the MAIN TABLE with STATUS = 0 if the ID in the MAIN table was no longer in the TEMP table, i.e. the source records have been deleted.
I did it like this:
UPDATE m
SET m.Status = 0
FROM tblMAIN AS m
LEFT JOIN tblTEMP AS t
ON t.ID = m.ID
WHERE t.ID IS NULL