Reading incremental data based on the value of composite Primary key - mysql

I have an OLTP database (source) from which data has to be moved to the DWH (destination) on an incremental basis.
The source table has a composite Primary Key on Loan_id, AssetID as shown below.
LOAN_ID, ASSETID, REC_STATUS
'12848','13170', 'F'
Had it been a single-column primary key, I would check for the max value of that column at the destination and then read all the records from the source where the primary key value is greater than that max value. As it is a composite primary key, this will not work.
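For illustration, with a single-column key the load would look roughly like this (ID, SourceTable and DestinationTable are hypothetical names, not my actual schema):
-- single-column key approach described above; all names are placeholders
DECLARE @MaxID INT = (SELECT MAX([ID]) FROM [DestinationTable]);
INSERT INTO [DestinationTable] ([ID], [REC_STATUS])
SELECT [s].[ID], [s].[REC_STATUS]
FROM [SourceTable] [s]
WHERE [s].[ID] > @MaxID;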
Any idea how this can be done using T-SQL Query?
Specs: the source is a MySQL DB and the destination is MSSQL 2012. The connection is made using a linked server.

There are a few things that you can try. Without knowing the specifics of the linked server setup or the volume of data, performance could be an issue.
If you're not worried about changes in existing records or deletes, a simple left outer join will get you any records that haven't been inserted into your destination yet:
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL;
If you're worried about changes you could still use a left outer and look for NULL in destination or differences in field values, but then you'd need an additional update statement.
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL --Records from source not in destination
OR (
--This evaluates those in the destination, but then checks for changes in field values.
[d].[LOAN_ID] IS NOT NULL
AND (
[s].[REC_STATUS] <> [d].[REC_STATUS]
OR [s].[SomeOtherField] <> [d].[SomeOtherField]
)
);
--Insert the results of the above into a landing or staging table on the destination side, and then you could do a MERGE.
If you need to worry about deletes (a record was deleted from the source and you don't want it in the destination anymore), flip the left outer join to find records in your destination that are no longer in the source:
DELETE [d]
FROM [DestinationTable] [d]
LEFT OUTER JOIN [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [s].[LOAN_ID] IS NULL;
You could attempt doing all of this using a MERGE. Try the MERGE over the linked server, or bring all the source records over to a land/stage table on the destination and then do the merge there. Here's an example of attempting it over the linked server.
MERGE [DestinationTable] [t]
USING [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [t].[LOAN_ID]
AND [s].[ASSETID] = [t].[ASSETID]
WHEN MATCHED THEN UPDATE SET [REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY TARGET THEN INSERT (
[LOAN_ID]
, [ASSETID]
, [REC_STATUS]
)
VALUES ( [s].[LOAN_ID], [s].[ASSETID], [s].[REC_STATUS] )
WHEN NOT MATCHED BY SOURCE THEN DELETE;
When dealing with a merge, you've got to watch out for this statement:
WHEN NOT MATCHED BY SOURCE THEN DELETE;
If you're not working with the entire record set, you could lose records in your destination. For example, if you've limited the result set you pulled from the source into a staging table and then merge that staging table with the final destination, anything outside of that limited set would get deleted from your destination. You can solve that by limiting your target with a CTE (Google: "merge into cte as the target"). That's if you have a date or similar column you can filter on.
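A rough sketch of that CTE-as-target approach, assuming a staging table called StagingTable and a hypothetical LoadDate column and @CutoffDate to limit the target (neither is in the question's schema):
DECLARE @CutoffDate DATE = '20230101'; --hypothetical cutoff value
WITH [Target] AS
(
SELECT [LOAN_ID], [ASSETID], [REC_STATUS]
FROM [DestinationTable]
WHERE [LoadDate] >= @CutoffDate --hypothetical column; rows outside this filter can never be hit by NOT MATCHED BY SOURCE
)
MERGE [Target]
USING [StagingTable] [s]
ON [s].[LOAN_ID] = [Target].[LOAN_ID]
AND [s].[ASSETID] = [Target].[ASSETID]
WHEN MATCHED THEN UPDATE SET [REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY TARGET THEN INSERT ( [LOAN_ID], [ASSETID], [REC_STATUS] )
VALUES ( [s].[LOAN_ID], [s].[ASSETID], [s].[REC_STATUS] )
WHEN NOT MATCHED BY SOURCE THEN DELETE;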
If you have a date column, that's always helpful, especially some sort of Change/Update date column that is set when records are inserted or updated. Then you can filter your source down to only those records you care about.
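For example, if the source had a change-tracking column (call it LAST_UPDATED, which is not in the question's schema), the incremental pull could be limited like this:
DECLARE @MaxLoaded DATETIME = (SELECT MAX([LAST_UPDATED]) FROM [DestinationTable]);
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
, [s].[LAST_UPDATED]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
WHERE [s].[LAST_UPDATED] > @MaxLoaded;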

Incremental loads typically have a date driving them.
You can use a composite key inside a lookup. This has been answered many times.
Add a lookup and change the no-match behavior to redirect rows (the default is to fail).
Basically you check whether the key exists in the destination.
If the key exists, it is an update (match).
If the key does not exist (no match), it is an insert.

Related

Dealing with large overlapping sets of data - updating just the delta

My Python application generates a CSV file containing a few hundred unique records, one unique row per line. It runs hourly and very often the data remains the same from one run to another. If there are changes, then they are small, e.g.
one record removed.
a few new records added.
occasional update to an existing record.
Each record is just four simple fields (name, date, id, description), and there will be no more than 10,000 records by the time the project is at maximum, so it can all be contained in a single table.
What's the best way to merge changes into the table?
A few approaches I'm considering are:
1) empty the table and re-populate on each run.
2) write the latest data to a staging table and run a DB job to merge the changes into the main table.
3) read the existing table data into my python script, collect the new data, find the differences, run multiple 'CRUD' operations to apply the changes one by one
Can anyone suggest a better way?
thanks
I would do this in the following way:
Load the new CSV file into a second table.
DELETE rows in the main table that are missing from the second table:
DELETE m FROM main_table AS m
LEFT OUTER JOIN new_table AS t ON m.id = t.id
WHERE t.id IS NULL;
Use INSERT ON DUPLICATE KEY UPDATE to update rows that need to be updated. This becomes a no-op on each row that already contains the same values.
INSERT INTO main_table (id, name, date, description)
SELECT id, name, date, description FROM new_table
ON DUPLICATE KEY UPDATE
name = VALUES(name), date = VALUES(date), description = VALUES(description);
Drop the second table once you're done with it.
This is assuming id is the primary key and you have no other UNIQUE KEY in the table.
Given a data set size of 10,000 rows, this should be quick enough to do it in one batch. Once the data set gets 10x larger, you may have to reconsider the solution, for example do batches of 10,000 rows at a time.
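If you do end up batching, the delete step could be split into chunks along these lines (a sketch only; MySQL does not allow LIMIT in a multi-table DELETE, so the keys are picked out in a derived table and the statement is rerun until no rows are affected):
DELETE FROM main_table
WHERE id IN (
    SELECT id FROM (
        SELECT m.id
        FROM main_table AS m
        LEFT OUTER JOIN new_table AS t ON m.id = t.id
        WHERE t.id IS NULL
        LIMIT 10000   -- batch size; repeat until 0 rows affected
    ) AS batch
);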

How to compare a huge table in MySQL

I have a huge MySQL table which contains more than 33 million records. How can I compare my table to find non-duplicate records? Unfortunately a plain select statement doesn't work, because it's such a huge table.
Please provide me a solution
First, create a snapshot of your database or of the tables you want to compare.
Optionally you can also limit the range of data you want to compare, for example only 3 years of data. This way your select query won't hog all the resources.
The snapshot will be a bunch of files, each representing a table and containing the primary key or business key for each record (I am assuming you can compare data based on that key; if that's not the case, record all the fields in your file).
Next, read each record from the file and do a select against the corresponding table. If there is more than 1 record, you know it is a duplicate.
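A sketch of that per-key check, with your_table and business_key standing in for the real names:
-- run once per key read from the snapshot file; a count above 1 means that key is duplicated
SELECT COUNT(*) AS occurrences
FROM your_table
WHERE business_key = 'key-value-from-snapshot-file';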
Thanks
Look at the explain plan and see what the DB is actually doing for the NOT IN.
You could try refactoring, with an index on subscriber as Roy suggested if necessary. I'm not familiar enough with MySQL to know whether the optimizer will execute these identically.
SELECT *
FROM contracts
WHERE NOT EXISTS
( SELECT 1
FROM edms
WHERE edms.subscriber=contracts.subscriber
);
-- or
SELECT C.*
FROM contracts AS C
LEFT
JOIN edms AS E
ON E.subscriber = C.subscriber
WHERE E.subscriber IS NULL;
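Prefixing each variant with EXPLAIN will show what MySQL plans to do (a sketch; the NOT IN form is assumed from the original query):
EXPLAIN SELECT * FROM contracts
WHERE subscriber NOT IN (SELECT subscriber FROM edms);

EXPLAIN SELECT C.*
FROM contracts AS C
LEFT JOIN edms AS E ON E.subscriber = C.subscriber
WHERE E.subscriber IS NULL;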

Deleting non referenced data from database

Because of a bad design, I have to clean up a database. There's data in there which is not "connected" correctly (Foreign keys were not set)
Therefore I want to delete all data which is not referenced.
Using a join, I have created a temporary table temp1 and inserted into it all the Entity_ID values which have no connection to the main table entity. The next step is that I want to delete all the bad data from entityisactive with the following query:
Delete from db1.entityisactive where db1.entityisactive.Entity_ID IN
(
Select db1.temp1.Entity_ID from db1.temp1
)
The problem is that I get a connection timeout, even when I just run Select db1.temp1.Entity_ID from db1.temp1 where Entity_ID = 42
What I want to do is delete all entries in entityisactive where entityisactive.Entity_ID = temp1.Entity_ID
How can I speed up the SQL query? Or where is my error in reasoning?
I would suggest using an explicit join and then defining indexes. The code would look like:
Delete eia
from db1.entityisactive eia join
db1.temp1 t
on eia.Entity_ID = t.Entity_ID ;
Then, this query wants an index on either temp1(Entity_ID) or entityisactive(Entity_ID). You don't need both indexes, and the second will probably give better performance.
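If the index doesn't already exist, creating it is a one-liner (a sketch, assuming MySQL syntax; the index name is arbitrary):
-- index the join column on the larger table; temp1 could be indexed instead
CREATE INDEX idx_entityisactive_entity_id ON db1.entityisactive (Entity_ID);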

r - dbWriteTable or a MySQL Delete query?

I can't seem to find the answer to this anywhere. I am reading a csv into a data frame using the read.csv function. Then I am writing the data frame contents to a mysql table using dbWriteTable. This works great for the initial run to create the table, but each run after this needs to do either an insert or an update, depending on whether the record already exists in the table.
The 1st column in the data frame is the primary key, and the other columns contain data that might change every time I pull a new copy of the csv. Each time I pull the CSV, if the primary key already exists, I want it to update that record with the new data, and if the primary key does not exist (e.g. a new key since the last run), I want it to just insert the record into the table.
This is my current dbWriteTable. This creates the table just fine the 1st time it's run, and also inserts a "Timestamp" column into the table that is set to "on update CURRENT_TIMESTAMP" so that I know when each record was last updated.
dbWriteTable(mydb, value=csvData, name=Table, row.names=FALSE, field.types=list(PrimaryKey="VARCHAR(10)",Column2="VARCHAR(255)",Column3="VARCHAR(255)",Timestamp="TIMESTAMP"), append=TRUE)
Now the next time I run this, I simply want it to update any PrimaryKeys that are already in the table, and add any new ones. I also don't want to lose any records in the event a PrimaryKey disappears from the CSV source.
Is it possible to do this kind of update using dbWriteTable, or some other R function?
If that's not possible, is it possible to just run a mysql query that would delete any duplicate PrimaryKey records and keep just the 1 record with the most current timestamp? So I would run a dbWriteTable to append the new data, and then run a MySQL query to prune out the older records.
Obviously I couldn't define that 1st column as an actual PrimaryKey in the DB as my append/delete solution wouldn't work due to duplicate keys, and that's fine, I can always add an auto increment integer column to the table for the "real" primary key if needed.
Thoughts?
Consider using a temp table (an exact replica of the final table but with fewer records) and then run an INSERT and an UPDATE query into the final table, which will handle both cases without overlap (plus, primary keys are constraints, and queries will error out if attempts are made to duplicate any):
records to append if they do not exist, using the LEFT JOIN ... NULL query
records to update if they do exist, using the UPDATE ... INNER JOIN query
Concerning the former, there is a regular debate among SQL coders over whether LEFT JOIN ... NULL, NOT IN, or NOT EXISTS is the optimal solution, which of course "depends". The left join used here does avoid subqueries, but consider those avenues if needed.
# DELETE LAST SET OF TEMP DATA
dbSendQuery(mydb, "DELETE FROM tempTable")
# APPEND R DATA FRAME TO TEMP DATA
dbWriteTable(mydb, value=csvData, name=tempTable, row.names=FALSE,
field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
append=TRUE, overwrite=FALSE)
# LEFT JOIN ... NULL QUERY TO APPEND NEW RECORDS NOT IN TABLE
dbSendQuery(mydb, "INSERT INTO finalTable (Column1, Column2, Column3, Timestamp)
SELECT Column1, Column2, Column3, Timestamp
FROM tempTable f
LEFT JOIN finalTable t
ON f.PrimaryKey = t.PrimaryKey
WHERE f.PrimaryKey IS NULL;")
# UPDATE INNER JOIN QUERY TO UPDATE MATCHING RECORDS
dbSendQuery(mydb, "UPDATE finalTable f
INNER JOIN tempTable t
ON f.PrimaryKey = t.PrimaryKey
SET f.Column2 = t.Column2,
f.Column3 = t.Column3,
f.Timestamp = t.Timestamp;")
For the most part, the queries above will be compliant in most SQL backends should you ever need to change databases. Some RDBMSs do not support UPDATE ... INNER JOIN, but equivalent alternatives are available. Finally, the beauty of this route is that all processing is handled in the SQL engine and not in R.
Sounds like you're trying to do an upsert.
I'm kind of rusty with MySQL but the general idea is that you need to have a staging table to upload the new CSV, and then in the database itself do the insert/update.
For that you need to use dbSendQuery with INSERT ... ON DUPLICATE KEY UPDATE.
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
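A sketch of the statement you'd pass to dbSendQuery once the CSV has been loaded into a staging table (stagingTable is a hypothetical name; PrimaryKey must actually be declared PRIMARY KEY or UNIQUE in finalTable for this to work):
INSERT INTO finalTable (PrimaryKey, Column2, Column3)
SELECT PrimaryKey, Column2, Column3
FROM stagingTable
ON DUPLICATE KEY UPDATE
    Column2 = VALUES(Column2),
    Column3 = VALUES(Column3);
-- Timestamp is omitted so its ON UPDATE CURRENT_TIMESTAMP default records the change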

load TableC from TableB based on value of TableA in SSDT/SSIS

I have 3 tables-
--server 1
CREATE TABLE TableA (GROUP_ID INT
,STATUS VARCHAR(10))
--server 2
CREATE TABLE TableB (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
-- server 1
CREATE TABLE TableC (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
Sample data
INSERT INTO TableA VALUES (1, 'READY'),(2,'NOT READY'),(3,'READY'),(4,'NOT READY')
INSERT INTO TableB VALUES (1, 'Mike', 'NY', 'aaa'), (1, 'Rick', 'OK','bbb'), (2, 'Smith', 'TX','ccc'), (3, 'Nancy', 'MN','bbb'), (4, 'Roger', 'CA','aaa')
I am trying to build an SSDT (SSIS 2012) package to load the data into TableC from TableB for only those GROUP_IDs which have STATUS = 'READY' in TableA, and then change the STATUS to 'LOADED'.
I need to accomplish this by using project-level parameters or variables for TableA's GROUP_ID and STATUS, because I will be doing this for about 60 tables and those values might change.
I must build a SSIS package, it is a requirement.
Using a linked server is not preferred, unless it's impossible to achieve this through SSIS.
Any help would be appreciated.
As the two tables are on separate servers, you could create a Data Flow with two Sources. You'll need to set up Connection Managers to both databases, then point one Source to the database holding TableA, and the other to the database holding TableB. Once this is done, you can join the two with a Merge Join, and then discard the records which don't have the value or values you want using a Conditional Split. It would ultimately look a bit like this:
First you'll need to set up the Sources as already discussed. However, since you want to use a Merge Join, you'll need to sort the output from the sources. You can do this in SSIS with a Sort transform, but you're better off just building an ORDER BY clause into your SELECT statement that you have in the source, and then telling SSIS that the output is sorted:
Right click on each Source, and select Show Advanced Editor.
Go to the Input and Output Properties tab.
Select OLE DB Source Output, then set IsSorted on the right-hand side to True.
Expand OLE DB Source Output, then expand Output Columns.
Click on the column you're sorting by (presumably GROUP_ID), and set SourceKeyPosition to 1.
Here's an image of that last bit in case you're at all lost - it can be a little fiddly getting around the properties in SSIS if you're not used to it:
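The source query itself only needs the ORDER BY added, for example (using TableB's columns from the question):
SELECT GROUP_ID, NAME, STATE, COMPANY
FROM TableB
ORDER BY GROUP_ID;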
Since the STATUS values you want to look for and load might change, you could set them up in the Project Parameters. Just go to that page from the Solution Explorer, and click to add a new parameter. You should end up with something like this:
As you're using 2012, you'll be able to configure this value after release in SSMS, avoiding the need to re-work this or create a configuration file.
When you set up the Conditional Split, you have a couple of options. If you might want to send rows with other STATUS values into other tables in the future, then you should add a condition that looks for cases where the STATUS has a value of READY; but if you only care about the READY rows, you can also do it the way I have here:
When you drag the output of the Conditional Split to the destination, it'll ask which output you want to use. If you've set it up the same way I have, use Conditional Split Default Output, and it'll pass through all rows which don't meet one of the conditions you've stated.
If you need to update the values of the data while you're loading it, it depends where you want the updates to show. If you want to leave TableA and TableB alone, but change the value in TableC, then you could set up a Derived Column transform after the Conditional Split and before the Destination. You could then replace the value in the STATUS column with one you set (this can be parameterised, as above):
If you want to update the STATUS field in TableA, then you should go back to the Control Flow, and after the Data Flow you've been working on, add an Execute SQL Task which is connected to the database holding TableA, and which runs a simple SQL update statement.
If this is going to be running outside of business hours and you know there won't be any new rows during this time, you can simply update all rows which currently have a STATUS of READY. If you need to update the rows more precisely because the situation might be continuing to change while you work, then you might need to re-think this - one option would be to grab all of the GROUP_ID values you want to update at the beginning, store that in a variable, and use the variable as a parameter in the Source select statements and Execute SQL Task update statement. You could also choose to work in a loop instead, but that would obviously be a lot slower than operating on the rows in bulk.
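The update statement in that Execute SQL Task can be very simple; a sketch, with ? standing for the project parameter mapped into the task (assuming an OLE DB connection):
-- flips the rows that were just loaded; assumes the whole READY set was processed in this run
UPDATE TableA
SET STATUS = 'LOADED'
WHERE STATUS = ?;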
This part is from my original answer before the question was updated, but I'll leave it here in case it's useful to anyone else:
If the tables (A and B) are in the same database, instead of the Conditional Split you could set the source up to be a select statement which joins Table A to Table B, and has a WHERE clause that only selects the rows with a STATUS of READY:
select b.GROUP_ID, b.NAME, b.STATE, b.COMPANY
from TableA a
inner join TableB b
on a.GROUP_ID = b.GROUP_ID
where a.STATUS = 'READY';