Load TableC from TableB based on value of TableA in SSDT/SSIS

I have 3 tables-
--server 1
CREATE TABLE TableA (GROUP_ID INT
,STATUS VARCHAR(10))
--server 2
CREATE TABLE TableB (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
-- server 1
CREATE TABLE TableC (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
Sample data
INSERT INTO TableA VALUES (1, 'READY'), (2, 'NOT READY'), (3, 'READY'), (4, 'NOT READY')
INSERT INTO TableB VALUES (1, 'Mike', 'NY', 'aaa'), (1, 'Rick', 'OK', 'bbb'), (2, 'Smith', 'TX', 'ccc'), (3, 'Nancy', 'MN', 'bbb'), (4, 'Roger', 'CA', 'aaa')
I am trying to build an SSDT (SSIS 2012) package to load the data in TableC from TableB for only those GROUP_IDs which have STATUS = 'READY' in TableA, and then update the STATUS to 'LOADED'.
I need to accomplish this by using project-level parameters or variables for TableA's GROUP_ID and STATUS, because I will be doing this for about 60 tables and those values might change.
I must build an SSIS package; it is a requirement.
Using a linked server is not preferred, unless it's impossible to achieve this through SSIS.
Any help would be appreciated.

As the two tables are on separate servers, you could create a Data Flow with two Sources. You'll need to set up Connection Managers to both databases, then point one Source to the database holding TableA, and the other to the database holding TableB. Once this is done, you can join the two with a Merge Join, and then discard the records which don't have the value or values you want using a Conditional Split.
First you'll need to set up the Sources as already discussed. However, since you want to use a Merge Join, you'll need to sort the output from the sources. You can do this in SSIS with a Sort transform, but you're better off building an ORDER BY clause into the SELECT statement you have in the source (a sketch of such a query follows the steps below), and then telling SSIS that the output is sorted:
Right click on each Source, and select Show Advanced Editor.
Go to the Input and Output Properties tab.
Select OLE DB Source Output, then set IsSorted on the right-hand side to True.
Expand OLE DB Source Output, then expand Output Columns.
Click on the column you're sorting by (presumably GROUP_ID), and set SortKeyPosition to 1.
That last bit can be a little fiddly; getting around the properties in SSIS takes some practice if you're not used to it.
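The source query itself can stay simple. Here's a minimal sketch for the TableA side (the TableB source is analogous, with its own column list); the ORDER BY must match the IsSorted/SortKeyPosition metadata you just declared:
SELECT GROUP_ID, STATUS
FROM TableA
ORDER BY GROUP_ID; -- must agree with the sort metadata set in the Advanced Editor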
Since the STATUS value you want to load might change, you could set this up in the Project Parameters. Just go to that page from the Solution Explorer, and click to add a new parameter.
As you're using 2012, you'll be able to configure this value after release in SSMS, avoiding the need to re-work this or create a configuration file.
When you set up the Conditional Split, you have a couple of options. If you might want to send rows with other STATUS values into other tables in future, you should add a condition that looks for cases where the STATUS has a value of READY; but if you only care about the READY rows, you can instead add conditions catching the other STATUS values and let the READY rows fall through to the default output, which is the way I have it here.
When you drag the output of the Conditional Split to the destination, it'll ask which output you want to use. If you've set it up the second way, use the Conditional Split Default Output, and it'll pass through all rows which don't meet any of the conditions you've stated.
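If you go with the first option, the condition is just an equality test in the SSIS expression language. A sketch, assuming a project parameter named ReadyStatus (the parameter name is hypothetical):
STATUS == @[$Project::ReadyStatus]
Rows meeting that condition go to the named output; everything else falls through to the default output.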
If you need to update values while you're loading the data, it depends on where you want the updates to show. If you want to leave TableA and TableB alone, but change the value in TableC, then you could add a Derived Column transform after the Conditional Split and before the Destination. You could then replace the value in the STATUS column with one you set (this can be parameterised, as above).
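As a sketch, in the Derived Column editor you'd choose to replace the STATUS column and use an expression that just reads a hypothetical project parameter (note that the TableC definition in the question doesn't actually include a STATUS column, so this assumes the real target has one):
@[$Project::LoadedStatus]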
If you want to update the STATUS field in TableA, then you should go back to the Control Flow and, after the Data Flow you've been working on, add an Execute SQL Task which is connected to the database holding TableA and runs a simple SQL UPDATE statement.
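A minimal sketch of that statement, assuming an OLE DB connection (where the ? placeholders are mapped to your parameters or variables on the task's Parameter Mapping page):
UPDATE TableA
SET STATUS = ?   -- map to the parameter holding 'LOADED'
WHERE STATUS = ?; -- map to the parameter holding 'READY'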
If this is going to be running outside of business hours and you know there won't be any new rows during this time, you can simply update all rows which currently have a STATUS of READY. If you need to update the rows more precisely because the situation might still be changing while you work, then you might need to re-think this. One option would be to grab all of the GROUP_ID values you want to update at the beginning, store them in a variable, and use the variable as a parameter in the Source select statements and the Execute SQL Task update statement. You could also choose to work in a loop instead, but that would obviously be a lot slower than operating on the rows in bulk.
This part is from my original answer before the question was updated, but I'll leave it here in case it's useful to anyone else:
If the tables (A and B) are in the same database, instead of the Conditional Split you could set the source up to be a select statement which joins Table A to Table B, and has a WHERE clause that only selects the rows with a STATUS of READY:
select b.GROUP_ID, b.NAME, b.STATE, b.COMPANY
from TableA a
inner join TableB b
on a.GROUP_ID = b.GROUP_ID
where a.STATUS = 'READY';

Related

Reading incremental data based on the value of composite Primary key

I have an OLTP database (source) from which data has to be moved to the DWH (destination) on an incremental basis.
The source table has a composite primary key on LOAN_ID and ASSETID, as shown below.
LOAN_ID, ASSETID, REC_STATUS
'12848','13170', 'F'
Had it been a single-column primary key, I would check for the max value of the column at the destination and then read all the records from the source where the primary key value is greater than the max value at the destination; but as it is a composite primary key, this will not work.
Any idea how this can be done using T-SQL Query?
Specs: the source is a MySQL DB and the destination is SQL Server 2012. The connection is made using a linked server.
There are a few things that you can try. Dealing with a linked server, and not knowing the specifics of that setup or the volume of data, performance could be an issue.
If you're not worried about changes in existing records or deletes, a simple left outer join will get you any records that haven't been inserted into your destination yet:
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL;
If you're worried about changes, you could still use a left outer join and look for NULLs in the destination or differences in field values, but then you'd need an additional update statement.
SELECT [s].[LOAN_ID]
, [s].[ASSETID]
, [s].[REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable] [s]
LEFT OUTER JOIN [DestinationTable] [d]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [d].[LOAN_ID] IS NULL --Records from source not in destination
OR (
--This evaluates those in the destination, but then checks for changes in field values.
[d].[LOAN_ID] IS NOT NULL
AND (
[s].[REC_STATUS] <> [d].[REC_STATUS]
OR [s].[SomeOtherField] <> [d].[SomeOtherField]
)
);
You could insert the results of the above into some landing or staging table on the destination side and then do a MERGE.
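For the update side, here's a sketch of that additional statement, assuming the changed rows were first landed in a staging table called [StagingTable] (the name is hypothetical):
UPDATE [d]
SET [d].[REC_STATUS] = [s].[REC_STATUS]
FROM [DestinationTable] [d]
INNER JOIN [StagingTable] [s]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [s].[REC_STATUS] <> [d].[REC_STATUS]; -- only touch rows that actually changed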
If you need to worry about deletes (a record was deleted from the source and you don't want it in the destination anymore), flip the left outer join to find records in your destination that are no longer in the source:
DELETE [d]
FROM [DestinationTable] [d]
LEFT OUTER JOIN [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [d].[LOAN_ID]
AND [s].[ASSETID] = [d].[ASSETID]
WHERE [s].[LOAN_ID] IS NULL;
You could attempt doing all of this using a MERGE: try the MERGE over the linked server, or bring all the source records to the destination in a landing/staging table and then do the MERGE there. Here's an example of attempting it over the linked server:
MERGE [DestinationTable] [t]
USING [LinkedServer].[Database].[schema].[SourceTable] [s]
ON [s].[LOAN_ID] = [t].[LOAN_ID]
AND [s].[ASSETID] = [t].[ASSETID]
WHEN MATCHED THEN UPDATE SET [REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY TARGET THEN INSERT (
[LOAN_ID]
,[ASSETID]
,[REC_STATUS]
)
VALUES ( [s].[LOAN_ID], [s].[ASSETID], [s].[REC_STATUS] )
WHEN NOT MATCHED BY SOURCE THEN DELETE;
When dealing with a MERGE, you've got to watch out for this statement:
WHEN NOT MATCHED BY SOURCE THEN DELETE;
If you're not working with the entire record set, you could lose records in your destination. For example: you've limited the result set you pulled from the source into a staging table; you now merge the staging table with the final destination; anything outside of that limited set gets deleted in your destination. You can solve that by limiting your target with a CTE (Google "merge into cte as the target"). That works if you have a date you can filter on.
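A sketch of that pattern, with a hypothetical [UpdateDate] column used to limit the target and a hypothetical staging table [StagingTable]; the DELETE clause then only applies to rows inside the CTE's filter:
WITH [TargetSlice] AS (
SELECT [LOAN_ID], [ASSETID], [REC_STATUS]
FROM [DestinationTable]
WHERE [UpdateDate] >= DATEADD(DAY, -7, GETDATE()) -- same window as the staged source pull
)
MERGE [TargetSlice] [t]
USING [StagingTable] [s]
ON [s].[LOAN_ID] = [t].[LOAN_ID]
AND [s].[ASSETID] = [t].[ASSETID]
WHEN MATCHED THEN UPDATE SET [t].[REC_STATUS] = [s].[REC_STATUS]
WHEN NOT MATCHED BY TARGET THEN INSERT ([LOAN_ID], [ASSETID], [REC_STATUS])
VALUES ([s].[LOAN_ID], [s].[ASSETID], [s].[REC_STATUS])
WHEN NOT MATCHED BY SOURCE THEN DELETE;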
If you have a date column, that's always helpful, especially some sort of change/update date column that is set when records are inserted or updated. Then you can filter your source down to only those records you care about; incremental loads typically have a date driving them.
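As a sketch, assuming a [ChangeDate] audit column on the source and a variable holding the last successful load time (both hypothetical):
SELECT [LOAN_ID], [ASSETID], [REC_STATUS]
FROM [LinkedServer].[Database].[schema].[SourceTable]
WHERE [ChangeDate] >= @LastLoadDate; -- only pull rows touched since the last load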
You can use a composite key inside a Lookup; this has been answered many times.
Add a Lookup and change the no-match behavior to redirect rows (the default is to fail the component).
Basically, you check whether the key exists in the destination:
If the key exists, then it is an update (match).
If the key does not exist (no match), then it is an insert.
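As a sketch, the Lookup's reference query only needs the destination's key columns (the table name here is hypothetical), and you map both LOAN_ID and ASSETID in the Lookup editor:
SELECT [LOAN_ID], [ASSETID]
FROM [dbo].[DestinationTable];
The no-match output then feeds the insert destination, and the match output feeds your update path.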

SSIS prevent the insert of data rows from flat file that already exist in the SQL Server table

I need to create an SSIS package in which I am reading a flat file (provided monthly, with many defined columns) and writing the data to an already defined SQL Server table (with a lot of data already in the SQL table). In the SQL table design view, I have datatypes including float, datetime, bigint, and varchar (which are already defined and CANNOT be changed).
I need to prevent the insert of any data rows from the flat file that already exist in the SQL Server table. How can I achieve this?
I tried to achieve this using a Lookup transformation, but in Edit Mappings I get an error while creating relationships: "Cannot map the lookup column because the column is set to a floating point data type". I am able to create the relationships for all other data types, but there are some data rows in the source file which differ from the data in the SQL table in floating-point values only, and the expectation is that these rows will be inserted.
Is there any other simple way to achieve this?
Thanks.
Please try to convert the columns which have problems in mapping, using a Data Conversion transformation.
Thanks
Neither SSIS nor SQL bulk load (the SQL feature that is behind the SSIS load task) permits this out of the box.
You can use the method described by sasi and, in your Lookup, define the SQL query yourself with a SQL cast (the CONVERT keyword). But even if you could solve your cast issue this way, you will surely face a performance problem if you load a large amount of data.
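A sketch of such a Lookup query, assuming [Id] is the key and [Amount] is the float column (both names hypothetical); casting to DECIMAL gives the Lookup a mappable type, and the flat-file side then needs a matching Data Conversion:
SELECT [Id], CONVERT(DECIMAL(18, 6), [Amount]) AS [Amount]
FROM [dbo].[TargetTable];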
There are two ways to deal with it.
The first (the easiest, but quite slow compared to the other option, maybe even slower than your current solution in some conditions) is to run an INSERT statement for each row, like the following, where the $my... placeholders stand for the current row's values:
INSERT target_table (val1, val2, id)
SELECT $myVal1, $myVal2, $myCandidateKey
WHERE NOT EXISTS (SELECT 1 FROM target_table as t WHERE t.id = $myCandidateKey);
The second implies the creation of a staging table on the target database. This table has the same structure as your target table, and it is created once and for all. You must also create an index on the columns that form the key determining whether a record has already been loaded. Your process will empty this staging table prior to each execution, for obvious reasons. Instead of loading the target table with SSIS, you load this staging table. Once the staging table is loaded, you run the following command just once:
INSERT target_table (val1, val2, id)
SELECT stg.val1, stg.val2, stg.id
FROM staging_target_table as stg
WHERE NOT EXISTS (SELECT 1 FROM target_table as t WHERE t.id = stg.id);
This is extremely fast, compared to the first solution.
In this case, I assumed that what permits you to recognize your row is a key (the "id" column), but if you actually want to compare the full row, you will have to add the comparison like this for the first solution:
INSERT target_table (val1, val2)
SELECT $myVal1, $myVal2
WHERE NOT EXISTS (SELECT 1 FROM target_table as t WHERE t.val1 = $myVal1 and t.val2 = $myVal2);
or like this for the second solution:
INSERT target_table (val1, val2, id)
SELECT stg.val1, stg.val2, stg.id
FROM staging_target_table as stg
WHERE NOT EXISTS (SELECT 1 FROM target_table as t WHERE t.val1 = stg.val1 and t.val2 = stg.val2);

MySQL copy row from one table to another with multiple NOT IN criteria

We have an old FoxPro DB that still has active data being entered into it. I am in the process of writing a series of .bat files that will update a MySQL database for our web applications that I'm working on.
Our FoxPro databases were never set up with unique IDs or anything useful like that so I'm having to have the query look at a few different fields.
Here's my query thus far:
-- traininghistory = MySQL DB
-- traininghistory_test = FoxPro DB
INSERT INTO traininghistory
SELECT * FROM traininghistory_test
WHERE traininghistory_test.CLASSID NOT IN (SELECT CLASSID FROM traininghistory)
AND traininghistory_test.EMPID NOT IN (SELECT EMPID FROM traininghistory)
What I'm after is this:
I need a query that looks at the 600,000+ entries in the FoxPro DB (traininghistory_test in my code), compares them to the 600,000+ entries in the MySQL DB (traininghistory in my code), and only inserts the ones where the combination of CLASSID and EMPID is new, that is, NOT already in the traininghistory table.
Any thoughts on this (or if you know a simpler/more efficient way to execute this query in MySQL) are greatly appreciated.
One option is to use an outer join / null check:
insert into traininghistory
select tht.* -- or an explicit column list
from traininghistory_test tht
left join traininghistory th on tht.empid = th.empid
and tht.classid = th.classid
where th.empid is null
It's also worth noting that your current query may leave out records, since it's not comparing empid and classid in the same records: for example, a source row with (CLASSID 1, EMPID 2) would be skipped if the destination contains (CLASSID 1, EMPID 3), even though the pair (1, 2) is new.
One way is to create a unique index on the columns (CLASSID, EMPID), then use INSERT IGNORE so duplicates are silently skipped:
ALTER TABLE traininghistory ADD UNIQUE INDEX ux_classid_empid (CLASSID, EMPID); -- index name is arbitrary
INSERT IGNORE INTO traininghistory SELECT * FROM traininghistory_test; -- or an explicit field list
That's all.

SQL query runs on SQL Server 2008 standard but not on enterprise

I'm having a problem with a query running on SQL Server 2008 Enterprise.
The query is an insert into one table from another, but it checks that a record is inserted just once.
The query does something like this:
insert into A(...)
--complex select from table B as b
WHERE NOT EXISTS (SELECT 1 FROM A WHERE id = b.id)
Edit: this query does the following:
If the "complex select" from B selects the record 45 (i.e. the record with id = 45) twice then the where is true for the first time record 45 appears, so it gets inserted in A.
Then the second time record 45 appears, the where condition is false, so it does not get inserted in A twice.
This query works fine on SQL Server 2008 Standard Edition, so I think the problem is a difference between the SQL Server editions (like a different default setting or something).
I'm reading about the Maximum Insert Commit Size, but I'm not sure if that can be the issue.
There is no error message; the only visible symptom is that on Standard I get record 45 once and on Enterprise I get it twice.
Any Ideas?
I'm pretty sure that the behaviour you say you are getting on Standard Edition is for some other reason than you think it is.
You seem to be expecting that, if your values to be INSERTed contain duplicates, one will be INSERTed and then the NOT EXISTS will evaluate to false because of the existence of the newly added row. However, AFAIK, that is not the way it is supposed to work. Look at a simple INSERT .. SELECT as below.
CREATE TABLE A(id INT PRIMARY KEY)
CREATE TABLE B(id INT PRIMARY KEY)
INSERT INTO A
SELECT *
FROM B
This gives a simple plan: a scan of B feeding a clustered index insert on A.
Adding the NOT EXISTS clause
INSERT INTO A
SELECT *
FROM B
WHERE NOT EXISTS (SELECT 1 FROM A WHERE id = B.id)
This changes the plan: as well as now including an anti semi join, SQL Server has added an eager spool before the clustered index insert on A. This is a blocking operator, and its purpose is to ensure that the entire SELECT is evaluated before any rows are inserted into A at all (related to Halloween Protection).
You might not necessarily see a spool in your plans, however; SQL Server might also choose to use another blocking operator, such as a Sort or a hash anti semi join.
Please post the execution plan for at least the Standard Edition run, and preferably both. Also post the queries, so we can see if you are using any non-deterministic constructs.
Try restructuring your query and see if it works. Instead of the NOT EXISTS, join the tables in the insert query:
INSERT INTO A (...)
SELECT ... FROM B LEFT JOIN A ON B.id = A.id
WHERE B.id = 45 AND A.id IS NULL
Then you are performing the select only once.

SQL: Select keys that don't exist in one table

I've got a table with a normal setup of auto-incrementing IDs. Some of the rows have been deleted, so the ID list could look something like this:
(1, 2, 3, 5, 8, ...)
Then, from another source (Edit: Another source = NOT in a database) I have this array:
(1, 3, 4, 5, 7, 8)
I'm looking for a query I can use on the database to get the list of IDs from my array that are NOT in the table. That would be:
(4, 7)
Does such a query exist? My solution right now is either creating a temporary table so that "WHERE table.id IS NULL" works, or, probably worse, using the PHP function array_diff to see what's missing after having retrieved all the IDs from the table.
Since the list of IDs is closing in on millions of rows, I'm eager to find the best solution.
Thank you!
/Thomas
Edit 2:
My main application is a rather simple table which is populated by a lot of rows. This application is administrated using a browser, and I'm using PHP as the interpreter for the code.
Everything in this table is to be exported to another system (which is a 3rd-party product), and there's as yet no way of doing this besides manually using the import function in that program. It's also possible to insert new rows in the other system, although the agreed routine is to never, ever do this.
The problem is that my system cannot be 100% sure that the user did everything correctly after he/she pressed the "export" key, or that no rows have ever been created in the other system.
From the other system I can get a CSV file out containing all the rows that system has. So, by comparing the CSV file and my table I can see if:
* there are any rows missing in the other system that should have been imported
* someone has created rows in the other system
The problem isn't solving it; it's finding the best solution, since there is so much data in the rows.
Thanks again!
/Thomas
We can use MySQL's NOT IN option.
SELECT id
FROM table_one
WHERE id NOT IN ( SELECT id FROM table_two )
Edited
If you are getting the source from a CSV file, then you can simply put these values in directly, like below.
I am assuming that the CSV is like 1,2,3,...,n:
SELECT id
FROM table_one
WHERE id NOT IN ( 1,2,3,...,n );
EDIT 2
Or, if you want to select the other way around, you can use mysqlimport to import the data into a temporary table in the MySQL database, retrieve the result, and then delete the table. Like so:
Create table
CREATE TABLE my_temp_table (
ids INT
);
Load the .csv file
LOAD DATA LOCAL INFILE 'yourIDs.csv' INTO TABLE my_temp_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(ids);
Selecting records
SELECT ids FROM my_temp_table
WHERE ids NOT IN ( SELECT id FROM table_one )
Drop the table
DROP TABLE IF EXISTS my_temp_table
What about using a left join; something like this:
select second_table.id
from second_table
left join first_table on first_table.id = second_table.id
where first_table.id is null
You could also go with a sub-query; depending on the situation, it might, or might not, be faster, though:
select second_table.id
from second_table
where second_table.id not in (
select first_table.id
from first_table
)
Or with a not exists :
select second_table.id
from second_table
where not exists (
select 1
from first_table
where first_table.id = second_table.id
)
The function you are looking for is NOT IN (an alias for <> ALL).
The MySQL documentation:
http://dev.mysql.com/doc/refman/5.0/en/all-subqueries.html
An example of its use:
http://www.roseindia.net/sql/mysql-example/not-in.shtml
Enjoy!
The problem is that T1 (your table) could have a million rows or ten million rows, and that number could change, so you don't know how many rows your comparison table T2 (the one that has no gaps) should have for doing a WHERE NOT EXISTS or a LEFT JOIN testing for NULL.
But the question is: why do you care if there are missing values? I submit that, when an application is properly architected, it should not matter if there are gaps in an auto-incrementing key sequence. Even an application where gaps do matter, such as a check register, should not be using an auto-incrementing primary key as a synonym for the check number.
Care to elaborate on your application requirement?
OK, I've read your edits/elaboration. Synchronizing two databases where the second is not supposed to insert any new rows, but might do so, sounds like a problem waiting to happen.
Neither approach suggested above (WHERE NOT EXISTS or LEFT JOIN) is air-tight and neither is a way to guarantee logical integrity between the two systems. They will not let you know which system created a row in situations where both tables contain a row with the same id. You're focusing on gaps now, but another problem is duplicate ids.
For example, if both tables have a row with id 13887, you cannot assume that database1 created the row. It could have been inserted into database2, and then database1 could insert a new row using that same id. You would have to compare all column values to ascertain that the rows are the same or not.
I'd suggest therefore that you also explore GUID as a replacement for autoincrementing integers. You cannot prevent database2 from inserting rows, but at least with GUIDs you won't run into a problem where the second database has inserted a row and assigned it a primary key value that your first database might also use, resulting in two different rows with the same id. CreationDateTime and LastUpdateDateTime columns would also be useful.
However, a proper solution, if it is available to you, is to maintain just one database and give users remote access to it, for example, via a web interface. That would eliminate the mess and complication of replication/synchronization issues.
If a remote-access web-interface is not feasible, perhaps you could make one of the databases read-only? Or does database2 have to make updates to the rows? Perhaps you could deny insert privilege? What database engine are you using?
I have the same problem: I have a list of values from the user, and I want to find the subset that does not exist in another table. I did it in Oracle by building a pseudo-table in the select statement. Here's a way to do it in Oracle; try it in MySQL without the "from dual":
-- find ids from user (1,2,3) that *don't* exist in my person table
-- build a pseudo table and join it with my person table
select pseudo.id from (
select '1' as id from dual
union select '2' as id from dual
union select '3' as id from dual
) pseudo
left join person
on person.person_id = pseudo.id
where person.person_id is null