SSIS Script component - Reference data validation

I am in the process of extending an SSIS package, which takes in data from a text file, 600,000 lines of data or so, modifies some of the values in each line based on a set of business rules and persists the data to a database, database B. I am adding in some reference data validation, which needs to be performed on each row before writing the data to database B. The reference data is stored in another database, database A.
The reference data in database A is stored in seven different tables; each table only has 4 or 5 columns of type varchar. Six of the tables contain < 1 million records and the seventh has 10+ million rows. I don't want to keep hammering the database for each line in the file, so I just want to get some feedback on my proposed approach and ideas on how best to manage the largest table.
The reference data checks will need to be performed in the script component, which acts as a source in the data flow and has an ADO.NET connection. In PreExecute, I am going to retrieve the reference data from database A (the tables which have < 1 million rows) using the ADO.NET connection, loop through each result set with a SqlDataReader, convert the rows to .NET objects (one type for each table) and add them to a dictionary.
As I process each line in the file, I can use the dictionaries to perform the reference data validation. Is this a good approach? Anybody got any ideas on how best to manage the largest table?
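For what it's worth, here is a minimal sketch of the dictionary approach described above, assuming a SqlConnection obtained from the ADO.NET connection manager in PreExecute; the class name, table names and key columns are placeholders, not the actual package code.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public class ReferenceDataCache
{
    // One dictionary per reference table, keyed on the lookup column.
    private readonly Dictionary<string, Dictionary<string, string[]>> tables =
        new Dictionary<string, Dictionary<string, string[]>>();

    // Call once per reference table, e.g. from the script component's PreExecute.
    public void Load(SqlConnection conn, string tableName, string keyColumn)
    {
        var rows = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);
        using (var cmd = new SqlCommand("SELECT * FROM " + tableName, conn))
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            int keyOrdinal = reader.GetOrdinal(keyColumn);
            while (reader.Read())
            {
                // All columns are varchar, so read everything as strings.
                var values = new string[reader.FieldCount];
                for (int i = 0; i < reader.FieldCount; i++)
                    values[i] = reader.IsDBNull(i) ? null : reader.GetString(i);

                string keyValue = values[keyOrdinal];
                if (keyValue != null)
                    rows[keyValue] = values;   // last row wins on duplicate keys
            }
        }
        tables[tableName] = rows;
    }

    // Call for each line of the file: true if the key exists in the reference table.
    public bool IsValid(string tableName, string key)
    {
        Dictionary<string, string[]> rows;
        return tables.TryGetValue(tableName, out rows) && rows.ContainsKey(key);
    }
}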

Related

Read Metadata Dynamically in SSIS package

I have a requirement where I have to execute a MySQL procedure (the procedure name is the same across all the sites) for different sites. I have a Foreach Loop which gets the server details and database details, and within the Foreach Loop container I have a Data Flow Task. In the Data Flow Task I created a source with a MySQL connection and I call the stored procedure like below:
CALL spITProd( '2022-09-25 20:04:22.847000000', '2022-10-25 20:04:22.847000000' );
This gives different metadata for different sites. To be precise, the first 16 columns are the same for all the sites; after that, the number of columns and the column names can vary. My requirement is to store the first 16 columns in a destination table and all the other columns in a different table by unpivoting them. I was able to achieve this for one site, as shown in the SSIS data flow task below.
I want to automate this process for all the sites. I know I can create multiple data flow tasks, but that's not what I want to do; when the source connects to a new site it fails as it encounters new metadata (as I said, the number of columns and the column names change after the first 16 columns). Please suggest any ideas on how to dynamically read the columns from the procedure. I don't have any permissions on the MySQL database, as it is managed by the vendor; I just have permission to call the procedure. I tried to store the data in a System.Object variable, but I am not sure how to use that later, as I can't create a temp table on the destination using it. I'd appreciate any ideas on how to handle this requirement.
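Purely as an illustration of what reading the columns dynamically would involve if the work were moved into a script component source, here is a minimal sketch using MySQL Connector/NET; the connection string and class name are assumptions, and the CALL statement is the one from the question.

using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

public static class DynamicProcReader
{
    public static void Read(string connectionString)
    {
        const string call =
            "CALL spITProd( '2022-09-25 20:04:22.847000000', '2022-10-25 20:04:22.847000000' )";

        using (var conn = new MySqlConnection(connectionString))
        using (var cmd = new MySqlCommand(call, conn))
        {
            conn.Open();
            using (MySqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // First 16 columns: identical for every site -> fixed destination table.
                    var fixedValues = new object[16];
                    for (int i = 0; i < 16; i++)
                        fixedValues[i] = reader.IsDBNull(i) ? null : reader.GetValue(i);

                    // Remaining columns: only known at runtime -> unpivot to (name, value) pairs.
                    var unpivoted = new List<KeyValuePair<string, object>>();
                    for (int i = 16; i < reader.FieldCount; i++)
                        unpivoted.Add(new KeyValuePair<string, object>(
                            reader.GetName(i),
                            reader.IsDBNull(i) ? null : reader.GetValue(i)));

                    // In a script component source you would push fixedValues to one output
                    // and each (name, value) pair to a second output at this point.
                }
            }
        }
    }
}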

SSIS package design, where 3rd party data is replacing existing data

I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to come, and what happens is that we truncate the old data but don't have the fresh data set. This leaves us with an empty database, where we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files, add it to an ADO recordset and only proceed if this part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way for me to reuse that data. It would also seem wasteful of resources to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If the files are not present, update some flags like IsFile1Found to false and pass these flags to a stored procedure which truncates on a conditional basis.
To check whether a file is empty, you can use PowerShell through an Execute Process Task to extract the first two rows; if there are two rows (header + data row), the data file is not empty. Then you can truncate the table and import the data.
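For reference, here is a minimal sketch of the same presence/emptiness check done in a C# Script Task instead of PowerShell; the variable names (IsFile1Found, File1Path) and the helper class are assumptions.

using System.IO;
using System.Linq;

public static class FileChecks
{
    // True only if the file exists and has at least one data row after the header.
    public static bool HasDataRows(string path)
    {
        if (!File.Exists(path))
            return false;

        // Read at most two lines: header plus the first data row.
        return File.ReadLines(path).Take(2).Count() >= 2;
    }
}

// Inside the Script Task's Main(), set the flag that the stored procedure checks:
//   Dts.Variables["User::IsFile1Found"].Value =
//       FileChecks.HasDataRows(Dts.Variables["User::File1Path"].Value.ToString());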
Another approach could be to load the data into staging tables and, from those staging tables, insert the data into the destination tables using a SQL stored procedure, truncating the staging tables after the data has been moved to all the destination tables. This way, before truncating a destination table, you can check whether the staging tables are empty or not.
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count the records and save the count to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part, I don't care. If you don't care what the counts are, you can keep recycling the same variable; this will reduce the number of variables you need to create (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.

Importing a MySQL database to Neo4j

I have a MySQL database on a remote server which I am trying to migrate into a Neo4j database. For this I dumped the individual tables into CSV files and am now planning to use the LOAD CSV functionality to create graphs from the tables.
How does loading each table preserve the relationship between tables?
In other words, how can I generate a graph for the entire database and not just a single table?
Load each table as a CSV
Create indexes on your relationship field (Neo4j only does single property indexes)
Use MATCH() to locate related records between the tables
Use MERGE(a)-[:RELATIONSHIP]->(b) to create the relationship between the tables.
Run "all at once", this'll create a large transaction, won't go to completion, and most likely will crash with a heap error. Getting around that issue will require loading the CSV first, then creating the relationships in batches of 10K-100K transaction blocks.
One way to accomplish that goal is:
MATCH (a:LabelA)
MATCH (b:LabelB {id: a.id}) WHERE NOT (a)-[:RELATIONSHIP]->(b)
WITH a, b LIMIT 50000
MERGE (a)-[:RELATIONSHIP]->(b)
What this does is find :LabelB records that don't have a relationship with the :LabelA records and then creates that relationship for the first 50,000 records it finds. Running this repeatedly will eventually create all the relationships you want.

SSIS data validation

I have a JSON file that comes with around 125 columns, and I need to load it into a DB table. I'm using an SSIS package: after dumping all the JSON file contents into a DB DUMP table, I need to validate the data, load only the data that is valid into the MASTER table, and send the rest to a failure table. The failure table has 250 columns, with an ERROR column for each data column. If the first column fails validation, I need to write the error message to the corresponding error column and continue with the validation of the second column... Is there some utility in SSIS that helps in achieving this requirement?
I've tried using a Conditional Split, but it appears it doesn't fit the bill.
Thanks,
Vijay
I agree with Alleman's suggestion of getting this done via stored procedures. In terms of implementation, there are various ways you can go about it. I am listing one way here.
In the database you can create some 10 stored procedures, as follows:
dbo.usp_ValidateData_Columns1_To_Columns25
dbo.usp_ValidateData_Columns26_To_Columns50
....
....
dbo.usp_ValidateData_Columns226_To_Columns250
In each of these procedures you can validate your data in bulk across its range of columns. If validation fails, you can insert into the respective error columns.
Once you have this in place, you can then call all of the above procedures in parallel as part of your SSIS package.
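If you would rather drive the calls from a single Script Task than from parallel Execute SQL Tasks, a minimal sketch could look like the following; the connection string is an assumption, and the procedure names are the ones listed above.

using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class ValidationRunner
{
    private static readonly string[] Procedures =
    {
        "dbo.usp_ValidateData_Columns1_To_Columns25",
        "dbo.usp_ValidateData_Columns26_To_Columns50",
        // ... the remaining procedures ...
        "dbo.usp_ValidateData_Columns226_To_Columns250"
    };

    public static void RunAll(string connectionString)
    {
        // One connection per procedure so the calls genuinely run in parallel.
        Parallel.ForEach(Procedures, procName =>
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(procName, conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.CommandTimeout = 0;   // bulk validation over the dump table can be slow
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        });
    }
}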
After that, you would need one more data flow task (DFT) to pick up all the records which are good to be transferred to MASTER.
Basically you are modularizing the whole setup.

SSIS OLE DB conditional "insert"

I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a derived column object due to some character set issues (might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange. I took the Lookup object out to see whether it was causing it, and it seems to be the case. Is there any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no-match output, and then connect only the match-found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is the Lookup would go against the existing destination, and so the Lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so the row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command; it is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.