Validating the CSV data before inserting into SQL Server - sql-server-2008

I have a scenario where I have to check that the first column of the CSV file contains valid data; the other columns do not need to be checked. If the data is not present in the first column, my SSIS package should log an exception.
Can anyone help me with this scenario, please?
Thanks,
Sateesh.

In SSIS you can use a Conditional Split transformation to do this, sending the good data to where you want it and the bad data to an exception table.
Personally, I always prefer to start by putting the data for any import into two tables: one with the raw, unchanged data and one that will contain the cleaned-up data before the import to the prod tables. This makes it easier to see the cause when you inevitably have to research why some bad data got into the database (if you are doing your job right, 90+% of the time it's bad data you were sent; you can't know the contract expires on 4/12/2012 when you were sent 4/12/2011, to pick a not-so-random example). Also, always make sure to save the input file to an archive location. Trust me, you will need one or more of those archived files some day.
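As a rough illustration of that two-table pattern, here is a minimal T-SQL sketch; the table and column names are made up, and the only validation rule shown is the "first column must not be empty" check from the original question:

-- Hypothetical staging tables: raw data lands unchanged, cleaned data is validated.
CREATE TABLE dbo.Import_Raw (RowId INT IDENTITY(1,1), Col1 VARCHAR(255), Col2 VARCHAR(255));
CREATE TABLE dbo.Import_Clean (RowId INT, Col1 VARCHAR(255) NOT NULL, Col2 VARCHAR(255));
CREATE TABLE dbo.Import_Exception (RowId INT, Col1 VARCHAR(255), Col2 VARCHAR(255), ErrorReason VARCHAR(200));

-- Rows whose first column is missing go to the exception table...
INSERT INTO dbo.Import_Exception (RowId, Col1, Col2, ErrorReason)
SELECT RowId, Col1, Col2, 'First column is empty'
FROM dbo.Import_Raw
WHERE Col1 IS NULL OR LTRIM(RTRIM(Col1)) = '';

-- ...and everything else moves on to the cleaned table for the prod import.
INSERT INTO dbo.Import_Clean (RowId, Col1, Col2)
SELECT RowId, Col1, Col2
FROM dbo.Import_Raw
WHERE NOT (Col1 IS NULL OR LTRIM(RTRIM(Col1)) = '');

The package would bulk-load dbo.Import_Raw straight from the file, run the two INSERTs from an Execute SQL Task, and feed the prod import only from the clean table.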

Related

SSIS package design, where 3rd party data is replacing existing data

I have created many SSIS packages in the past, though the need for this one is a bit different from the others I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to arrive, and what happens is that we truncate the old data even though we don't have the fresh data set. This leaves us with an empty database, when we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files, add it to an ADO recordset, and proceed only if that part was successful. This doesn't seem to work for me, as I have all the data-capture activities in one data flow and I don't see a way to reuse that data. It would also seem wasteful of resources to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If the files are not present, update some flags such as IsFile1Found to false and pass those flags to a stored procedure that truncates on a conditional basis, as in the sketch below.
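A minimal sketch of that idea, with invented flag, procedure, and table names:

-- Hypothetical: only clear yesterday's data when the corresponding file actually arrived.
CREATE PROCEDURE dbo.usp_TruncateIfFileFound
    @IsFile1Found BIT
AS
BEGIN
    IF @IsFile1Found = 1
        TRUNCATE TABLE dbo.VendorTable1;  -- safe to clear, fresh data is on its way
    -- If the flag is 0, leave the existing data untouched.
END

An Execute SQL Task in the package could then call this procedure with the package variable mapped to the parameter.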
To guard against an empty file, you can use PowerShell through an Execute Process Task to extract the first two rows; if there are two rows (header + data row), the data file is not empty, and only then do you truncate the table and import the data.
Another approach could be to load the data into staging tables, insert the data from those staging tables into the destination tables with a SQL stored procedure, and truncate the staging tables once the data has been moved to all the destination tables. That way, before truncating a destination table, you can check whether its staging table is empty, as in the sketch below.
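A rough sketch of that check, with placeholder staging and destination table names:

-- Only swap the data when the staging table actually received rows.
CREATE PROCEDURE dbo.usp_MoveStagingToDestination
AS
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.Staging_Vendor)
    BEGIN
        TRUNCATE TABLE dbo.Vendor;
        INSERT INTO dbo.Vendor (Col1, Col2)
        SELECT Col1, Col2 FROM dbo.Staging_Vendor;
        TRUNCATE TABLE dbo.Staging_Vendor;  -- clear staging once the move succeeds
    END
    -- If staging is empty (the files never came), yesterday's data stays in place.
END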
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count the records and save the count to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part I don't care. If you don't care what the counts are, you can keep recycling the same variable, which reduces the number of variables you have to create (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.

Data Cleanse ENTIRE Access Table of Specific Value (SQL Update Query Issues)

I've been searching for a quick way to do this after my first few thoughts have failed me, but I haven't found anything.
My Issue
I'm importing raw client data into an Access database, where the flat file they provide is parsed and converted into a standardized format for our organization. I do this for all of our clients, but this particular client's software gives us a file that puts "(NULL)" in every field that should be NULL. As a result, I have a ton of strings rather than null fields!
My goal is to do a data cleanse of the entire TABLE, rather than perform the cleanse at the FIELD level (as I do in my temporary solution below).
Data Cleanse
Temporary Solution:
I can't add those strings to our data warehouse, so for now I just have a query with an IIf check that replaces "(NULL)" with "" for each field (which took a while to set up, since the client file has roughly 96 fields). This works. However, we work with hundreds of clients, so I'd like a scalable solution that doesn't require many changes if another client has a similar file; not to mention that if this client changes something in their file, I might have to redo my field-specific statements.
Long-term Solution:
My first thought was an UPDATE query. I was hoping I could do something like:
UPDATE [ImportedRaw_T]
SET [ImportedRaw_T].* = ""
WHERE ((([ImportedRaw_T].*) = "(NULL)"));
This would be easily scalable, since for other clients I'd only need to change the table name and replace "(NULL)" with their particular default. Unfortunately, you can't use SELECT * with an update query.
Can anyone think of a work-around to the SELECT * issue for the update query, or suggest a better solution for cleansing an entire table rather than doing the cleanse at the field level?
SIDE NOTES
This conversion is 100% automated currently (Access is called via a watch folder batch), so anything requiring manual data manipulation / human intervention is out.
I've tried using a batch script to cleanse the data in the .txt file before importing it to Access; however, this caused an issue with the fixed-width format of the .txt, which in turn caused even larger issues with the automatic import of the file into Access. So I'd prefer to do this in Access if possible.
Any thoughts and suggestions are greatly appreciated. Thanks!
Unfortunately, it's impossible to implement this in SQL using wildcards instead of column names; there is no such syntax.
I would suggest a VBA solution, where you cycle through all of the table's fields and, if a field's data type is string, generate and execute a SQL UPDATE command for that field.
Also, use Null instead of "" if you really need Nulls in the fields rather than empty strings; they can behave differently in calculations.
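In Access itself that loop would be VBA over the table's Fields collection. Purely to illustrate the same column-by-column idea, here is a dynamic SQL sketch that assumes (hypothetically) the raw table has been staged in SQL Server as dbo.ImportedRaw_T:

-- Build one UPDATE per character column, replacing the literal '(NULL)' with a real NULL.
DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql
    + N'UPDATE dbo.ImportedRaw_T SET ' + QUOTENAME(c.name)
    + N' = NULL WHERE ' + QUOTENAME(c.name) + N' = ''(NULL)'';' + CHAR(13)
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.ImportedRaw_T')
  AND TYPE_NAME(c.user_type_id) IN ('char', 'varchar', 'nchar', 'nvarchar');

EXEC sys.sp_executesql @sql;

Only the string columns are touched, and changing the table name (or the "(NULL)" literal) is all that another client's file would need.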

SSIS data validation

I have a JSON file that comes with around 125 columns, and I need to load it into a DB table. I'm using an SSIS package, and after dumping all of the JSON file's contents into a DB DUMP table, I need to validate the data, load only the valid data into the MASTER table, and send the rest to a failure table. The failure table has 250 columns, with an ERROR column for each data column. If the first column fails validation, I need to write the error message to the corresponding error column and continue with the validation of the second column, and so on. Is there some utility in SSIS that helps in achieving this requirement?
I've tried using a Conditional Split, but it doesn't appear to fit the bill.
Thanks,
Vijay
I agree with Alleman's suggestion of getting this done via stored procedures. In terms of implementation there are various ways you can go about it; I am listing one way here.
In the database you can create some 10 stored procedures, as follows:
dbo.usp_ValidateData_Columns1_To_Columns25
dbo.usp_ValidateData_Columns26_To_Columns50
....
....
dbo.usp_ValidateData_Columns226_To_Columns250
In each of these procedures you can validate your data in bulk across that range of columns. If validation fails, you can write to the respective error columns, as in the sketch below.
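A sketch of what one such procedure might contain; the dump table, column names, and validation rules are all placeholders:

CREATE PROCEDURE dbo.usp_ValidateData_Columns1_To_Columns25
AS
BEGIN
    -- Flag rows whose first column fails a (hypothetical) not-empty rule,
    -- writing the message into the matching error column of the dump/failure table.
    UPDATE dbo.DumpTable
    SET Error_Column1 = 'Column1 is empty'
    WHERE Column1 IS NULL OR LTRIM(RTRIM(Column1)) = '';

    -- A second (hypothetical) rule: Column2 must be a valid date.
    UPDATE dbo.DumpTable
    SET Error_Column2 = 'Column2 is not a valid date'
    WHERE Column2 IS NOT NULL AND ISDATE(Column2) = 0;

    -- ...repeat the same pattern up to Column25.
END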
Once you have this in place you can then call all the above procedures in parallel as part of your SSIS Package.
After that you would need one more DFT (data flow task) to pick up all the records that are good to be transferred to MASTER.
Basically you are modularizing the whole setup.

How to load data from a fixed-width file to SQL Server, where the main concern is handling the data very carefully

I want to load fixed-width flat file data into SQL Server, and the major concern is that the data is very critical.
The data should be loaded row by row, and each row has certain specifications; for example, row 1 holds one set of header details and row 2 holds another part of the details, and so on.
Finally, the most critical point is that some portions of the data in my file come in different segments with different delimiters. How can I handle these different delimiters in a single file, and how can I load the data from this file into SQL Server?
Please provide your valuable suggestions. Thanks in advance.
Wow, it sounds like your file layout is a mess. Here are two options.
1 - Load the data into an SSIS buffer as a blob of text. Write custom transformations to fix the mess; this might even involve C# scripting and multiple passes over the data.
Output the formatted data to your target, SQL Server.
This is called ETL - extract, transform, load.
2 - Load the data directly into SQL Server as a blob of text in a staging table. Write the transformations in T-SQL as stored procedures, and kick off the stored procedures from SSIS to fix the mess.
This is called ELT - extract, load, transform.
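For option 2, a very rough sketch of the staging-and-parsing step; the record-type markers, positions, delimiters, and table names are all assumptions about a layout only you know:

-- Each raw line lands in a one-column staging table, then T-SQL picks it apart.
CREATE TABLE dbo.Staging_RawLines (LineId INT IDENTITY(1,1), RawLine VARCHAR(MAX));

-- Header rows (assumed to start with 'H') are parsed with fixed-width SUBSTRING calls.
INSERT INTO dbo.FileHeader (BatchDate, VendorCode)
SELECT SUBSTRING(RawLine, 2, 8), SUBSTRING(RawLine, 10, 5)
FROM dbo.Staging_RawLines
WHERE LEFT(RawLine, 1) = 'H';

-- Detail rows (assumed to look like 'D|value1|value2') use a pipe delimiter instead.
INSERT INTO dbo.FileDetail (Field1, Field2)
SELECT SUBSTRING(RawLine, 3, CHARINDEX('|', RawLine, 3) - 3),
       SUBSTRING(RawLine, CHARINDEX('|', RawLine, 3) + 1, 100)
FROM dbo.Staging_RawLines
WHERE LEFT(RawLine, 1) = 'D'
  AND CHARINDEX('|', RawLine, 3) > 0;  -- guard against malformed detail rows

Each segment type gets its own parse, so different delimiters within one file stop being a problem once the raw lines are sitting in the staging table.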
Again, you are very vague with this question. I can only suggest design patterns.

SSIS DataFlow from Access to MSSQL

I have a simple Data Flow with two components: the source, which is an .mdb file, and the destination, which is an MSSQL database.
The idea is to migrate the data from one to another.
The problem is that the data is extracted from an Access query, and one column has ~1000 characters, but in SSIS the external column has the default length of 255 in the advanced properties, so when I execute the task it tries to truncate it. Disabling the error on truncation is not an option, and modifying the length of the external column cannot be done; it throws an error regarding the metadata.
First of all, can anyone explain WHY?
Second of all, I need a resolution, and I need it fast, because it's kind of driving me crazy.
This kind of problem occurs because the SSIS task "guesses" the length of the column by inspecting the first 100 (AFAIK) rows. So if all the rows from 1 to 100 have a length of 10 and row 101 has a length of 11, the task will fail, because the length was "guessed" as 10.
Modifying the column throws an error because you have ValidateExternalMetadata set to true. To solve this problem, go to the advanced options of your import task (the Access source) and set that value to false.
This means the task will accept the modified values you entered without checking them.
Did you try the SSIS Import and Export Wizard to import the data from within the BI development environment? That is the easiest way with MS Access, as it not only imports the data but also saves the package. If you get an error during the import (using the wizard), please post it, as that helps further investigation. Also, as #stb suggested, try having the first record be over 1000 characters.
Access supports queries, which are the equivalent of views in MSSQL.
The column size is defined not by looking at a few results but by the default length of the column's data type.
I created another table with the desired data types, and before the data flow I put two SQL scripts in the package: one to delete all the data in the table and one to execute the query against that table, so as to treat it as a temporary table.
Then the actual data flow is executed against this pseudo-temporary table.
This solved my problem.
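The two pre-data-flow SQL steps might look roughly like this (the table and query names are placeholders, not the ones from the actual package):

-- Step 1: empty the pseudo-temporary table.
DELETE FROM StagingWideColumn;

-- Step 2: re-fill it from the original Access query, so the data flow
-- can read a plain table whose column metadata SSIS is happy with.
INSERT INTO StagingWideColumn (ID, LongText)
SELECT ID, LongText
FROM qryOriginalAccessQuery;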