convert Image column is very slow - ssis

I want to convert data from an old database to a new database with a new structure.
In the old database I have an attachment table that must be converted to the attachment table in the new database.
The old database attachment table structure is below:
Attachment (ID int, Image Image, ...)
and the new database attachment table structure is below:
Attachment (ID int, Image Image, OldID int, ...)
Each time the convert package executes, it should copy only data that does not already exist (new data) from the old database to the new database.
I use the approach below to do it:
a lookup between the old table and the new table (ID --> OldID) to check for existing records.
When I run the SSIS package, SSIS first caches all lookup and source component data in memory and then executes the package. The source data in this package is very large, so the package runs very slowly. After the lookup that checks for existing records, I want to fetch the Image column data from the old database only for the new records. But if I add another Lookup component to get the Image column from the old database, SSIS caches that lookup's data as well, and the execution time does not improve. What should I do?
Thanks in advance.

Are you sure you're thinking this through correctly? SSIS should not be slow even if the amount of data you are loading is huge.
Your Lookup component needs to make sure it's not doing anything it doesn't need to. If you are pointing it at the table in the new database, change it to a SQL query at once. In this query you only need to SELECT OldId FROM tbl, and point the incoming ID from the old database at this. Your data flow should contain ID and Image from the old database, mapped as ID -> OldId and Image -> Image in your OLE DB Destination. No more is needed for an "insert new rows only" operation like the one you are doing here.
For this job, there is no need for any custom code or dynamic SQL. You -do- want to get the ID and Image from your source system in the data flow (unless you have major network bottlenecks to sort out); doing a RBAR lookup to get the image data from the old system is a very backwards way of thinking about your ETL.

Select only ID from the source table
Do the lookup in the destination DB as you do now
For its No Match output, do a lookup in the source table with Cache mode set to No cache, which will append Image to the flow.
In this case each image will be fetched separately, which may affect performance.

You may also do it in two Data Flows.
In the first:
Select only ID from the source table
Do the lookup in the destination DB as before
Store the new IDs in a string variable IdListToBeFetched as a comma-separated list, using a Script Component as the destination, with code similar to:
using System.Text;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    StringBuilder sb;

    public override void PreExecute()
    {
        base.PreExecute();
        sb = new StringBuilder();
    }

    public override void PostExecute()
    {
        base.PostExecute();
        Variables.IdListToBeFetched = sb.ToString().TrimEnd(',');
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (!Row.ID_IsNull)
        {
            sb.AppendFormat("{0},", Row.ID);
        }
    }
}
In the second Data Flow, set the SQL command of the source to a dynamically generated query, via an expression similar to "select ID, Image from Attachment where ID in (" + @[User::IdListToBeFetched] + ")", and set DelayValidation = True. It will fetch all the images in a single select, which should be faster.
To set the dynamically generated query as the SqlCommand in sources like the ADO NET Source or ODBC Source:
select the Expression property of the Data Flow Task containing your source
find the property [your source name].[SqlCommand] and set the expression there
To set the dynamically generated query as the SQL command in an OLE DB Source (taken from Jamie Thomson's blog):
Create a new variable called SourceSQL
Open up the properties pane for SourceSQL variable (by pressing F4)
Set EvaluateAsExpression=TRUE
Set Expression to "select ID, Image from Attachment where ID in (" + @[User::IdListToBeFetched] + ")"
For your OLE DB Source component, open up the editor
Set Data Access Mode="SQL Command from variable"
Set VariableName = "SourceSQL"

Related

C# or BIML code for inserting records into db

I want to insert values into a database when the Biml code is run and the package has completed expansion. Is this possible using Biml or C#?
I have a table called BIML expansion created in my DB, and I have test.biml which loads the package test.dtsx. Whenever the Biml expansion is completed, a record should be inserted into my table recording that the expansion has been completed.
Let me know if you have any questions or need any additional info.
From comments
I tried your code
string connectionString = "Data Source=hq-dev-sqldw01;Initial Catalog=IM_Stage;Integrated Security=SSPI;Provider=SQLNCLI11.1";
string SrcTablequery = @"INSERT INTO BIML_audit (audit_id,Package,audit_Logtime) VALUES (@audit_id, @Package, @audit_Logtime)";
DataTable dt = ExternalDataAccess.GetDataTable(connectionString, SrcTablequery);
It throws the error "must declare the scalar variable @audit_id". Can you let me know the issue behind it?
In its simplest form, you'd have content like this in your Biml script:
// Define the connection string to our database
string connectionStringSource = @"Server=localhost\dev2012;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;Provider=SQLNCLI11.1";
// Define the query to be run after *ish* expansion
string SrcTableQuery = @"INSERT INTO dbo.MyTable (BuildDate) SELECT GETDATE()";
// Run our query; nothing populates the data table
DataTable dt = ExternalDataAccess.GetDataTable(connectionStringSource, SrcTableQuery);
Plenty of different ways to do this - you could have spun up your own OLE/ADO connection manager and used the class methods. You could have pulled the connection string from the Biml Connections collection (depending on the tier this is executed in), etc.
Caveats
Depending on the product (BimlStudio vs BimlExpress), there may be a background process compiling your BimlScript to ensure all the metadata is ready for intellisense to pick it up. You might need to stash that logic into a very high tiered Biml file to ensure it's only called when you're ready for it. e.g.
<#@ template tier="999" #>
<#
// Define the connection string to our database
string connectionStringSource = @"Server=localhost\dev2012;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;Provider=SQLNCLI11.1";
// Define the query to be run after *ish* expansion
string SrcTableQuery = @"INSERT INTO dbo.MyTable (BuildDate) SELECT GETDATE()";
// Run our query; nothing populates the data table
DataTable dt = ExternalDataAccess.GetDataTable(connectionStringSource, SrcTableQuery);
#>
Is that the problem you're trying to solve?
Addressing comment/questions
Given the query of
string SrcTablequery = @"INSERT INTO BIML_audit (audit_id,Package,audit_Logtime) VALUES (@audit_id, @Package, @audit_Logtime)";
it errors out due to @audit_id not being specified. Which makes sense: the query says it will supply three variables and none are provided.
Option 1 - the lazy way
The quickest resolution would be to redefine your query in a manner like this
string SrcTablequery = string.Format(@"INSERT INTO BIML_audit (audit_id,Package,audit_Logtime) VALUES ({0}, '{1}', '{2}')", 123, "MyPackageName", DateTime.Now);
I use the string library's Format method to inject the actual values into the placeholders. I assume that audit_id is a number and the other two are strings, thus the tick marks surrounding {1} and {2} there. You'd need to define a value for your audit id, but I stubbed in 123 as an example. If I were generating packages, I'd likely have a variable for my package name, so I'd reference that in my statement as well.
Option 2 - the better way
Replace the third line with .NET library usage, much as you see in heikofritz's answer on using parameters when inserting data into an Access database.
1) Create a database Connection
2) Open connection
3) Create a command object and associate with the connection
4) Specify your statement (use ? as your ordinal marker instead of named parameters, since this is OLE DB)
5) Create a parameter list and associate it with your values
Many, many examples out there beyond the one referenced, but it was the first hit. Just ignore the Access connection string and use your original value.
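Purely as an illustration of those steps, a parameterized version might look like the sketch below; the connection string is a placeholder and the audit values are stubbed, so adapt both to your environment.

using System;
using System.Data.OleDb;

// Hypothetical helper: inserts one audit row using positional (?) parameters,
// which is what the OLE DB provider expects instead of named parameters.
public static class BimlAuditHelper
{
    public static void InsertAuditRow(string connectionString, int auditId, string packageName)
    {
        const string sql =
            "INSERT INTO BIML_audit (audit_id, Package, audit_Logtime) VALUES (?, ?, ?)";

        using (var conn = new OleDbConnection(connectionString))
        using (var cmd = new OleDbCommand(sql, conn))
        {
            // Parameter names are ignored by OLE DB; binding is strictly by position.
            cmd.Parameters.AddWithValue("@audit_id", auditId);
            cmd.Parameters.AddWithValue("@Package", packageName);
            cmd.Parameters.AddWithValue("@audit_Logtime", DateTime.Now);

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}

In a Biml script you would inline the body of InsertAuditRow rather than define a class, but the connection, command, and parameter handling are the same.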

Write into Excel Destination from SSIS variables

I have 3 SSIS variables, namely name, age and gender, with initial values set. I want to write these values into an Excel sheet in one row. Later I will extend this to an array of records.
To do this I created an Excel connection pointing to the Excel sheet I want to write to.
I added a Data Flow Task, double-clicked it, and then added a Derived Column component to create a derived column for each of the above 3 variables. Inside the derived column editor I selected the above variables as new derived columns.
I then pipelined an Excel Destination component and mapped the sheet columns to the derived columns. I executed the SSIS package and it succeeds, but the variables are not written into the Excel sheet.
What am I doing wrong?
Again, you need a source. I gave you an "easy" solution. This is probably the best solution to your problem:
This time the source will be a script component (select Source).
Steps after you add Script Component:
Select Source
Go to Inputs and Outputs
Add your Output Columns (Don't forget about data types)
Go back to Script
Add your variables (Gender, Name and Age)
Go into Script
Add the following code
public override void CreateNewOutputRows()
{
    Output0Buffer.AddRow();
    Output0Buffer.Age = Variables.Age;
    Output0Buffer.Gender = Variables.Gender;
    Output0Buffer.Name = Variables.Name;
}
You need a source. The easiest would be to use a SQL connection.
Use a variable of type String named SQL.
Set SQL = "SELECT '" + name + "' as name, " + age + " as age, '" + gender + "' as gender"
Set your source to the SQL variable.
Connect this source to the destination and you should have 1 row with 3 columns.
Listing the steps clearly, as suggested by @KeithL:
Create an SSIS variable selectQueryVariables with String datatype.
Assign variable expression as
"SELECT '"+#[User::name]+"' as Name,'"+#[User::gender]+"' as Gender,"+(DT_WSTR,4 )#[User::age]+" as Age"
Add an OLE DB Source component, set the data access mode to SQL command from variable, and select the variable selectQueryVariables in the dropdown. Now the source is ready with 3 columns: Name, Age and Gender.
Pipeline this into the Excel Destination and map the source columns to the destination columns.

Prevent Duplicate headers in flat file destination - SSIS

I need some help.
I am importing some data into a .csv file from an OLE DB source. I don't want the headers to appear twice in the destination. If I uncheck the "Column names in first data row" property, the headers don't get populated on the first execution either.
Output as of now.
Col1,Col2
A,B
Col1,Col2
C,D
How can I make the package run in such a way that if the file is empty the headers get inserted, but if the execution happens again the headers are not included, just the data?
There was a similar thread, but I wasn't able to apply the solution because I didn't know how to use expressions to get the number of rows of the destination itself. It was long back, so I created a new one.
Your help is deeply appreciated.
-Akshay
Perhaps I'm missing something, but this works for me. I am not having the read-only trouble with ColumnNamesInFirstDataRow.
I created a package-level variable named AddHeader, type Boolean, and set it to True. I added a Flat File Connection Manager, named FFCM, and configured it to use a CSV output of 2 columns: HeadCount (int), AddHeader (boolean). In the properties for the Connection Manager, I added an Expression for the property ColumnNamesInFirstDataRow and assigned it a value of @[User::AddHeader].
I added a Script Task to test the size of the file. It has read/write access to the variable AddHeader. I then used this script to determine whether the file was empty. If your definition of "empty" is that it has a header row, I'd adjust the logic in the if check to match that length, as in the fragment after the script below.
public void Main()
{
    string path = Dts.Connections["FFCM"].ConnectionString;
    System.IO.FileInfo stats = null;
    try
    {
        stats = new System.IO.FileInfo(path);
        // checking length isn't bulletproof based on how the disk is configured
        // but should be good enough
        // http://stackoverflow.com/questions/3750590/get-size-of-file-on-disk
        if (stats != null && stats.Length != 0)
        {
            this.Dts.Variables["AddHeader"].Value = false;
        }
    }
    catch
    {
        // no harm, no foul
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
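If instead "empty" should mean "contains only the header row", the size test might become something like this fragment; the 39 is a hypothetical stand-in for the byte length of your actual header row plus line terminator, which you'd measure for your own file layout.

// headerLength is a hypothetical value measured for this specific file layout.
const long headerLength = 39;
if (stats != null && stats.Length > headerLength)
{
    this.Dts.Variables["AddHeader"].Value = false;
}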
I looped through twice to ensure I'd generate the append scenario
I deleted my file and ran the package and only had a header once.
The property that controls whether the column names will be included in the output file or not is ColumnNamesInFirstDataRow. This is a read-only property.
One way to achieve what you are trying to do would be to have two Data Flow Tasks on the Control Flow surface, preceded by a Script Task. These two Data Flow Tasks will be identical, except that they will refer to two different Flat File Connection Managers. Again, the only difference between those two would be different values for ColumnNamesInFirstDataRow: one true, the other false.
Use this Script Task to decide whether this is the first run or a subsequent run. Persist this information and check it within the script; you can keep a separate table for it, or use some log table to infer it, as in the sketch below.
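As a minimal sketch of that idea only, assuming a hypothetical control table dbo.HeaderControl (FileName, HeaderWritten) plus package variables LogDbConnectionString, TargetFilePath and IsFirstRun (none of which come from the original answer):

public void Main()
{
    // Hypothetical variables; wire these up in the Script Task editor.
    string connectionString = Dts.Variables["LogDbConnectionString"].Value.ToString();
    string targetFile = Dts.Variables["TargetFilePath"].Value.ToString();
    bool firstRun;

    using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
    using (var cmd = new System.Data.SqlClient.SqlCommand(
        "SELECT COUNT(*) FROM dbo.HeaderControl WHERE FileName = @f AND HeaderWritten = 1", conn))
    {
        cmd.Parameters.AddWithValue("@f", targetFile);
        conn.Open();
        // No prior 'header written' row means this is the first run.
        firstRun = (int)cmd.ExecuteScalar() == 0;
    }

    // Precedence constraints on this variable then enable either the
    // header-writing Data Flow Task or the headerless one.
    Dts.Variables["IsFirstRun"].Value = firstRun;
    Dts.TaskResult = (int)ScriptResults.Success;
}

You would also need an Execute SQL Task (or more code here) to set the HeaderWritten flag once the file has been created.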
The following solution worked for me. You can also try it:
Create three variables.
IsHeaderRequired
RowCount
TargetFilePath
Get the source row count using an Execute SQL Task and save it in the RowCount variable.
Add a Script Task. Give it the read-only variables TargetFilePath and RowCount, and the read/write variable IsHeaderRequired.
Edit the script and add the following lines of code:
string targetFilePath = Dts.Variables["TargetFilePath"].Value.ToString();
int rowCount = (int)Dts.Variables["RowCount"].Value;
System.IO.FileInfo targetFileInfo = new System.IO.FileInfo(targetFilePath);

if (rowCount > 0)
{
    if (targetFileInfo.Length == 0)
    {
        Dts.Variables["IsHeaderRequired"].Value = true;
    }
    else
    {
        Dts.Variables["IsHeaderRequired"].Value = false;
    }
}
Dts.TaskResult = (int)ScriptResults.Success;
Connect your Script Task to your Data Flow Task.
Click the Flat File connection manager (i.e. your target file), go to its properties, and add the following expressions:
Map ConnectionString to the variable "TargetFilePath".
Map ColumnNamesInFirstDataRow to "IsHeaderRequired".
Hope this helps
A solution ....
First, add an SSIS integer variable in the scope of the Foreach Loop or higher - I'll call this RowCount - and make its default value negative (this is important!). Next, add a Row Count to your Data Flow, and assign the result to the RowCount SSIS variable we just made. Third, select your Connection Manager (don't double-click) and open the Properties window (F4). Find the Expressions property, select it, and hit the ellipsis (...) button. Select the ColumnNamesInFirstDataRow property, and use an expression like this:
@[User::RowCount] < 0
Now, when your package starts, RowCount has the static value of -1 or another negative number. When the data flow starts for the first time in your loop, the ColumnNamesInFirstDataRow property will have a value of TRUE. When the first data flow completes, the row count (even if it's zero) is written to the RowCount variable. On the second iteration of the loop, the Connection Manager is then reconfigured NOT to write column names.

ssis 2005, write on excel files [duplicate]

I am working with SSIS 2008. I have a select query named sqlquery1 that returns some rows:
aq
dr
tb
This query is not implemented in the SSIS package at the moment.
I am calling a stored procedure from an OLE DB Source within a Data Flow Task. I would like to pass the data obtained from the query to the stored procedure parameter.
Example:
I would like to call the stored procedure by passing the first value aq:
storedProcedure1 'aq'
then pass the second value dr:
storedProcedure1 'dr'
I guess it would be something like a loop. I need this because the data generated by the OLE DB Source through the stored procedure needs to be sent to another destination, and this must be done for each record of sqlquery1.
I would like to know how to call the query sqlquery1 and pass its output to call another stored procedure.
How do I need to do this in SSIS?
Conceptually, your solution will look like this: execute your source query to generate your result set, store it in a variable, and then iterate through those results; for each row, call your stored procedure with that row's value and send the results to a new Excel file.
I'd envision your package looking something like this
An Execute SQL Task, named "SQL Load Recordset", attached to a Foreach Loop Container, named "FELC Shred Recordset". Nested inside there I have a File System Task, named "FST Copy Template" which is a precedence for a Data Flow Task, named "DFT Generate Output".
Set up
As you're a beginner, I'm going to try and explain in detail. To save yourself some hassle, grab a copy of BIDSHelper. It's a free, open source tool that improves the design experience in BIDS/SSDT.
Variables
Click on the background of your Control Flow. With nothing selected, right-click and select Variables. In the new window that pops up, click the button that creates a New Variable 4 times. The reason for clicking on nothing is that until SQL Server 2012, the default behaviour of variable creation is to create them at the scope of the current object. This has resulted in many lost hairs for new and experienced developers alike. Variable names are case sensitive so be aware of that as well.
Rename Variable to RecordSet. Change the Data type from Int32 to Object
Rename Variable1 to ParameterValue. Change the data type from Int32 to String
Rename Variable2 to TemplateFile. Change the data type from Int32 to String. Set the value to the path of your output Excel File. I used C:\ssisdata\ShredRecordset.xlsx
Rename Variable3 to OutputFileName. Change the data type from Int32 to String. Here we're going to do something slightly advanced. Click on the variable and hit F4 to bring up the Properties window. Change the value of EvaluateAsExpression to True. In Expression, set it to "C:\\ssisdata\\ShredRecordset." + @[User::ParameterValue] + ".xlsx" (or whatever your file and path are). What this does is configure the variable to change as the value of ParameterValue changes, which helps ensure we get a unique file name. You're welcome to change the naming convention as needed. Note that you need to escape the \ any time you are in an expression.
Connection Managers
I have made the assumption you are using an OLE DB connection manager. Mine is named FOO. If you are using ADO.NET the concepts will be similar but there will be nuances pertaining to parameters and such.
You will also need a second Connection Manager to handle Excel. If SSIS is temperamental about data types, Excel is flat out psychotic-stab-you-in-the-back-with-a-fork-while-you're-sleeping about data types. We're going to wait and let the data flow actually create this Connection Manager to ensure our types are good.
Source Query to Result Set
The SQL Load Recordset is an instance of the Execute SQL Task. Here I have a simple query to mimic your source.
SELECT 'aq' AS parameterValue
UNION ALL SELECT 'dr'
UNION ALL SELECT 'tb'
What's important to note on the General tab is that I have switched my ResultSet from None to Full result set. Doing this makes the Result Set tab go from being greyed out to usable.
You can observe that I have assigned the Variable Name to the variable we created above (User::RecordSet) and that the Result Name is 0. That is important, as the default value, NewResultName, doesn't work.
FELC Shred Recordset
Grab a Foreach Loop Container and we will use that to "shred" the results that were generated in the preceding step.
Configure the enumerator as a Foreach ADO Enumerator. Use User::RecordSet as your ADO object source variable and select "Rows in the first table" as your Enumeration mode.
On the Variable Mappings tab, you will need to select your variable User::ParameterValue and assign it the Index of 0. This will result in the zeroth element in your recordset object being assigned to the variable ParameterValue. It is important that you have data type agreement, as SSIS won't do implicit conversions here.
FST Copy Template
This is a File System Task. We are going to copy our template Excel file so that we have a well-named output file (it has the parameter name in it). Configure it as:
IsDestinationPathVariable: True
DestinationVariable: User::OutputFileName
OverwriteDestination: True
Operation: Copy File
IsSourcePathVariable: True
SourceVariable: User::TemplateFile
DFT Generate Output
This is a Data Flow Task. I'm assuming you're just dumping results straight to a file so we'll just need an OLE DB Source and an Excel Destination
OLEDB dbo_storedProcedure1
This is where your data is pulled from your source system with the parameter we shredded in the Control Flow. I am going to write my query in here and use the ? to indicate it has a parameter.
Change your Data access mode to "SQL Command" and in the SQL command text that is available, put your query
EXECUTE dbo.storedProcedure1 ?
I click the Parameters... button and fill it out as follows:
Parameters: @parameterValue
Variables: User::ParameterValue
Param direction: Input
Connect an Excel Destination to the OLE DB Source. Double-click it and, in the Excel Connection Manager section, click New... Determine whether you need the 2003 or 2007 format (.xls vs .xlsx) and whether you want your file to have header rows. For your File Path, put in the same value you used for your @[User::TemplateFile] variable and click OK.
We now need to populate the name of the Excel Sheet. Click that New... button and it may bark that there is not sufficient information about mapping data types. Don't worry, that's semi-standard. It will then pop up a table definition something like
CREATE TABLE `Excel Destination` (
    `name` NVARCHAR(35),
    `number` INT,
    `type` NVARCHAR(3),
    `low` INT,
    `high` INT,
    `status` INT
)
The "table" name is going to be the worksheet name, or precisely, the named data set in the worksheet. I made mine Sheet1 and clicked OK. Now that the sheet exists, select it in the drop down. I went with the Sheet1$ as the target sheet name. Not sure if it makes a difference.
Click the Mappings tab and things should auto-map just fine so click OK.
Finally
At this point, if we ran the package it would overwrite the template file every time. The secret is we need to tell that Excel Connection Manager we just made that it needs to not have a hard coded name.
Click once on the Excel Connection Manager in the Connection Managers tab. In the Properties window, find the Expressions section and click the ellipsis (...). Here we will configure the ExcelFilePath property, and the Expression we will use is
@[User::OutputFileName]
If your icons and such look different, that's to be expected. This was documented using SSIS 2012. Your work flow will be the same in 2005 and 2008/2008R2 just the skin is different.
If you run this package and it doesn't even start and there is an error about the ACE 12 or Jet 4.0 something not available, then you are on a 64bit machine and need to tell BIDS/SSDT that you want to run in 32 bit mode.
Ensure the Run64BitRuntime value is False. This project setting can be found by right clicking on the project, expand the Configuration Properties and it will be an option under Debugging.
Further reading
A different example of shredding a recordset object can be found on How to automate the execution of a stored procedure with an SSIS package?

Populate Derived Column with File's Date Modified

I'm a newcomer to .NET and SQL and am working on an SSIS package that is pulling data from flat files and inputting it into a SQL table. The part that I need assistance with is getting the Date Modified of the files and populating a derived column I created in that table with it. I have created the following variables: FileDate of type DateTime, FilePath of type String, and SourceFolder of type String for the path of the files. I was thinking that the Date Modified could be populated into the derived column within the Data Flow, using a Script Component? Can someone please advise whether I'm on the right track? I appreciate any help. Thanks.
A Derived Column Transformation can only work with Integration Services expressions. A Script Component would allow you to access the .NET libraries, and you would want to use the method that @wil kindly posted or go with the static methods in System.IO.File.
However, I don't believe you would want to do this in a Data Flow Task: SSIS would have to evaluate that code for every row that flows through from the file. On a semi-related note, you cannot write to a variable until the ... event is fired to signal the data flow has completed (I think it's OnPostExecute, but don't quote me), so you wouldn't be able to use said variable in a downstream Derived Column at any rate. You would, of course, just modify the data pipeline to inject the file modified date at that point.
What would be preferable, and is perhaps your intent, is to use a Script Task prior to the Data Flow Task to assign the value to your FileDate variable. Inside your Data Flow, then use a Derived Column to add the @FileDate variable into the pipeline.
// This code is approximate. It should work but it's only been parsed by my brain
//
// Assumption:
// SourceFolder looks like a path x:\foo\bar
// FilePath looks like a file name blee.txt
// SourceFolder [\] FilePath is a file that the account running the package can access
//
// Assign the last mod date to FileDate variable based on file system datetime
// Original code, minor flaws
// Dts.Variables["FileDate"].Value = File.GetLastWriteTime(System.IO.Path.Combine(Dts.Variables["SourceFolder"].Value,Dts.Variables["FilePath"].Value));
Dts.Variables["FileDate"].Value = System.IO.File.GetLastWriteTime(System.IO.Path.Combine(Dts.Variables["SourceFolder"].Value.ToString(), Dts.Variables["FilePath"].Value.ToString()));
Edit
I believe something is amiss with either your code or your variables. Do your values approximately line up with mine for FilePath and SourceFolder? Variables are case sensitive but I don't believe that to be your issue given the error you report.
This is the full Script Task. The design-time value for FileDate is 2011-10-05 09:06; the run-time value (Locals) is 2011-09-23 09:26:59, which is the last mod date for the c:\tmp\witadmin.txt file.
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

namespace ST_f74347eb0ac14a048e9ba69c1b1e7513.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };

        public void Main()
        {
            // Combine folder and file name, then pull the last write time from the file system
            Dts.Variables["FileDate"].Value = System.IO.File.GetLastWriteTime(
                System.IO.Path.Combine(
                    Dts.Variables["SourceFolder"].Value.ToString(),
                    Dts.Variables["FilePath"].Value.ToString()));
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}
C:\tmp>dir \tmp\witadmin.txt
Volume in drive C is Local Disk
Volume Serial Number is 3F21-8G22
Directory of C:\tmp
09/23/2011 09:26 AM 670,303 witadmin.txt