RapidMiner - Issue with data set meta data information list in Read CSV operator

I am using RapidMiner version 6 for data analysis. I am trying to read a CSV file with 6000 rows. When I configure the meta data information in the Read CSV operator, only the last entry (column) in the meta data information list is extracted. The process XML is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="6.1.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="C:\Users\jeganathan.velu\Desktop\Book1.csv"/>
        <parameter key="column_separators" value=","/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information">
          <parameter key="1" value="interest_rate_bps.true.integer.regular"/>
          <parameter key="1" value="Deposit.true.integer.regular"/>
          <parameter key="2" value="Location.true.nominal.regular"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
But the tool outputs only the last column (Location) instead of all three columns configured in the meta data information list. If I configure meta data for 10 columns, then only the tenth column is read from the CSV.
Requesting your help to find out if I am doing something wrong or whether this is a bug.
Thanks in advance,
Jeganathan Velu.

I see the problem in your process.
If you change the attribute type from 'regular' to 'attribute', you'll find it works. I believe 'regular' was the way that normal attributes used to be referred to, but this has since changed (at least in the Read CSV operator) to 'attribute'.
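For reference, the corrected list would presumably look like this (note that the original also repeats key="1"; each key is a zero-based column index and should be unique):
<list key="data_set_meta_data_information">
  <parameter key="0" value="interest_rate_bps.true.integer.attribute"/>
  <parameter key="1" value="Deposit.true.integer.attribute"/>
  <parameter key="2" value="Location.true.nominal.attribute"/>
</list>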

Referencing Project Parameters in BIML

I've been using Catherine W's post on creating project parameters in BIML with some luck. What I'm having a problem with, though, is setting the expression of a local parameter equal to the project parameter. It's most likely just an XML formatting issue, but I haven't found any examples of it out on the web and haven't figured it out on my own yet, so any suggestions would be most helpful.
Here's the definition of my project parameters that is in my environment BIML file.
<Projects>
  <PackageProject Name="ProjParams">
    <Parameters>
      <Parameter Name="AgentJobName" DataType="String"></Parameter>
      <Parameter Name="LoadType" DataType="String">Full</Parameter>
    </Parameters>
  </PackageProject>
</Projects>
Then under Packages \ Package I have the Variables. I am defining a user variable named LoadType and setting it to the LoadType parameter in an expression. (There's something in the package that wouldn't use package parameters, so I had to create a user variable.) I know the reference to @[$Package::LoadType] is incorrect, but that's what I'm trying to figure out. What should it be to get BIML to put in a package parameter?
<Variables>
  <Variable EvaluateAsExpression="true" DataType="String" IncludeInDebugDump="Exclude" Name="LoadType">@[$Package::LoadType]</Variable>
</Variables>
Thanks everyone!
It's working for me
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Projects>
    <PackageProject Name="so">
      <Parameters>
        <Parameter DataType="String" Name="ProjectParameter">Demo0</Parameter>
      </Parameters>
      <Packages>
        <Package PackageName="so_43721322" />
      </Packages>
    </PackageProject>
  </Projects>
  <Packages>
    <Package Name="so_43721322">
      <Parameters>
        <Parameter DataType="String" Name="PackageParameter">Demo1</Parameter>
      </Parameters>
      <Variables>
        <Variable Name="PackageParameter" DataType="String" EvaluateAsExpression="true">@[$Package::PackageParameter]</Variable>
        <Variable Name="ProjectParameter" DataType="String" EvaluateAsExpression="true">@[$Project::ProjectParameter]</Variable>
      </Variables>
    </Package>
  </Packages>
</Biml>
I create a project and a package level parameter and then create two variables within my package, each referencing the corresponding parameter (@[$Project::ProjectParameter] and @[$Package::PackageParameter]).
Am I missing some nuance?

How to work with folders using MSBuild?

I am using an MSBuild script to deploy SSRS reports. Previously all reports were in one folder, and I had written an MSBuild script to deploy those reports to the report server. Now we are maintaining reports at folder level, such as customer service, inventory, and invoice folders.
How can I deploy these individual folders to the report server? We need the same folder-level hierarchy on the report server.
Here is a recursive file copy example.
Save the below in a file called "FileCopyRecursive.msbuild" (or FileCopyRecursive.proj):
<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" DefaultTargets="AllTargetsWrapped">
  <PropertyGroup>
    <!-- Always declare some kind of "base directory" and then work off of that in the majority of cases -->
    <WorkingCheckout>.</WorkingCheckout>
    <WindowsSystem32Directory>c:\windows\System32</WindowsSystem32Directory>
    <ArtifactDestinationFolder>$(WorkingCheckout)\ZZZArtifacts</ArtifactDestinationFolder>
  </PropertyGroup>
  <Target Name="AllTargetsWrapped">
    <CallTarget Targets="CleanArtifactFolder" />
    <CallTarget Targets="CopyFilesToArtifactFolder" />
  </Target>
  <Target Name="CleanArtifactFolder">
    <RemoveDir Directories="$(ArtifactDestinationFolder)" Condition="Exists('$(ArtifactDestinationFolder)')"/>
    <MakeDir Directories="$(ArtifactDestinationFolder)" Condition="!Exists('$(ArtifactDestinationFolder)')"/>
    <RemoveDir Directories="$(ZipArtifactDestinationFolder)" Condition="Exists('$(ZipArtifactDestinationFolder)')"/>
    <MakeDir Directories="$(ZipArtifactDestinationFolder)" Condition="!Exists('$(ZipArtifactDestinationFolder)')"/>
    <Message Text="Cleaning done" />
  </Target>
  <Target Name="CopyFilesToArtifactFolder">
    <ItemGroup>
      <MyExcludeFiles Include="$(WindowsSystem32Directory)\**\*.doesnotexist" />
    </ItemGroup>
    <ItemGroup>
      <MyIncludeFiles Include="$(WindowsSystem32Directory)\**\*.ini" Exclude="@(MyExcludeFiles)"/>
    </ItemGroup>
    <Copy
      SourceFiles="@(MyIncludeFiles)"
      DestinationFiles="@(MyIncludeFiles->'$(ArtifactDestinationFolder)\%(RecursiveDir)%(Filename)%(Extension)')"
    />
  </Target>
</Project>
Then run this:
"C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe" /target:AllTargetsWrapped FileCopyRecursive.msbuild /l:FileLogger,Microsoft.Build.Engine;logfile=AllTargetsWrapped.log
BONUS!
Here is a "fun with files" MSBuild file.
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="AllTargetsWrapper" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <Target Name="AllTargetsWrapper">
    <CallTarget Targets="FunWithFilesTask" />
  </Target>
  <PropertyGroup>
    <WorkingCheckout>c:\windows\System32</WorkingCheckout>
  </PropertyGroup>
  <!-- ===================================================== -->
  <!--
  See:
  http://msdn.microsoft.com/en-us/library/ms164313.aspx
  *Identity      Value for the item specified in the Include attribute.
  *Filename      Filename for this item, not including the extension.
  *Extension     File extension for this item.
  *FullPath      Full path of this item including the filename.
  *RelativeDir   Path to this item relative to the current working directory.
  *RootDir       Root directory to which this item belongs.
  RecursiveDir   Used for items that were created using wildcards. This would be the directory that replaces the wildcard(s) statements that determine the directory.
  *Directory     The directory of this item.
  AccessedTime   Last time this item was accessed.
  CreatedTime    Time the item was created.
  ModifiedTime   Time this item was modified.
  -->
  <Target Name="FunWithFilesTask">
    <ItemGroup>
      <MyExcludeFiles Include="$(WorkingCheckout)\**\*.doesnotexist" />
    </ItemGroup>
    <ItemGroup>
      <MyIncludeFiles Include="$(WorkingCheckout)\**\*.ini" Exclude="@(MyExcludeFiles)" />
    </ItemGroup>
    <PropertyGroup>
      <MySuperLongString>@(MyIncludeFiles->'"%(fullpath)"')</MySuperLongString>
    </PropertyGroup>
    <Message Text="MySuperLongString=$(MySuperLongString)"/>
    <Message Text=" "/>
    <Message Text=" "/>
    <Message Text="The below items are good when you need to feed command line tools, like the console NUnit exe. Quotes around the filenames help with paths that have spaces in them. "/>
    <Message Text="I found this method initially from: http://pscross.com/Blog/post/2009/02/22/MSBuild-reminders.aspx Thanks Pscross! "/>
    <Message Text=" "/>
    <Message Text=" "/>
    <Message Text="Flat list, each file surrounded by quotes, with semicolon delimiter: "/>
    <Message Text=" @(MyIncludeFiles->'&quot;%(fullpath)&quot;')"/>
    <Message Text=" "/>
    <Message Text=" "/>
    <Message Text="Flat list, each file surrounded by quotes, no comma (space delimiter): "/>
    <Message Text=" @(MyIncludeFiles->'&quot;%(fullpath)&quot;', ' ')"/>
    <Message Text=" "/>
    <Message Text=" "/>
    <Message Text="Flat list, each file surrounded by quotes, with comma delimiter: "/>
    <Message Text=" @(MyIncludeFiles->'&quot;%(fullpath)&quot;', ',')"/>
    <Message Text=" "/>
    <Message Text=" "/>
    <Message Text="List of files using special characters (carriage return)"/>
    <Message Text="@(MyIncludeFiles->'&quot;%(fullpath)&quot;', '%0D%0A')"/>
    <Message Text=" "/>
    <Message Text=" "/>
  </Target>
</Project>
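Run it the same way as before; assuming you saved the bonus file as FunWithFiles.msbuild (the post never names it), the call would look something like:
"C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe" /target:AllTargetsWrapper FunWithFiles.msbuild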
Were you using the MSBuild Copy task to do your single-folder copy? If so, it shouldn't be a big deal to modify that same task just a bit to copy the entire folder structure. Something similar to this example:
<Copy SourceFiles="c:\src\**\*.txt" DestinationFolder="c:\dest\%(RecursiveDir)"></Copy>
%(RecursiveDir) is a type of well-known item metadata that will contain the value of the wildcard from the source files parameter. Here's a bit more description of MSBuild well-known item metadata:
MSBuild well known item metadata
Complete Example:
<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <Sources Include="c:\src\**\*.txt" />
  </ItemGroup>
  <Target Name="copy-folder">
    <Copy SourceFiles="@(Sources)" DestinationFolder="c:\dest\%(RecursiveDir)"></Copy>
  </Target>
</Project>
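Assuming the complete example is saved as copy-folder.proj (the file name is made up here), invoking the target would look something like this, with the MSBuild path adjusted to your installed framework version:
"C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe" copy-folder.proj /target:copy-folder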

BIML Flat File Format with VARCHAR(MAX) Column

I have so far successfully used BIML to auto-generate SSIS packages (from CSV to SQL Server). But I ran into problems wherever I have VARCHAR(MAX) columns in the flat file format.
The problem is that if I define a column of type AnsiString with size -1 in the flat file format, the output SSIS package shows the warning below:
"The metadata of the following output columns does not match the metadata of the external columns with which the output columns are associated."
If I click Yes, the problem fixes itself, but that would be my last option as I have 150 packages.
When I checked the Advanced options of the Flat File Source component, I can see a difference in data type for the column Comments: the External Columns show DT_TEXT whereas the Output Columns show DT_STR. :(
What I don't understand is why the Output Columns show a different data type only for VARCHAR(MAX) when all others are working fine. Aren't the output columns generated from the external columns?
Please see the BIML code below.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <FileFormats>
    <FlatFileFormat Name="MetadataFileFormat" RowDelimiter="LF" ColumnNamesInFirstDataRow="true" IsUnicode="false">
      <Columns>
        <Column Name="Category" DataType="AnsiString" Length="128" Delimiter="|" CodePage="1252" />
        <Column Name="Comments" DataType="AnsiString" Length="-1" Delimiter="|" />
        <Column Name="DisplayName" DataType="AnsiString" Length="256" Delimiter="CRLF" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>
  <Connections>
    <FlatFileConnection Name="FF_Test" FilePath="C:\Data\Sample.csv" FileFormat="MetadataFileFormat" />
  </Connections>
  <Packages>
    <Package Name="FFTest" ConstraintMode="Linear">
      <Tasks>
        <Dataflow Name="DFT Load Data">
          <Transformations>
            <FlatFileSource Name="FF_SRC" ConnectionName="FF_Test" />
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>
Within a data flow, a DT_STR is bounded between lengths of 0 and 8000. The Flat File Connection Manager is happy to let you specify a length greater than 8k.
However, when you try to use that in a data flow, the component is going to report that it's not a valid length.
And it makes sense if you know how SSIS gets its performance out of the data flow: it preallocates memory and does all the transformations in that memory space. How much memory would you allocate for a MAX type? Exactly...
So, you're going to need to use one of the stream data types: DT_TEXT or DT_NTEXT. Those allow for unlimited-length strings.
Biml
I'm actually stumped on this; hopefully Scott can chime in. The emitted DTSX will look like the "before" screenshot, with a data type of DT_STR and a length of zero. It runs fine, it just looks bad. When you double-click to let the editor fix it, it changes to DT_TEXT as it should.
I thought it was just going to be a matter of providing a data type override, as we can in an Execute SQL Task, but to no avail: it's not a property on the Columns collection of the flat file source.
Perhaps this was a situation where I needed to mess with the DataflowOverrides property...
<DataflowOverrides>
  <OutputPath OutputPathName="Output">
    <Columns>
      <Column ColumnName="Comments" DataType="AnsiString" CodePage="1252" Length="-1" />
    </Columns>
  </OutputPath>
</DataflowOverrides>
But no, that gave me no better result.
Fine, I gave up and "cheated" by using Mist/BimlOnline to reverse engineer the corrected package back into Biml.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <FlatFileConnection Name="FF_Test" FilePath="C:\ssisdata\SO\Input\so_35438946.txt" FileFormat="FF_Test" />
  </Connections>
  <Packages>
    <Package Name="so_35438946_re" Language="None" VersionBuild="1" CreatorName="BillFellows" CreatorComputerName="AVATHAR" CreationDate="2016-02-16T13:02:49">
      <Tasks>
        <Dataflow Name="DFT Load Data">
          <Transformations>
            <DerivedColumns Name="DER Placeholder">
              <InputPath OutputPathName="FF_SRC.Output" />
            </DerivedColumns>
            <FlatFileSource Name="FF_SRC" LocaleId="None" FileNameColumnName="" ConnectionName="FF_Test" />
          </Transformations>
        </Dataflow>
      </Tasks>
      <Connections>
        <Connection ConnectionName="FF_Test" />
      </Connections>
    </Package>
  </Packages>
  <FileFormats>
    <FlatFileFormat Name="FF_Test" CodePage="1252" TextQualifer="_x003C_none_x003E_" ColumnNamesInFirstDataRow="true" RowDelimiter="LF">
      <Columns>
        <Column Name="Category" Length="128" DataType="AnsiString" Delimiter="VerticalBar" MaximumWidth="128" />
        <Column Name="Comments" Length="-1" DataType="AnsiString" Delimiter="VerticalBar" />
        <Column Name="DisplayName" Length="256" DataType="AnsiString" Delimiter="CRLF" MaximumWidth="256" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>
</Biml>
And now I simply generate the SSIS package and... well, I suppose it's progress. Comments is identified as DT_TEXT, but I still get the warning.
Deep dive into the dtsx
In the data flow's flat file source, the external metadata collection for this column is defined as follows:
<externalMetadataColumn
  codePage="1252"
  dataType="str"
  name="Comments"
  refId="Package\DFT Load Data\FF_SRC.Outputs[Output].ExternalColumns[Comments]"></externalMetadataColumn>
In the one we let the editor adjust, it becomes
<externalMetadataColumn
  refId="Package\DFT Load Data\FF_SRC.Outputs[Output].ExternalColumns[Comments]"
  codePage="1252"
  dataType="text"
  name="Comments" />
and in the one emitted from VS 2013 using the original code, we get
<externalMetadataColumn
  codePage="1252"
  dataType="str"
  name="Comments"
  refId="Package\DFT Load Data\FF_SRC.Outputs[Output].ExternalColumns[Comments]">
</externalMetadataColumn>
It might be distasteful, but perhaps a bit of XSLT could find any instances where you have this named column with a data type of str and transform it to text.
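As a sketch of that idea (untested against a real .dtsx; it assumes the externalMetadataColumn elements appear un-namespaced, as in the snippets above), an XSLT identity transform with a single override might look like:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Identity template: copy every node and attribute through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Override: rewrite dataType="str" to "text" on the Comments external metadata column -->
  <xsl:template match="externalMetadataColumn[@name='Comments' and @dataType='str']/@dataType">
    <xsl:attribute name="dataType">text</xsl:attribute>
  </xsl:template>
</xsl:stylesheet>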
I didn't try it, but I found this in the Varigence documentation:
<!-- A Length of -1 will automatically be converted to nvarchar(max)/varchar(max) -->
<Column Name="LongString" DataType="String" Length="-1" />

How to manage a giant fixed-width file in SSIS?

I have a fixed-width file that is about 1200 characters wide and has 300+ columns. I'm looking for a way to create a fixed-width data source in SSIS without using the UI of the flat file connection manager. Is there a way to modify the column definitions without having to use the UI in SSIS? I can't find a file for the data connection anywhere in the project.
Am I doomed to manually add 300+ columns into the flat file connection manager one by one?
Two options come to mind. The first is to install BIDS Helper and use its Create Fixed Width Columns feature.
The other, as @ElectricLlama mentioned, is to use Biml. This too requires the installation of BIDS Helper, which converts a .biml file into a .dtsx (see the short walkthrough).
The Biml below approximates what you want: it creates a package with a flat file connection manager (with a single column), adds a data flow, and inside that consumes the flat file and wires it up to a Row Count. Just fill in the rest of the XML in the Columns tag; a sketch of what that looks like follows the listing.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <FlatFileConnection Name="FF dchess" FileFormat="FFF dchess" FilePath="C:\ssisdata\SO\Input\dchess.txt" />
  </Connections>
  <FileFormats>
    <FlatFileFormat Name="FFF dchess" CodePage="1252" RowDelimiter="CRLF" IsUnicode="false" FlatFileType="RaggedRight">
      <Columns>
        <Column Name="MyColumn" Length="08" DataType="AnsiString" ColumnType="FixedWidth" CodePage="1252" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>
  <Packages>
    <Package Name="dchess" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
      <Connections>
        <Connection ConnectionName="FF dchess" />
      </Connections>
      <Variables>
        <Variable Name="CurrentFileName" DataType="String">C:\ssisdata\so\Input\dchess.txt</Variable>
        <Variable Name="RowCountInsert" DataType="Int32">0</Variable>
      </Variables>
      <Tasks>
        <Dataflow Name="DFT Load file">
          <Transformations>
            <FlatFileSource Name="FF_SRC dchess" ConnectionName="FF dchess" RetainNulls="true" />
            <RowCount Name="CNT Source" VariableName="User.RowCountInsert" />
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>
The generated package looks like this (screenshot not reproduced):
Feel free to pick your jaw up off the ground ;)
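To give a feel for filling in the Columns tag, additional fixed-width columns would presumably follow the same pattern as MyColumn above (these names and lengths are made up for illustration):
<Columns>
  <Column Name="CustomerId" Length="10" DataType="AnsiString" ColumnType="FixedWidth" CodePage="1252" />
  <Column Name="OrderDate" Length="8" DataType="AnsiString" ColumnType="FixedWidth" CodePage="1252" />
  <Column Name="Amount" Length="12" DataType="AnsiString" ColumnType="FixedWidth" CodePage="1252" />
</Columns>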

Is it possible to import CSVs into a RapidMiner repository from the command line?

I'm considering using RapidMiner to store and analyse a collection of data gathered by a scripted process. Is there a way to import a CSV file into a RapidMiner repository from a command-line script?
Not directly. But you can create a process with the 'Read CSV' operator connected to a 'Store' operator and save that process in the repository. The process can then be called from the command line. If the file and the repository location are static and do not change, this is everything you need to do.
But to specify the input file and the repository location dynamically, you need macros. These macros can be set on the command line, but unfortunately they are only available in RapidMiner version 5.3, which is not yet released (but will be in a few weeks). In the meantime you can use the up-to-date version from the SourceForge SVN repository (Unuk branch).
Process to store CSV in the repository:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="190" width="413">
      <operator activated="true" class="read_csv" compatibility="5.3.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="%{csv-file}"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.000" expanded="true" height="60" name="Store" width="90" x="179" y="30">
        <parameter key="repository_entry" value="%{repository-location}"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Store" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>
Assuming that you have saved this process at //home/steve/csv-to-repository and your current directory is the RapidMiner installation directory, this is how you can call it from the command line:
./script/rapidminer //home/steve/csv-to-repository "-Mcsv-file=/path/to/your/csv/file" "-Mrepository-location=//repository/path/to/store/csv"