Reading CSV columns dynamically in Pentaho Kettle

If I have a Table Input step with a query such as Select * from myTable
and it goes to a User Defined Java Class step, the following code allows me to grab the column names dynamically from the table.
RowMetaInterface rowMetaInterface = getInputRowMeta();
List myList = rowMetaInterface.getValueMetaList();
String colName;
for (int i = 0; i < myList.size(); i++) {
    colName = ((ValueMetaInterface) myList.get(i)).getName();
}
However, this code doesn't work if the first step is a CSV Input step. I have a variable for the CSV filename, so I can't use 'Get Fields' to pull in the columns. Is there a way I can read the CSV column names dynamically?

Not a solution, but some interesting hints:
http://diethardsteiner.github.io/pdi/2015/10/31/Transformation-Executor-Record-Groups.html
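Since the filename is only known at runtime, one workaround is to read the header row yourself inside the User Defined Java Class step with plain Java I/O. A minimal sketch, assuming this runs in processRow() of the UDJC step and that the transformation variable is named CSV_FILENAME with a comma delimiter (both are assumptions):
String path = getVariable("CSV_FILENAME", "");
try {
    java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.FileReader(path));
    // Read just the first line, which holds the column names
    String headerLine = reader.readLine();
    reader.close();
    String[] colNames = headerLine.split(",");
    for (int i = 0; i < colNames.length; i++) {
        logBasic("CSV column " + i + ": " + colNames[i]);
    }
} catch (java.io.IOException e) {
    throw new org.pentaho.di.core.exception.KettleException("Could not read CSV header", e);
}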

Related

Importing multiple 1D JSON arrays in Excel

I'm trying to import a JSON file containing multiple unrelated 1D arrays, each with a variable number of elements, into Excel.
The JSON I wrote is:
{
    "table": [1, 2, 3],
    "table2": ["A", "B", "C"],
    "table3": ["a", "b", "c"]
}
When I import the file using Power Query and expand the columns, it multiplies the previous entries each time I expand a new column.
Is there a way to solve this, so that the elements of each array are listed below each other and each array appears as a new column?
One method would be to transform each Record into a List and then create a table using the Table.FromColumns method.
This needs to be done from the Advanced Editor.
Read the code comments and explore the Applied Steps to better understand what is happening; the Help topics for the various functions will also be useful.
let
    //Change following line to reflect your actual data source
    Source = Json.Document(File.Contents("C:\Users\ron\Desktop\New Text Document.txt")),
    //Get Field Names (= table names)
    fieldNames = Record.FieldNames(Source),
    //Create a list of lists whereby each sublist is derived from the original record
    jsonLists = List.Accumulate(fieldNames, {}, (state, current) => state & {Record.Field(Source, current)}),
    //Convert the lists into columns of a new table
    myTable = Table.FromColumns(jsonLists, fieldNames)
in
    myTable
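Note that Table.FromColumns pads shorter columns with null, so arrays with different numbers of elements will still line up, each in its own column with its elements listed below one another.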

Write into Excel Destination from SSIS variables

I have 3 SSIS variables, namely name, age and gender, with initial values set. I want to write these values into an Excel sheet in one row. Later I will extend this to an array of records.
To do this I have created an Excel connection pointing at the Excel sheet I want to write to.
I added a Data Flow Task, opened it, and added a Derived Column component to create a derived column for each of the above 3 variables. Inside the Derived Column editor I selected the above variables as new derived columns.
I then connected an Excel Destination component and mapped the sheet columns to the derived columns. The SSIS package executes successfully, but the variables are not written to the Excel sheet.
What am I doing wrong?
Again, you need a source. I gave you an "easy" solution below, but this is probably the best solution to your problem:
This time the source will be a Script Component (select Source).
Steps after you add the Script Component:
Select Source
Go to Inputs and Outputs
Add your output columns (don't forget about data types)
Go back to Script
Add your variables (Gender, Name and Age)
Go into the script and add the following code:
public override void CreateNewOutputRows()
{
    Output0Buffer.AddRow();
    Output0Buffer.Age = Variables.Age;
    Output0Buffer.Gender = Variables.Gender;
    Output0Buffer.Name = Variables.Name;
}
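Note that the Variables.Age, Variables.Gender and Variables.Name properties are only generated if the three variables were added under ReadOnlyVariables on the Script page (the "Add your variables" step above); without that, the code will not compile.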
You need a source. The easiest would be to use a SQL connection.
Use a variable of type String named SQL.
Set SQL = "SELECT '" + name + "' as name, " + age + " as age, '" + gender + "' as gender"
Set your source to the SQL variable.
Connect this source to the destination and you should have 1 row with 3 columns.
Listing the steps clearly, as suggested by @KeithL:
Create an SSIS variable selectQueryVariables with String datatype.
Assign the variable expression as
"SELECT '" + @[User::name] + "' as Name, '" + @[User::gender] + "' as Gender, " + (DT_WSTR,4)@[User::age] + " as Age"
Add an OLE DB Source component, set the data access mode to SQL command from variable, and select the variable selectQueryVariables in the dropdown. Now the source is ready with 3 columns: Name, Age and Gender.
Pipeline this into an Excel Destination and map the source columns to the destination columns.

SSIS - Process a flat file with varying data

I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp etc.). However, there are different types of data groups and, even though each is fixed-length, the number of their fields varies depending on the data group type. The first three digits of a data group define its type. The number of data groups in each record also varies.
My idea is to have a staging table into which I would insert all the data groups. So two records like this,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
would produce three records in the staging table.
ID   Timestamp    Data
123  01/01/2016   12323456KKSD3467
123  01/01/2016   456SSGFED43520160101173802
987  02/01/2016   456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some would have their data broken down further). My question is how to break the flat file down into the staging table. I can split the entire record on the pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the Flat File Source, but rather in a script. The Data Flow is a Flat File Source feeding a Script Component, which in turn feeds an OLE DB Destination.
The Flat File Source output is just a single column containing the entire line. The Script Component contains an output column for each column in the staging table. The script looks like this.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    var splits = Row.Line.Split('|');
    // The first split is the fixed-length header; every following split is a data group
    for (int i = 1; i < splits.Length; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        // CultureInfo requires "using System.Globalization;" at the top of the script
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.

Can I use SpreadsheetGear to read from a CSV file without it formatting the cells?

Given a simple CSV file that consists of a string of digit characters and a date in UK format:
"00000000","01/01/2014"
and code to get the used cells:
IWorkbookSet workbookSet = SpreadsheetGear.Factory.GetWorkbookSet();
IWorkbook workbook = workbookSet.Workbooks.Open(@"C:\file.csv");
IRange cells = workbook.Worksheets[0].UsedRange;
when I access cells[0, 0].Text it gives the value as 0, because SpreadsheetGear treats it as numeric and the leading 0s are therefore meaningless. It will do the same for the date. I'm trying to manually construct a DataTable from the cells, but I need the original values from the file.
I tried:
SpreadsheetGear.Advanced.Cells.IValues cells = (SpreadsheetGear.Advanced.Cells.IValues)workbook.Worksheets[0];
var sb = new StringBuilder();
cells[0,0].GetText(sb);
but nothing is appended to the string builder.
How can I get access to the original file values?
SpreadsheetGear does not make available the original values as found in the CSV file (such as "00000000" in your case). You can only access cell data after it has been parsed and processed by SpreadsheetGear (i.e., after the above has been converted to a double value of 0). If you need the CSV's original values, you'll need to open the file yourself and manually process and parse it.
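A rough illustration of that manual route (a sketch only: it assumes every field is wrapped in double quotes as in the example above, and it does not handle embedded commas or escaped quotes):
using System.Data;
using System.IO;

static DataTable LoadCsvAsText(string path)
{
    var dt = new DataTable();
    foreach (string line in File.ReadLines(path))
    {
        // Split on commas and strip the surrounding quotes from each field
        string[] fields = line.Split(',');
        for (int i = 0; i < fields.Length; i++)
            fields[i] = fields[i].Trim('"');

        // Every column is a string, so values like "00000000" keep their leading zeros
        while (dt.Columns.Count < fields.Length)
            dt.Columns.Add("Column" + (dt.Columns.Count + 1), typeof(string));

        dt.Rows.Add(fields);
    }
    return dt;
}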
It sounds like you ultimately want a DataTable, but if you still need to create a workbook file from your CSV data, then once you've written a routine to manually open and parse each "cell" in your CSV file, you can enter each value into a worksheet as Text, so that it is preserved exactly as found in the CSV file. You can go about this in two ways:
1) Set IRange.NumberFormat to "#", which will treat any future input into that IRange as Text. Example:
worksheet.Cells["A1"].NumberFormat = "#";
worksheet.Cells["A1"].Value = "00000000";
2) Prepend your inputted value with a single apostrophe, which indicates that you want the input to be treated as text. Example:
worksheet.Cells["A1"].Value = "'00000000";
If you still need a DataTable at this point, you could use the IRange.GetDataTable(...) method to accomplish this. Because the cell data is stored as Text, your DataTable values should reflect these same values. Example:
DataTable dt = worksheet.Cells["A1"].GetDataTable(GetDataFlags.None);
(There is a GetDataFlags.FormattedText option, but this isn't really relevant in your case, since the cell data is stored as text anyway and so won't be formatted.)

Beanshell script to use data from CSV

I have dynamically created parameters using a Regular Expression Extractor and the BeanShell script below. I am creating parameters with Name = "pass_" + i.
Now I need to populate the value of these parameter fields from a CSV file. I have loaded a CSV file, and the login variable contains the value of the first row. The code below populates only the first value from the CSV file; I need it to iterate through the CSV file and populate the parameter fields with the subsequent values in the first column.
int count = Integer.parseInt(vars.get("pass_matchNr"));
for (int i = 1; i <= count; i++) { // regex counts are 1-based
    sampler.addArgument(vars.get("pass_" + i), vars.get("login"));
}
Try using a CSV Data Set Config element. You point it at the path of your CSV and can then reference each CSV column through a JMeter variable with ease. With each iteration, your JMeter variable will hold the value of the next row in your CSV. From there you can use vars.get("yourVar"); to feed this JMeter variable into your BeanShell script.
Alternatively, if you need the population from the CSV to be done in one pass, an option could be to use the CSV Data Set Config element and set up your first column and row to be a concatenation of all the values found in the CSV, for example 'ValueA,ValueB,ValueC'. You can then feed this variable into your JMeter script and parse it in BeanShell by splitting on ','. That will leave you with all the values found in your CSV, as in the sketch below.
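A minimal sketch of that approach (the variable name logins is an assumption; it holds a row such as "ValueA,ValueB,ValueC" coming from the CSV Data Set Config):
String[] logins = vars.get("logins").split(",");
int count = Integer.parseInt(vars.get("pass_matchNr"));
for (int i = 1; i <= count; i++) { // regex counts are 1-based
    // Pair each extracted parameter name with the matching CSV value
    sampler.addArgument(vars.get("pass_" + i), logins[i - 1]);
}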
If these 2 options are unsuitable, a final option would be to create your own custom Java method which you can then feed into your BeanShell script. For example, you could create a class which reads your CSV file and returns a string in the format you desire, as sketched below. For a detailed step-by-step guide on setting up custom functions in JMeter, refer to this article.
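As a sketch of that last approach (CsvFirstColumn is a hypothetical helper, not an existing JMeter API), a class like this could return the first column of every row as one comma-separated string, ready to split in BeanShell:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class CsvFirstColumn {
    // Collect the first column of each CSV row into one comma-separated string
    public static String read(String path) throws Exception {
        List<String> values = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line;
        while ((line = reader.readLine()) != null) {
            values.add(line.split(",")[0]);
        }
        reader.close();
        return String.join(",", values);
    }
}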