How to add a row in a CSV file in pentaho data integration - csv

I need to add a row of data to a CSV file using Pentaho Data Integration.
I've tried with this transformation
This is my CSV file input configuration
and this is the CSV file output configuration (with the "append" check activated ...)
My constant definition
and this is my CSV file sample
I'd like to have this
Any suggestion will be appreciated!

You can use the Data grid step to create your constant data and the Append streams step to merge the two streams into one in your desired order (the fields in both streams must have matching data types and be in the same order). Then you can write the data to a CSV file. If you don't need a header in the CSV file, you can uncheck the "Header" option in the Content tab.
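Outside PDI, the same idea (a constant row placed before or after the rows read from the file, then written out with or without a header) can be sketched in plain Python. This is only an illustration of the logic, not something PDI runs; the function names and the sample rows are made up for the example.

```python
import csv
import io
from itertools import chain

def append_streams(head_rows, tail_rows):
    """Return head_rows followed by tail_rows, checking the layouts match."""
    merged = list(chain(head_rows, tail_rows))
    widths = {len(row) for row in merged}
    if len(widths) > 1:
        # Mirrors the constraint that both streams need the same fields.
        raise ValueError("both streams must have the same number of fields")
    return merged

def write_csv(rows, header=None):
    """Serialize rows to CSV text; leave header as None to omit the header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical data: one constant row (the "Data grid") placed before
# the rows that came from the existing CSV file.
constant_rows = [["0", "constant", "row"]]
file_rows = [["1", "a", "x"], ["2", "b", "y"]]
merged = append_streams(constant_rows, file_rows)
csv_text = write_csv(merged)
```

Swapping the two arguments of append_streams puts the constant row at the end instead of the start, which corresponds to changing the head and tail hops in the Append streams step.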

Related

Using a variable as the input for a Conditional Split control

Might be going about this completely the wrong way - happy to be shown the error of my ways.
In a nutshell, I've got 50-odd files of mixed types (csv and excel) that I want to import (each file to its own table) to an SQL database.
In the control flow I've got an sql task that returns:
The source data filename
The source data filetype (csv / xlsx)
What I want to name the table to import to.
This object gets passed to a Foreach loop that loops through this object and puts these 3 fields into variables.
I want to then say "if the filetype variable is csv, go and do a flat file import. If it's .xlsx, go and do an excel import"
So inside my for each container I've got a dataflow task.
I want the first thing the dataflow task does to check the filetype variable, and then do the appropriate import.
I think it's got to be in the dataflow, because there isn't an "If" style control I can see in the control flow?
But I'm at a loss as to how I pass a variable into the conditional split.
Any thoughts welcome.
OR! - just had a thought. Is the best way to do this to get a list of all the csv file types, process them in a dataflow, then get a list of all the .xlsx ones and process them - so I'd have:
Get csv filenames & tablenames
for each to loop through these
dataflow to import data from csv
get xlsx filenames and tablenames
for each through these
dataflow to import data from xlsx.
Just doesn't seem as elegant?
Cheers

how to save dataframe into csv pyspark

I am trying to save dataframe into hdfs system.
It gets saved as part-0000 and into multiple parts.
I want to save it as an excel sheet or just one part file?
How can we achieve this?
code used so far:
df1.write.csv('/user/gtree/tree.csv')
Your dataframe is being saved based on its partitions (multiple partitions = multiple files). You can coalesce your partitions down to 1, so that only one file is written.
Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.coalesce
df1.coalesce(1).write.csv('/user/gtree/tree.csv')
You can use .repartition(1) to set the partitions to only 1
df.repartition(1).write.csv(filePath)
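Note that even with a single partition, Spark writes a directory that contains one part file rather than a bare file. If the output has already been copied to a local filesystem, an alternative is to concatenate the part files afterwards. The sketch below is a plain-Python post-processing step, not a Spark API; it assumes local paths (not HDFS) and part files written without a header line, and the helper name merge_part_files is made up for the example.

```python
import glob
import os
import shutil
import tempfile

def merge_part_files(output_dir, merged_path):
    """Concatenate every part-* file in output_dir into merged_path.

    Files are taken in sorted name order; marker files such as _SUCCESS
    and hidden checksum files do not match the pattern and are skipped.
    """
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(part_paths)

# Demo with a throwaway directory that imitates a two-part output.
with tempfile.TemporaryDirectory() as workdir:
    out_dir = os.path.join(workdir, "tree.csv")
    os.mkdir(out_dir)
    for index, text in enumerate(["1,oak\n", "2,elm\n"]):
        with open(os.path.join(out_dir, f"part-{index:05d}"), "w") as handle:
            handle.write(text)
    merged_file = os.path.join(workdir, "tree_merged.csv")
    merged_count = merge_part_files(out_dir, merged_file)
    with open(merged_file) as handle:
        merged_text = handle.read()
```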

Data Factory v2 - Generate a json file per row

I'm using Data Factory v2. I have a copy activity that has an Azure SQL dataset as input and an Azure Storage Blob as output. I want to write each row in my SQL dataset as a separate blob, but I don't see how I can do this.
I see a copyBehavior in the copy activity, but that only works from a file based source.
Another possible setting is the filePattern in my dataset:
Indicate the pattern of data stored in each JSON file. Allowed values
are: setOfObjects and arrayOfObjects.
setOfObjects - Each file contains single object, or line-delimited/concatenated multiple objects. When this option is chosen in an output dataset, copy activity produces a single JSON file with each object per line (line-delimited).
arrayOfObjects - Each file contains an array of objects.
The description talks about "each file", so initially I thought it would be possible, but now that I've tested them it seems that setOfObjects creates a line-separated file, where each row is written to a new line. The arrayOfObjects setting creates a file with a JSON array and adds each row as a new element of the array.
I'm wondering if I'm missing a configuration somewhere, or is it just not possible?
What I did for now is load the rows into a SQL table and run a foreach for each record in the table. Then I use a Lookup activity to get an array to loop over in a Foreach activity. The Foreach activity writes each row to a blob store.
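To make the difference concrete, here is a small Python sketch of the three output shapes discussed above: the line-delimited file (setOfObjects), the single JSON array (arrayOfObjects), and the one-file-per-row result that the Foreach workaround produces. It only imitates the shapes and does not call Data Factory; the sample rows, the function names and the choice of "id" as the file name are assumptions for illustration.

```python
import json

# Hypothetical rows standing in for the SQL dataset.
rows = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

def as_set_of_objects(records):
    """One file, one JSON object per line (line-delimited)."""
    return "\n".join(json.dumps(record) for record in records)

def as_array_of_objects(records):
    """One file holding a single JSON array with every row as an element."""
    return json.dumps(records)

def as_one_file_per_row(records, key="id"):
    """One document per row, keyed by a file name derived from the row."""
    return {f"{record[key]}.json": json.dumps(record) for record in records}

line_delimited = as_set_of_objects(rows)
single_array = as_array_of_objects(rows)
per_row_files = as_one_file_per_row(rows)
```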
For Olga's documentDb question, it would look like this:
In the lookup, you get a list of the documentid's you want to copy:
You use that set in your foreach activity
Then you copy the files using a copy activity within the foreach activity. You query a single document in your source:
And you can use the id to dynamically name your file in the sink. (you'll have to define the param in your dataset too):

How to Get Data from CSV File and Send them to Excel Using Pentaho?

I have a tabular csv file that has seven columns and containing the following data:
ID,Gender,PatientPrefix,PatientFirstName,PatientLastName,PatientSuffix,PatientPrefName
2 ,M ,Mr ,Lawrence ,Harry , ,Larry
I am new to pentaho and I want to design a transformation that moves the data (values of the 7 columns) to an empty excel sheet. The excel sheet has different column names, but should carry the same data, as shown:
prefix_name,first_name,middle_name,last_name,maiden_name,suffix_name,Gender,ID
I tried to design a transformation using the following series of steps, but it gives me errors at the end that I could not interpret.
What is the proper design to move the data from the csv file to the excel sheet in this case? Any ideas to solve this problem?
As @Brian.D.Myers mentioned in the comment, you can use the Select values step. Here is a step-by-step explanation of how to do it.
Select all the fields from CSV file input step.
Configure the select values step as follows.
In the Content tab of Excel writer step click on Get fields button and fill the fields. Alternatively you can use Excel output step as well.
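The renaming and reordering that the Select values step performs can be sketched in Python as follows. The column mapping is an assumption (the question does not say which source column feeds which target column); here middle_name and maiden_name are left empty and PatientPrefName is dropped. The sketch also trims the trailing spaces visible in the sample data and stops at a list of rows rather than writing an Excel file.

```python
import csv
import io

# Assumed mapping from source CSV columns to target sheet columns.
RENAME = {
    "PatientPrefix": "prefix_name",
    "PatientFirstName": "first_name",
    "PatientLastName": "last_name",
    "PatientSuffix": "suffix_name",
    "Gender": "Gender",
    "ID": "ID",
}
TARGET_COLUMNS = ["prefix_name", "first_name", "middle_name", "last_name",
                  "maiden_name", "suffix_name", "Gender", "ID"]

def select_values(csv_text):
    """Rename, reorder and trim fields; unmapped target columns stay empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = []
    for record in reader:
        target = dict.fromkeys(TARGET_COLUMNS, "")
        for source_name, target_name in RENAME.items():
            target[target_name] = record[source_name].strip()
        output.append([target[name] for name in TARGET_COLUMNS])
    return output

sample = (
    "ID,Gender,PatientPrefix,PatientFirstName,PatientLastName,"
    "PatientSuffix,PatientPrefName\n"
    "2 ,M ,Mr ,Lawrence ,Harry , ,Larry\n"
)
mapped = select_values(sample)
```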

fetching data from multiple file and loading it into raw file destination(raw file should be dynamic) in ssis

I have a source folder which contains 4 CSV files, each with a different number of columns. I need to fetch only 3 columns (the metadata of these 3 columns is the same in all 4 files) from each CSV and load them into a Raw File Destination, for all the files available in the source folder. The Raw destination output file name has to be the name of the input file being fetched plus a timestamp.
At the next level, I need to read this output raw file as a Raw source and insert the records into an OLE DB destination. The destination table also has to be dynamic.
For example, I have 4 CSV files called test1.csv (10 columns), test2.csv (8), test3.csv (6) and test4.csv (10), along with timestamps.
All 4 files have the columns position_id, asofdate and sumassured in common, and I want to load only these 3 columns to the raw destination. If I load test1.csv, then my raw destination output file name has to be RW_test1_20120119_222222.RW. Similarly, if I load the second file, the raw destination output is named after that file.
Thanks
Satish
As always, decompose your problem until you've got it into something you can manage.
Processing CSVs via queries
Following the two questions and answers below will result in a package with an OLEDB Connection Manager configured to operate on CSVs in the folder @[User::InputFolder]. 3 variables CurrentFileName, InputFolder and Query have been defined with an expression set on Query.
The expression for your @[User::Query] would look like "SELECT position_id, asofdate, sumassured FROM " + @[User::CurrentFileName]
Reference answers
SSIS FlatFile Acces via Jet
SSIS Task for inconsistent column count import?
At this point, your package should resemble the center piece below. Verify you can correctly enumerate all of the CSVs in the folder and the OLEDB query piece works.
RAW files
I'm not an expert on RAW file usage so there may be better ways of interacting with them. This will use the fourth variable, RawFileName. Set an expression on it like @[User::InputFolder] + "RawFile.raw" which would result in the file being written to C:\ssisdata\so\satishkumar\RawFile.raw
My general approach is to have a dataflow with a script task that sends no rows into a RAW File Destination.
Configure your destination as
Access mode: File name from variable
Variable name: User::RawFileName
Write option: Create Always
Process CSVs
The concept here is to append all the data into the RAW file that was created in the initial step.
Your source should already be configured as
OLE DB connection manager: FlatFile
Data access mode: SQL command from variable
Variable name: User::Query
Configure your destination as
Access mode: File name from variable
Variable name: User::RawFileName
Write option: Append
Extract from RAW
At this point, the foreach enumerator has completed and all the data has been loaded into the staging file. Now it is time to consume that and send data on to the destination.
Drag a Raw File Source Transformation onto your data flow. Unsurprisingly, you will configure as
Access mode: File name from variable
Variable name: User::RawFileName
Instead of Simulate destination, wire it up to the correct data destination.
Caveat
Be careful when using an expression with GETDATE/GETUTCDATE to define filenames, as they are constantly re-evaluated. In 2005, we used FileName_HHMMSS and had issues because processing didn't complete within the same second between the creation of a file and the next task that consumed it. I have had better success using a dynamic but fixed starting point, and generally that is the system variable StartTime, @[System::StartTime]
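The naming rule from the question combined with this caveat can be illustrated in Python: the timestamp is captured once and reused for every file, instead of being re-read each time a name is built. This is only a sketch of the idea, not SSIS expression syntax; the helper name raw_file_name is made up, and the fixed datetime stands in for the package start time.

```python
from datetime import datetime
from pathlib import PurePath

def raw_file_name(input_path, start_time):
    """Build RW_<input name>_<YYYYMMDD_HHMMSS>.RW from a fixed start time."""
    stem = PurePath(input_path).stem  # "test1.csv" -> "test1"
    return f"RW_{stem}_{start_time:%Y%m%d_%H%M%S}.RW"

# Captured once, like the package start time, so every name built during
# the run carries the same timestamp no matter how long processing takes.
start_time = datetime(2012, 1, 19, 22, 22, 22)
names = [raw_file_name(path, start_time) for path in ["test1.csv", "test2.csv"]]
```

Because the producer and the consumer of a file both derive the name from the same stored value, they always agree on it, which is the failure the FileName_HHMMSS approach ran into.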
You can use ForEach Loop Container on the Control Flow Diagram to iterate txt and csv files.