Pentaho Data Integration - Multiple CSV File Inputs - csv

I've been using Pentaho Data Integration lately and intend to use it on a project I'm working on. Here is what I need help with:
A folder can contain a variable number of CSV input files.
Is there a way (a step or a series of steps) to read all the .csv files in that folder using Pentaho?
After this step I believe what I have to do is pretty simple, as I only have to merge those files together.
Thanks

Use the "Text file input" step. It can select the files in a folder using a regular expression, and it can read CSV files.

Add the "Get File Names" step before the "CSV file input" step. Once the CSV step has an incoming stream, a field appears in its configuration dialog that lets you take the filename from that stream.
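For readers who want to see the logic behind both answers outside PDI, here is a minimal Python sketch: list the files in a folder whose names match a regular expression, then append their rows into one output. The function name `merge_csv_files` and the assumption that every input shares the same header row are mine, not part of the question.

```python
import csv
import re
from pathlib import Path


def merge_csv_files(folder, output_path, pattern=r".*\.csv$"):
    """Append every file in `folder` whose name matches `pattern` into one CSV.

    Assumes all inputs share the same header row; it is written only once.
    Returns the number of data rows written.
    """
    name_filter = re.compile(pattern, re.IGNORECASE)
    output_path = Path(output_path)
    # Collect inputs first, and never read the output file as an input.
    inputs = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and name_filter.match(p.name)
        and p.resolve() != output_path.resolve()
    )
    rows_written = 0
    header_written = False
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in inputs:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `merge_csv_files("input_folder", "merged.csv")` (both paths hypothetical) would pick up however many .csv files happen to be in the folder at run time.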

Related

How to Read CSV file using Power Automate?

I have added a CSV file to a SharePoint document library.
I need to read that CSV file using Power Automate (Flow).
I have created a Power Automate flow. Below is a screenshot of it.
Which CSV parser do I need to use to read the data from the file content action?
Can anyone help me with this?
Thanks
If you want to retrieve the content of the CSV without a premium connector, you can use an expression to convert the $content property of the Get File Content action into a string value. The base64ToString function does this.
For example:
base64ToString(outputs('Get_file_content')?['body']['$content'])
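To illustrate what that expression produces and what still has to happen afterwards, here is a small Python sketch of the same two stages: base64-decode the content into text, then split the text into rows. The helper names and the sample data are hypothetical, and using the `utf-8-sig` codec to drop a possible byte order mark is my assumption about the file, not something the flow does for you.

```python
import base64
import csv
import io


def decode_file_content(content_b64, encoding="utf-8-sig"):
    """Turn a base64 string into text, analogous to base64ToString in the flow."""
    return base64.b64decode(content_b64).decode(encoding)


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample standing in for the $content property.
sample_content = base64.b64encode(b"id,name\r\n1,Alice\r\n").decode("ascii")
rows = parse_csv_text(decode_file_content(sample_content))
```

The decoded string is still one block of text; in the flow, it would have to be split into lines and columns by further expressions or actions.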

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a Blob storage container, and these files feed a Mapping Data Flow (MDF). The individual files written by the third-party tool work great. If I copy these files to a different Blob folder using an Azure Copy Data activity, the MDF can no longer parse them and gives an error: "JSON parsing error, unsupported encoding or multiline." I started with the Merge Files copy behavior, but the outcome is the same regardless of which copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity's Merge Files behavior (JSON to JSON) definitely adds an EOL character after each JSON object as it is written to the merged file. I have also found that the MDF definitely fails when those EOL characters are present in the merged file. If I remove all EOL characters from the merged file, the same MDF works. To me, this is a bug: the Copy Activity is adding a character that breaks the MDF. There also seems to be a second issue in some of my data, which doesn't fail as individual files but does break the MDF when the files are concatenated. Still, I have tested the basic behavior on 1-5000 files and have been able to repeat the fail/success tests.
I took the original file and the copied file and ran them through all sorts of tests. This is what I eventually found when I opened them in Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
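As a way to repeat this comparison without an editor, the following Python sketch counts the line terminators in a downloaded copy of a file and rewrites CRLF as LF. It runs outside ADF and is only a diagnostic aid under that assumption; the function names are hypothetical and it says nothing about why the Copy Data activity writes what it writes.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF terminators and bare LF terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return {"crlf": crlf, "lf": lf}


def crlf_to_lf(data: bytes) -> bytes:
    """Return the same bytes with every CRLF replaced by a single LF."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical sample: one line ending in CRLF, one ending in LF.
sample = b'{"ID":"123456"}\r\n{"ID":"654321"}\n'
summary = count_line_endings(sample)
```

Reading the file in binary mode (`open(path, "rb")`) matters here, because text mode would translate the terminators before they can be counted.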
EDIT: NEW INFORMATION -- On further review, it seems the minification/whitespace removal may be the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF change masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- data preview returns same success or failure as pipeline when run
In the case of multiple documents merged by ADF, I get the below instead.
Repro: Create any valid JSON as a single file, put it in Blob storage, and use it as a source in a Mapping Data Flow with any sink operation. Create a second file with the same schema and get both to run in the same flow using wildcard paths. Then use a Copy Activity with Merge Files as the sink copy behavior and Array of Objects as the File pattern, and try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code's "Format Document" from the standard JSON extension and the VS 2019 "Unminify" command), and re-upload it. It should work now.
I don't know if you have already solved the problem, but I came across exactly the same issue 3 days ago and, after several tries, found a solution:
In the Copy Data activity, under the sink settings, choose "Set of objects" (instead of "Array of objects") as the File pattern, so that the merged JSON file contains each of the original small JSON documents on its own line.
In the MDF, after setting up the wildcard paths with the *.json pattern, select "Document per line" as the Document form under JSON Settings.
After that you should be good to go; at least it solved my problem. The CRLF that the Copy Data activity writes automatically with the "Array of objects" setting appears to be default behavior, and MSFT should provide an option to omit it in the future.
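To make the "one document per line" layout concrete, here is a minimal Python sketch that writes several small JSON documents in that shape and reads them back. It only illustrates the file format the two settings above agree on; it is not ADF code, the function names are hypothetical, and the sample documents are made up.

```python
import json


def merge_as_document_per_line(documents):
    """Serialize each document as minified JSON on its own line."""
    lines = (json.dumps(doc, separators=(",", ":")) for doc in documents)
    return "\n".join(lines) + "\n"


def read_document_per_line(text):
    """Parse one JSON document per non-empty line.

    str.splitlines() accepts both LF and CRLF terminators.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample documents.
docs = [{"ID": "1", "name": "A"}, {"ID": "2", "name": "B"}]
merged = merge_as_document_per_line(docs)
```

Because each line is parsed on its own, a reader written this way does not care whether the writer ended lines with LF or CRLF.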
According to my test:
1. The Copy Data activity does not change Unix (LF) line endings to Windows (CRLF).
2. The MDF can parse both Unix (LF) and Windows (CRLF) files.
Maybe something else is wrong.
By the way, I see there is a comma after "name":"Customer Name" in your original file; I deleted it before my test.

JMeter read the second sheet of CSV

How can I make JMeter read the second sheet of my CSV?
I want to use CSV Data Set Config.
Normally, it reads the first line of the first sheet but is there any way to be a bit more flexible?
The CSV file format doesn't have "sheets"; it is a plain-text file that uses delimiters to represent structured data.
If you are trying to get data from, e.g., a Microsoft Excel file, unfortunately you won't be able to do it using CSV Data Set Config. The easiest approach would be to export the data as separate plain-text CSV files.
If you can't do the export, you can still access the data in Excel files, but it will be a bit trickier, as you will have to use JSR223 Test Elements, the Groovy language, and the Apache POI libraries.
More information:
Busy Developers' Guide to HSSF and XSSF Features
How to Extract Data From Files With JMeter
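Currently you can't use CSV Data Set Config for that; you need to add external code, for example using Apache Commons CSV.
Download the jar file, place it in the lib folder under JMETER_HOME, and then write the code in a JSR223 element.
Examples exist; code to get the second record:
import java.io.FileReader;
import java.io.Reader;
import java.util.Iterator;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// "in" is a reserved word in Groovy, so the reader gets a different name
Reader reader = new FileReader("path/to/file.csv");
// the parser is Iterable, so take an Iterator to step through records
Iterator<CSVRecord> records = CSVFormat.RFC4180.parse(reader).iterator();
records.next(); // skip the first record
CSVRecord secondRecord = records.next();
// String columnOne = secondRecord.get(0);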

SSIS ForEachLoop editor for Excel files with any extension

Task: I'm trying to iterate through Excel files using a Foreach Loop container.
This worked until I had mixed extensions: it works as long as the file extension is xls or xlsx, but not both together.
Problem: I get errors when I try to iterate over files with both xls and xlsx extensions. Cannot acquire connection to connectionmanager.
For instance: I have abc.xls and agh.xlsx in a folder, and I have trouble iterating through the files using the Foreach Loop editor. I think I understand why it's happening, but can I write a script to do it, or how else can I complete this task successfully?
Any ideas?
You will need to add two Foreach Loop containers (FLCs) to iterate through the files: the first FLC processes only .xls files and the second processes only .xlsx files (or the other way around). Other than that, I don't think writing a script would be of any help. But I could be wrong.
Presuming all xls files have the same format and all xlsx files have the same format...
What you could also do is use one Foreach Loop to go through all the Excel files, then add a dummy task (an empty Script Task or Sequence Container) and connect it to two Data Flow Tasks, one for XLS and one for XLSX. Then add expressions on the lines (precedence constraints) between the dummy task and the Data Flow Tasks that check the extension. Something like:
LOWER(RIGHT(@[User::Filepath],4))==".xls"
LOWER(RIGHT(@[User::Filepath],4))=="xlsx"
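The reason the two expressions above compare against ".xls" and "xlsx" (without the dot) is that both look at exactly the last four characters of the path. A small Python sketch of the same check, with a hypothetical function name and made-up paths, makes that visible:

```python
def excel_branch(file_path):
    """Return which Data Flow Task a file would be routed to, or None."""
    suffix = file_path[-4:].lower()  # same idea as LOWER(RIGHT(path, 4))
    if suffix == ".xls":
        return "xls"
    if suffix == "xlsx":
        return "xlsx"
    return None  # neither expression is true, so neither branch runs


# Hypothetical file names taken from the question.
branches = [excel_branch(name) for name in ("abc.xls", "agh.xlsx")]
```

A file that matches neither test simply follows no branch, which is worth keeping in mind if the folder can contain other file types.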

Creating a CSV file with the Report Generation Toolkit in Labview

I want to create .csv files with the Report Generation Toolkit in LabVIEW.
They must be real .csv files that can be opened with Notepad or something similar.
Creating a .csv is not that hard; it's just a matter of adding the extension to the name of the file that's going to be created.
If I create a .csv file this way, it opens nicely in Excel, just the way it should, but if I open it in Notepad it shows all kinds of characters and doesn't even come close to the data I wrote to the file.
I create the files with the LabVIEW code below:
Link to image (can't post image yet because I've got too few points)
I know .csv files can be created with the Write to Spreadsheet VI, but I would like to use the Report Generation Toolkit because it makes it pretty easy to add columns and rows to the file, and that is something I really need.
You can use the Robust CSV package from the lavag.org forum to read and write 2D arrays to CSV files.
http://lavag.org/files/file/239-robust-csv/
Calling a file "csv" does not make it a CSV file. I have never used the toolkit to generate an Excel file, but I'm assuming it creates an XLS or XLSX file regardless of the extension you give it, which is why you're seeing gibberish (probably XLS, since it's been around for a while and I believe XLSX is XML, not binary).
I'm not sure what your problem is with the Write to Spreadsheet VI. It has an append input, so I assume you can use that to at least add rows directly to a file, although I can't say I have ever tried it. I would prefer handling all the data in memory explicitly, where you can easily use the array functions to add rows or columns to the array and then overwrite the entire file.
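The in-memory approach described above is language-independent, so here is a minimal Python sketch of it rather than LabVIEW code: keep the data as a 2D array, add rows or columns with ordinary list operations, then overwrite the whole file as plain text. The helper names and the sample values are hypothetical.

```python
import csv


def add_row(table, row):
    """Return a new table with `row` appended at the bottom."""
    return table + [list(row)]


def add_column(table, column):
    """Return a new table with one value from `column` appended to each row."""
    if len(column) != len(table):
        raise ValueError("column must have one value per existing row")
    return [row + [value] for row, value in zip(table, column)]


def write_csv(path, table):
    """Overwrite `path` with the whole table as plain-text CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(table)


# Hypothetical measurement data.
table = [["time", "voltage"], [0, 1.5]]
table = add_row(table, [1, 2.5])
table = add_column(table, ["unit", "V", "V"])
```

A file written this way contains only delimited text, so it reads the same in Notepad as it does in Excel.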