Testing transforms in Foundry: How to quickly select some rows from the preview of an existing dataframe as example data?

Having developed a pipeline transform, I might want to write a test that applies the new transform to a given small dataset and compares the result with an expected result.
How do I conveniently select a small portion of an existing dataset, preferably including the schema and some rows of interest? Usually I am looking at a given dataset using a preview. Is it possible to select some rows and reuse them directly, either
within the source code, to create a PySpark DataFrame in my test file like
spark_session.createDataFrame([
    (0, 1, 2)
], ['col_a', 'col_b', 'col_c'])
or as a CSV file which I can upload to my test directory and load from within the test?
Basically, what is the most convenient way to obtain data for such tests from existing datasets, which might be too large to be exported as CSV directly?
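For illustration, here is a minimal sketch of such a test with a locally created SparkSession; my_transform is a hypothetical stand-in for the transform under test, and the input rows stand for rows copied by hand from the preview:

from pyspark.sql import SparkSession

def my_transform(df):
    # hypothetical stand-in for the real transform under test
    return df

def test_my_transform():
    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # rows copied by hand from the dataset preview
    input_df = spark.createDataFrame(
        [(0, 1, 2)],
        ['col_a', 'col_b', 'col_c'],
    )
    expected = [(0, 1, 2)]  # what the transform should produce for these rows

    result = my_transform(input_df)
    assert [tuple(r) for r in result.collect()] == expected

The CSV variant would replace createDataFrame with something like spark.read.csv('tests/fixtures/sample.csv', header=True, inferSchema=True), assuming the fixture file is committed next to the test.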

Related

SSIS - Excel pivoted file with 2 levels of dynamic columns

I have a file format from my customer to load.
The file is in pivoted format with 2 levels of columns (PointName & Size). The columns may also vary from time to time, as more points may get added later; this should be handled dynamically.
I know the SSIS Unpivot transformation may not work, due to the dynamic column nature (the file metadata may vary with each new file) and the 2 levels of columns.
I'd appreciate help if anyone has come across this kind of file load requirement.
Adding the expected output file/table format below.
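Outside SSIS, the dynamic part of such an unpivot is easy to sketch. Here is a rough pandas version; the file name, and the assumption that the two header rows hold PointName and Size with the first column as a row key, are guesses, since the actual layout isn't shown:

import pandas as pd

# read the file with a two-level column header (row 0 = PointName,
# row 1 = Size); the first column is assumed to hold the row key
df = pd.read_csv("pivoted.csv", header=[0, 1], index_col=0)

# stack both header levels into rows; newly added points appear
# automatically because the header is read dynamically, not hard-coded
long_df = (
    df.stack([0, 1])
      .rename_axis(["row_key", "point_name", "size"])
      .reset_index(name="value")
)
print(long_df)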

Sorting files with the same header names using SSIS

I have a folder with a lot of data files in it. I want to be able to loop through the files, look at the headers, and sort them into folders if they have the same headers. Is that possible to do in SSIS? If so, would anyone be able to point me in the direction of how to do this?
I am going to try and explain this as best I can without writing a book, as this is a multi-step process that isn't too complex but might be hard to explain with just text. My apologies, but I do not have access to SSDT at the moment, so I cannot provide images to aid here.
I would use the TextFieldParser class in Microsoft.VisualBasic.dll in a Script Task. This will allow you to read the header from the file into a string array. You can then build the string array into a delimited column and load an object variable with a DataTable that has been populated with two columns: the first column being the filename and the second being the delimited headers.
Once you have this variable you can load a SQL table with this information. (Optional to skip if you want to load the columns directly into SQL as you read them; your call.)
Once you have your SQL table you can create an enumerator for that dataset based on the unique headers column.
Then use a Foreach Loop task with a Script Task to enumerate through the unique header sets. Use a SQL task to assign the file names that belong to each unique header set.
Within the script, loop through the returned file names and apply the necessary logic to move the files to their respective folders.
This is a high-level overview, as I am assuming you are familiar enough with SSIS to understand the steps necessary to complete each one. If not, I can elaborate later in the day when I am able to get to my SSIS rig.
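For readers who just want the grouping logic rather than the SSIS plumbing, here is a rough Python sketch of the same idea; the folder names and the use of a header hash as the group key are assumptions:

import hashlib
import shutil
from pathlib import Path

src = Path("incoming")       # hypothetical folder of data files
dest_root = Path("sorted")   # one subfolder per distinct header

for f in sorted(src.glob("*.csv")):
    with f.open() as fh:
        header = fh.readline().strip()   # first line = delimited header
    # a short hash of the header makes a stable folder name, so files
    # sharing an identical header land in the same folder
    folder = dest_root / hashlib.md5(header.encode()).hexdigest()[:8]
    folder.mkdir(parents=True, exist_ok=True)
    shutil.move(str(f), str(folder / f.name))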

Access (possibly VBA): export to multiple files based on column content

I currently have a dataset that contains information for arrivals at over 100 airports. My end goal is to create an Excel file with a tab for each of those airports, containing only the information for that airport (multiple rows).
I can of course create 100 queries and export them in a macro. However, the list may change over time, and while I can amend the query that creates the initial file, I'd rather not have to tweak the downstream process each time.
I cannot amend the source file process, so I do not want to export 105 initial files each time.
I am looking for a process that will export based on the contents of the data.
As your question is pretty general, here's a general approach I might take (sketched in code below).
1. Query the data to get a distinct list of airports.
2. Loop over the list from #1, building a set of data just for that airport.
3. Export the data from #2 to Excel.
The details of how to do each of those will depend on what the initial dataset looks like, and are beyond the scope of what Stack Overflow is for.
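As a rough illustration of the pattern outside Access, here is a pandas sketch that derives the distinct list from the data itself and writes one tab per airport; the file and column names are assumptions, and writing .xlsx this way needs the openpyxl package installed:

import pandas as pd

df = pd.read_csv("arrivals.csv")   # hypothetical export of the source data

# one workbook, one tab per airport; the distinct list comes from the
# data itself, so newly added airports need no code changes
with pd.ExcelWriter("arrivals_by_airport.xlsx") as writer:
    for airport, group in df.groupby("airport"):   # 'airport' column assumed
        # Excel sheet names are capped at 31 characters
        group.to_excel(writer, sheet_name=str(airport)[:31], index=False)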

Define same output table across multiple transformations

I have 6 different input datasets. I want to run ETL over all 6 datasets so they all get transformed to the same output table (same columns and types).
I am using Pentaho (Spoon) to do this.
Is there a way I can define an output table schema to be used by all these transformations in Pentaho? I am using MySQL as my output database.
Thanks in advance.
Sounds like you need the Select Values step. Put one of those on the last hop of each dataset's path and make the metadata for the paths all look EXACTLY the same. Then you can connect the output from each Select Values step into a Table Output. All the rows from each set will be mixed together in no particular order.
This can be more challenging than it looks. Spoon will throw errors if any of the fields aren't exactly identical to the corresponding field from the other datasets. You'll have to find some way to get all the metadata from the datasets to be the same.
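The same "make the metadata identical, then merge" idea can be sketched in Python (the language used elsewhere on this page); here a conform() helper plays the role Select Values plays in Spoon, and the target schema is a made-up example:

import pandas as pd

# hypothetical shared output schema (column -> dtype)
target = {"id": "int64", "name": "string", "amount": "float64"}

def conform(df):
    # reorder the columns and cast the types so every input's
    # metadata is exactly identical before merging
    return df.reindex(columns=list(target)).astype(target)

frames = [
    pd.DataFrame({"id": [1], "name": ["x"], "amount": ["2.5"]}),
    pd.DataFrame({"amount": [3], "id": [2.0], "name": ["y"]}),
]

combined = pd.concat([conform(f) for f in frames], ignore_index=True)
print(combined)

The combined frame could then be loaded into the MySQL output table, for example with DataFrame.to_sql via a SQLAlchemy engine.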

Process a CSV file with multiple tables in SSIS

I'm trying to figure out if it's possible to pre-process a CSV file in SSIS before importing the data into SQL.
I currently receive a file that contains 8 tables with different structures in one flat file.
The tables are identified by a row containing the table name encapsulated in square brackets, i.e. [DOL_PROD].
The data is underneath in standard CSV format: headers first and then the data.
The tables are split by a blank line, and the pattern repeats for the next 7 tables.
[DOL_CONSUME]
TP Ref,Item Code,Description,Qty,Serial,Consume_Ref
12345,abc,xxxxxxxxx,4,123456789,abc
[DOL_ENGPD]
TP Ref,EquipLoc,BackClyLoc,EngineerCom,Changed,NewName
Is it possible to split it out into separate CSV files? Or to process it in a loop?
I would really like to be able to perform all of this with SSIS automatically.
Kind regards,
Adam
You can't do that with a Flat File Source and connection manager alone.
There are two ways to achieve your goal:
You can use a Script Component as the source of the rows and process the file there, doing whatever you want with it programmatically.
The other way is to read your flat file treating every row as a single column (i.e. without specifying a delimiter), and then, via Data Flow transformations, split rows, recognize table names, split flows, and so on.
I'd strongly advise you to use the Script Component, even if you have to learn .NET first, because the second option would be a nightmare :). I'd use a Flat File Source to extract lines from the file as a single column, and then work on them in a Script Component, rather than reading a "raw" file directly.
Here's a resource that should get you started: http://furrukhbaig.wordpress.com/2012/02/28/processing-large-poorly-formatted-text-file-with-ssis-9/
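To make the splitting idea concrete outside SSIS, here is a short Python sketch that writes each bracketed section to its own CSV file; the input file name is an assumption:

from pathlib import Path

out_dir = Path("split_tables")
out_dir.mkdir(exist_ok=True)

current = None  # file handle for the table currently being written
with open("multi_table.csv", newline="") as src:   # hypothetical input name
    for line in src:
        stripped = line.strip()
        if not stripped:
            continue                    # blank lines separate the tables
        if stripped.startswith("[") and stripped.endswith("]"):
            if current:
                current.close()
            # start a new output file named after the bracketed table name
            current = (out_dir / (stripped[1:-1] + ".csv")).open("w", newline="")
        elif current:
            current.write(line)         # header or data row of the current table
if current:
    current.close()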