Azure Data Factory: filter which files to process (csv)

I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and files, with their respective file masks. I populate a variable at the beginning of the pipeline based on a parameter.
The Get Files activity goes to an SFTP location and grabs the files from a folder. Then I only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder contains other file types or scopes that are processed by other pipelines. If I don't filter, the next activities end up processing metadata for files they are not supposed to touch. In terms of efficiency and performance that's bad, and it is even causing some issues with files being left behind.
I've tried something like
@variables('FilesToSearch').contains(@endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel

The contains function can check whether a string contains a substring, so you can try an expression like @contains(item().name, 'Customer'),
and there is no need to create a variable.
Or use the endsWith function with this expression:
@or(endsWith(item().name, 'Customer.csv'), endsWith(item().name, 'Customer_Offices.csv'))
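For illustration only, here is the same filtering logic sketched in Python (file names taken from the example above; this is not ADF syntax):
files = ['Timestamp#Customer.csv', 'Timestamp#Customer_Offices.csv', 'Timestamp#Orders.csv']
# a plain suffix check needs no second argument
wanted = [f for f in files if f.endswith('Customer.csv') or f.endswith('Customer_Offices.csv')]
print(wanted)  # ['Timestamp#Customer.csv', 'Timestamp#Customer_Offices.csv']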

Related

Changing output dataset path in the transform function

Can we change the output dataset path dynamically in my_compute_function, as shown below?
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/my/dataset"),
    my_input=Input("/path/to/input"),
)
def my_compute_function(my_output, my_input):
    my_output.path = "new path"  # <-- can this be set dynamically?
    my_output.write_dataframe(
        my_input.dataframe()
    )
No, this is not possible. The reason is that the inputs/outputs/transforms are fixed at "CI-time" or "build-time". When you press "commit" in Authoring or you merge a PR, a CI job is kicked off.
In this CI job, all the relations between inputs and outputs are determined. Output datasets that don't exist yet are created, and a "jobspec" is added to them. A "jobspec" is a snippet of JSON that describes to Foundry how a particular dataset is generated.
Anytime you press the "build" button on a dataset (or build the dataset through a schedule or similar), the jobspec is consulted. It contains a reference to the repository, revision, source file and entry point of the function that builds this dataset. From there the build is orchestrated and kicks off, invoking your function to produce the final output.
This mechanism allows you to get a "static view" of the entire pipeline, which you can then visualize with Monocle, as you might have seen.
Depending on what your needs are, here are some solutions you might be able to use instead:
Tag the rows you're producing in your transform in some way, so that even though you put them into a single dataset, you can later select them by this tag/category (a sketch of this follows below)
If your set of categories does not change often, you can instead create the output datasets ahead of time and then filter the rows into the appropriate dataset they should go into.
The main drawback of the latter approach is that it's not very dynamic: if a new category shows up, you'll manually have to change the code to "triage" it into a new dataset before that data becomes available.
There are other solutions (ultimately it is possible to make API calls and manually adjust inputs/outputs, for instance), but they are more complex and undesirable from a maintenance perspective.
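A minimal sketch of the row-tagging option, assuming a Spark-based transform (the paths and the "region" column are hypothetical):
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F

@transform_df(
    Output("/path/to/tagged/dataset"),  # single, statically declared output
    source=Input("/path/to/input"),
)
def my_compute_function(source):
    # Add a category column instead of routing rows to separate datasets;
    # downstream consumers filter on it to select the rows they need.
    return source.withColumn(
        "category",
        F.when(F.col("region") == "EU", F.lit("eu")).otherwise(F.lit("other")),
    )
Each consumer then filters on the category column, and the input/output graph stays static.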

PsychoPy: how to avoid storing variables in the CSV file?

When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these, there are some variables I would like NOT to be included. There are some variables which I decided to include in the CSV, but many others automatically fell into it.
Is there a way to manually force (from the code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important, and I know I could just create an output file myself without using PsychoPy's, or easily clean it afterwards, but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
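For instance, dropping or reordering columns after the fact is a couple of lines of pandas (a sketch; the file and column names are hypothetical):
import pandas as pd

df = pd.read_csv('my_experiment.csv')
keep = ['participant', 'trials.thisN', 'response.rt']  # desired columns, in the desired order
df[[c for c in keep if c in df.columns]].to_csv('cleaned.csv', index=False)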
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you probably should be using; also consider these methods:
from psychopy import data

# call these on your ExperimentHandler / TrialHandler instances
data.ExperimentHandler.saveAsWideText(fileName='exp_handler.csv', delim='\t', sortColumns=False, encoding='utf-8')
data.TrialHandler.saveAsText(fileName='trial_handler.txt', delim=',', encoding='utf-8', dataOut=('n', 'all_mean', 'all_raw'), summarised=False)
Note the sortColumns and dataOut parameters.

How to get SSIS to select specific files in a directory and assign names to variables (File System Task)

I have the following scenario:
I have a remote server that every week gets loaded with 2 files, these files have the following name format:
"FINAL_NAME06Apr16.txt" and
"FINAL_NAME_F106Apr16.txt"
The "FINAL_NAME" / "FINAL_NAME_F1" part is fixed every time, but the date changes. Now, I need to pick up, copy to another directory, and rename these files, but I'm not sure how to get the file names into variables to operate with them, as I need to give a different name to each file.
How can I proceed? I'm pretty sure it has to be done by naming a variable with an expression, but I don't know how to do that part.
I think I need some function to calculate the rest of the filename. Maybe one approach could be to first handle the "FINAL_NAME_F1" file and then the "FINAL_NAME" one, since some wildcards will pick up both if I don't do it that way?
Cheers.
You can calculate the date but why go through that complexity?
A Foreach (File) Loop Container, FELC, will handle this just fine. Add two of them to your control flow.
The first one will use a file mask of FINAL_NAME_F1*.txt. Inside that FELC, use a File System task to copy/move/rename the file to your new location.
That first FELC will run, find the target file and move it. It will then look for the next file, find none and go on to the next task.
Create a second FELC, but this one will operate on FINAL_NAME*.txt. It's crucial that the first FELC runs first, as this file mask will match both FINAL_NAME_f1-2019-01-01.txt and FINAL_NAME-2019-01-01.txt. By ordering our operations this way, we reduce the complexity of the logic required.
Sample answer with a FELC to show where to plumb the various bits
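The same two-pass ordering, sketched in Python purely to illustrate why the specific mask must run before the general one (paths are hypothetical):
import glob
import os
import shutil

src, dst = r'C:\incoming', r'C:\processed'

# Specific mask first: once the F1 files have been moved out of src,
# the general mask can no longer match them.
for mask in ('FINAL_NAME_F1*.txt', 'FINAL_NAME*.txt'):
    for path in glob.glob(os.path.join(src, mask)):
        shutil.move(path, os.path.join(dst, os.path.basename(path)))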

Most efficient way to handle lists saved in a file (Python 3)

I have been looking for information on Google and Stack Overflow, but I didn't find a good solution.
I need to handle a list (add elements, delete elements...) that is saved in a file, so that the list isn't lost when execution finishes, because I need to run my Python script periodically. Here are the alternatives I found, but they each have problems:
shelve module: I can't find how to delete a single element from the list (such as list.pop()) instead of rewriting the whole list.
pprint.pformat(): to modify information, I need to rewrite the whole document and save the modified information; very inefficient.
json: tedious for just a list, and doesn't seem to solve my problem.
So, what is the best way to handle a list, doing things as easily as mylist.pop(), while keeping the changes in a file efficiently?
Since this has never been answered before, here is an efficient way: the pysos package can handle disk-backed lists with inserts/deletes in constant time.
pip install pysos
Code example:
import pysos
db = pysos.List('somefile')
db.append('saved in the file 0')
db.append('saved in the file 1')
db.append('saved in the file 2')
print(db[1]) # => 'saved in the file 1'
del db[1]
print(db[1]) # => 'saved in the file 2'
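Because the list is disk-backed, a later run of the script picks up exactly where the previous one left off, for example:
import pysos

# Re-opening the same file restores the saved list.
db = pysos.List('somefile')
print(len(db))  # whatever previous runs left behind
db.append('appended on a later run')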

JMeter: set property for each loop

I'm trying to create a test that loops depending on the number of files stored in one folder, then outputs results based on their file names. I'm thinking of using each file name as the name of its result, so I created something like this in a BeanShell PreProcessor:
props.setProperty("filename", vars.get("current_tc"));
Then use it for the name of the result:
C:\\TEST\\Results\\${__property(filename)}
"current_tc" is the output variable name of a ForEach controller. It returns different value on each loop. e.g loop1 = test1.csv, loop2 = test2.csv ...
I'm expecting that the result name will be test1.csv, test2.csv .... but the actual result is just test1.csv and the result of the other file is also in there. I'm new to Jmeter. Please tell me if I'm doing an obvious mistake.
Test Plan Image
The way of setting the property seems okayish; the question is where and how you are trying to use this C:\\TEST\\Results\\${__property(filename)} line, so a snapshot of your test plan would be very useful.
In the meantime I would recommend the following:
Check jmeter.log file for any suspicious entries, if something goes wrong - most probably you will be able to figure out the reason using this file. Normally it is located in JMeter's "bin" folder
Use a Debug Sampler and View Results Tree listener combination to check your ${current_tc} variable value; maybe it is the case of the variable not being incremented. See the How to Debug your Apache JMeter Script article to learn more about troubleshooting techniques.