How to check whether a file is serial or partitioned using Ab Initio functions only

I need to resolve a parameter value depending on whether I have a serial file or a multifile. Below is the scenario:
I have created a generic graph where I have a reformat component just after the input file component. At run time I need to check whether the input file is serial or multi, and accordingly populate the layout of the reformat.
Hence, to achieve this, I am looking for a specific Ab Initio function.
Thanks

I think there is a function for this: m_fs_check.
You can use this function in the graph parameters and use the resolved value as a condition to determine the layout.

m_fs_check will check whether the directory is a serial or a multi directory. However, a user can still create a serial file in a multi directory. One option is to fire an m_ls -lt command: the result displays an 'M' flag which denotes that a file is a multifile; for serial files this flag remains blank (see the sketch below).
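A minimal shell sketch of that check, under the assumption that the multifile flag appears in the first field of the m_ls -lt listing (verify the column layout on your Co>Operating System version before relying on it):

# Hypothetical: inspect the flag field of the m_ls -lt listing
flags=$(m_ls -lt "$INPUT_FILE_PATH" | awk 'NR==1 {print $1}')
case "$flags" in
  *M*) echo "multifile" ;;  # 'M' flag present: partitioned (multi) file
  *)   echo "serial" ;;     # flag blank: serial file
esac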

Use m_expand($INPUT_FILE_PATH) in PDL at PSET level to identify the directory depth. If the depth is greater than one, it is a multifile; otherwise it is serial. Then use the resulting flag to drive the layout of your reformat; a sketch follows below.
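A hedged PDL sketch of that idea, assuming m_expand returns the partition paths as a whitespace-separated string (string_split and length_of are standard DML functions, but verify the exact behavior in your environment):

$[ if (length_of(string_split(m_expand($INPUT_FILE_PATH), " ")) > 1)
     "multifile"
   else
     "serial" ]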

Related

Azure Datafactory process and filter files to process

I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and the files with their respective file masks. I have populated a variable at the beginning of the pipeline based on a parameter.
The Get files activity goes to an sFTP location and grabs files from a folder. Then I only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder location contains more file types or scopes to be processed by other pipelines. If I don't filter, the next activities end up processing metadata of files that they are not supposed to. In terms of efficiency and performance that is bad, and it is even causing some issues with files being left behind.
I've tried something like
#variables('FilesToSearch').contains(#endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can check a string for a substring, so you can try an expression like @contains(item().name,'Customer')
and there is no need to create a variable.
Or use the endsWith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))
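For context, a sketch of where such a condition sits in a Filter activity definition, assuming the file list comes from a Get Metadata activity via its childItems output (the activity names FilterCustomerFiles and GetFiles are illustrative, not from the original pipeline):

{
  "name": "FilterCustomerFiles",
  "type": "Filter",
  "typeProperties": {
    "items": {
      "value": "@activity('GetFiles').output.childItems",
      "type": "Expression"
    },
    "condition": {
      "value": "@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))",
      "type": "Expression"
    }
  }
}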

How to get SSIS to select specific files in a directory and assign names to variables (File System Task)

I have the following scenario:
I have a remote server that every week gets loaded with 2 files; these files have the following name format:
"FINAL_NAME06Apr16.txt" and
"FINAL_NAME_F106Apr16.txt"
The FINAL_NAME / FINAL_NAME_F1 part is fixed every time, but the date changes. Now I need to pick these files, copy them to another directory and rename them, but I'm not sure how to get the names of the files into variables to operate with them, as I need to give a different name to each file.
How can I proceed? I'm pretty sure it has to be done by naming a variable with an expression, but I don't know how to do that part.
I think I need some function to calculate the rest of the filename. Maybe one approach could be to first rename the "FINAL_NAME_F1" file and then rename the "FINAL_NAME" one, since some wildcards would pick up both if I didn't do it that way?
Cheers.
You can calculate the date but why go through that complexity?
A Foreach (File) Loop Container, FELC, will handle this just fine. Add two of them to your control flow.
The first one will use a file mask of FINAL_NAME_F1*.txt. Inside that FELC, use a File System task to copy/move/rename the file to your new location.
That first FELC will run, find the target file and move it. It will then look for the next file, find none and go on to the next task.
Create a second FELC, but this one will operate on FINAL_NAME*.txt. It's crucial that the first FELC runs first, as this file mask will match both FINAL_NAME_F1-2019-01-01.txt and FINAL_NAME-2019-01-01.txt. By ordering our operations this way, we reduce the complexity of the logic required.
Sample answer with a FELC to show where to plumb the various bits
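For the rename step itself, the File System Task can be driven by a variable built with an expression. A minimal sketch, assuming the FELC maps the current file name (name and extension only) into a hypothetical variable User::FileName, and that a second hypothetical variable User::DestinationPath (with EvaluateAsExpression = True) feeds the task:

Expression on @[User::DestinationPath], both variable names being illustrative:
"D:\\Archive\\" + REPLACE(@[User::FileName], "FINAL_NAME_F1", "NEW_NAME_F1")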

Output index of ELKI

I am using ELKI to cluster data from a CSV file.
I use
-resulthandler ResultWriter
-out folder/
to save the output data.
But in the output I get some strange indexes:
ID=2138 0.1799 0.2761
ID=2137 0.1797 0.2778
ID=2136 0.1796 0.2787
ID=2109 0.1161 0.2072
ID=2007 0.1139 0.2047
The IDs are greater than 2000 even though I have fewer than 100 training samples.
DBIDs are internal; the documentation clearly says that you shouldn't make too many assumptions about them, because their implementation may change. The only reason they are written to the output at all is that some methods (such as OPTICS) may require cross-referencing objects by this unique ID.
Because they are meant to be unique identifiers, they are usually continuously incremented. The next time you click on "run" in the MiniGUI, you will get the next n IDs... so clearly, you clicked run more than once.
The "Tips & Tricks" in the ELKI DBID documentation probably answer your underlying question - how to use map DBIDs to line numbers of your input file. The best way is to if you want to have object identifiers, assign object identifiers yourself by using an identifier column (and configuring it to be an external identifier).
For further information, see the documentation: https://elki-project.github.io/dev/dbids
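As a small illustration of that identifier-column approach (a sketch only; the exact parser/filter options for declaring an external identifier are described in the DBID documentation linked above):

# input CSV: first column is a user-assigned identifier, the rest is the vector
obj001 0.1799 0.2761
obj002 0.1797 0.2778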

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on the migration of 3.0 code into the new 4.2 framework. I am facing a few difficulties:
How to do CDR level deduplication in new 4.2 framework? (Note: Table deduplication is already done).
Where to implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
My file is not moving into the archive. A temporary output file is generated, but it is empty and sits outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it gets stuck. I have printed the TableRowGenerator stream to the logs (at the end of DataProcessor).
1. and 2.:
You need to select the type of deduplication. It does not make a big difference whether you choose table- or cdr-level deduplication.
The ite.businessLogic.transformation.outputType parameter affects this. There is only one dedup; you cannot have both.
Select recordStream for cdr-level deduplication, and do the transformation to table row format (e.g. if you would like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate handling: reject (discard) tuples, set special column values, or write them to different target tables. A sketch of such duplicate handling follows below.
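A hedged SPL sketch of tuple-level duplicate handling inside a Custom operator (the stream and attribute names are illustrative; the real PostContextDataProcessor composite in TEDA has its own signature). Note the punctuation forwarding, which also matters for point 3 below:

// Illustrative only: reject tuples whose hash was already seen
stream<rstring hashCode, rstring payload> Deduped = Custom(Records as I) {
    logic
        state : { mutable set<rstring> seenHashes = {}; }
        onTuple I : {
            if (I.hashCode in seenHashes) {
                // duplicate: discard (do not submit)
            } else {
                insertM(seenHashes, I.hashCode);
                submit(I, Deduped);
            }
        }
        // forward window punctuations so downstream operators keep working
        onPunct I : {
            if (currentPunct() == Sys.WindowMarker) {
                submit(Sys.WindowMarker, Deduped);
            }
        }
}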
3.:
Possible reasons could be:
missing forwarding of window punctuations or of the statistic tuple
an error in the BloomFilter configuration; you would see this easily because the PE is down and the error log gives hints about wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. Debug sinks write punctuation markers to the debug files as well.

JMeter: set property for each loop

I'm trying to create a test that will loop depending on the number of files stored in one folder, then output results based on their filenames. I'm thinking of using each file's name as the name of its result, so for this I created something like this in a BeanShell PreProcessor:
props.setProperty("filename", vars.get("current_tc"));
Then use it for the name of the result:
C:\\TEST\\Results\\${__property(filename)}
"current_tc" is the output variable name of a ForEach controller. It returns different value on each loop. e.g loop1 = test1.csv, loop2 = test2.csv ...
I'm expecting that the result name will be test1.csv, test2.csv .... but the actual result is just test1.csv and the result of the other file is also in there. I'm new to Jmeter. Please tell me if I'm doing an obvious mistake.
Test Plan Image
The way of setting the property seems okay-ish; the question is where and how you are trying to use this C:\\TEST\\Results\\${__property(filename)} line, so a snapshot of your test plan would be very useful.
In the meantime I would recommend the following:
Check the jmeter.log file for any suspicious entries; if something goes wrong, most probably you will be able to figure out the reason using this file. Normally it is located in JMeter's "bin" folder.
Use a Debug Sampler and View Results Tree listener combination to check your ${current_tc} variable value; maybe it is a case of the variable not being incremented. See the How to Debug your Apache JMeter Script article to learn more about troubleshooting techniques.
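On the property itself, here is a JSR223/Groovy equivalent of the BeanShell line above (Groovy is generally recommended over BeanShell for performance). One caveat, stated as an assumption about your setup: many listeners evaluate their Filename field only once, at test start, which would explain why all results land in test1.csv no matter how the property changes afterwards:

// JSR223 PreProcessor (language: groovy)
// copy the ForEach controller's current value into a JMeter property
props.put('filename', vars.get('current_tc'))

// referenced elsewhere, e.g. in a listener's Filename field:
// C:\TEST\Results\${__property(filename)}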