Data Factory copy based on the last high-water-mark value (dynamic date) - MySQL

I'm currently working on a project where I need the Data Factory pipeline to copy data based on the last run date.
The process breakdown:
Data is ingested into a storage account.
The ingested data lands in the directory format topic/yyyy/mm/dd, i.e., multiple files arrive in a single directory, so the files are partitioned by year, month, and day.
The process currently filters on the last high-water-mark (HWM) date, which is updated each time the pipeline runs (it triggers daily at 4 AM). Once the copy succeeds, a Set Variable activity increases the HWM value by 1 (i.e., one day). However, files are not brought over on weekends, and this is the problem:
the HWM date will not increase if no files are brought over, so the pipeline keeps looping over the same date.
How do I get the pipeline to advance to, or look for, the next file in that directory, given that I use the HWM value as the directory path to the file, copy, and update the HWM value only when the copy completes? Current update logic:
Current Lookup of the HWM value and the directory path used to copy the files:

Instead of adding 1 to the last high-water-mark value, you can update the watermark to the current UTC date. That way, even on days when the pipeline brings no files over, data will still be copied to the correct destination folder on the next run. I have tried to repro this in my environment, and below is the approach.
A watermark table is created initially with the watermark value '1970-01-01'.
This table is referenced in the Lookup activity.
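For reference, a minimal sketch of the watermark table this approach assumes; only the table and column names come from the queries below, while the data types and seeding statement are illustrative:
create table watermark_table (
    tab_name        varchar(255) primary key,
    watermark_value datetime not null
);
insert into watermark_table (tab_name, watermark_value)
values ('tab1', '1970-01-01');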
A Copy Data activity is added, and in its source, the query is given as (note the @{...} string interpolation of the Lookup output):
select * from tab1 where lastmodified > '@{activity('Lookup1').output.firstRow.watermark_value}'
In the sink, Blob storage is used. In order to get the folder structure year/month/day, the following expression is given in the folder path (month must be uppercase 'MM'; lowercase 'mm' would return minutes):
@concat(formatDateTime(utcnow(),'yyyy'),'/',formatDateTime(utcnow(),'MM'),'/',formatDateTime(utcnow(),'dd'))
The file is copied to the resulting path.
Once the file is copied, the watermark value is updated to the current UTC date:
update watermark_table
set watermark_value = '@{formatDateTime(utcnow(),'yyyy-MM-dd')}'
where tab_name = 'tab1'
When the pipeline is triggered the next day, data is copied starting from the stored watermark value, and once the copy completes, the current UTC date is again written as the watermark value.

Reading the post a couple of times, what I understood is:
You already have watermark logic.
On weekends, when there are NO files in the folder, the current logic does NOT increment the watermark, and so you are facing issues.
If I understand the ask correctly, please use the dayOfWeek() function: add an If Condition activity and let the current logic execute only when the day of the week is Monday (2) through Friday (6).
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#dayofweek
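As a minimal sketch of that check, assuming it is written in the pipeline expression language rather than in a data flow (the pipeline function dayOfWeek() numbers Sunday as 0 and Saturday as 6, unlike the data flow function linked above), the If Condition expression could be:
@and(greaterOrEquals(dayOfWeek(utcnow()), 1), lessOrEquals(dayOfWeek(utcnow()), 5))
This evaluates to true for Monday (1) through Friday (5), so the existing copy-and-update logic goes in the True branch.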

Related

Identifying AEM Content Fragments related to a particular field value updated in a csv

When a Content Fragment is updated by drawing values from a linked CSV file, is there a way to identify only those CFs that are related to the particular field value updated in the CSV file?
Suppose we only update the income for Year1 in a sample CSV file (row: Income, column: Year1). This CSV may be linked to a number of CFs, but I only want to find the CFs containing the particular field value that was updated, that is, the income for Year1.
I initially thought this would trigger an event for the updated CF, as it would for any other update we make to CFs directly, but in this case no event is triggered, possibly because the values are drawn from an updated CSV file.
What would be a possible solution to this problem?
There is more than one way. In general, it depends on how you upload the CSV file into AEM.
You can implement an event listener to monitor changes to the CSV file. When it is updated, you can take the path from the event data, parse the contents, and update the CFs.
If you are updating the CSV file via the Sling POST servlet, you can implement org.apache.sling.servlets.post.PostOperation and put the logic in it.
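As a rough sketch of the event listener option, using the Sling ResourceChangeListener API (the watched path, class name, and diffing logic here are all illustrative assumptions):
import java.util.List;
import org.apache.sling.api.resource.observation.ResourceChange;
import org.apache.sling.api.resource.observation.ResourceChangeListener;
import org.osgi.service.component.annotations.Component;

// Watches a (hypothetical) DAM folder for changes to the linked CSV file.
@Component(
    service = ResourceChangeListener.class,
    property = {
        ResourceChangeListener.PATHS + "=/content/dam/myproject",
        ResourceChangeListener.CHANGES + "=ADDED",
        ResourceChangeListener.CHANGES + "=CHANGED"
    })
public class CsvUpdateListener implements ResourceChangeListener {

    @Override
    public void onChange(List<ResourceChange> changes) {
        for (ResourceChange change : changes) {
            if (change.getPath().endsWith(".csv")) {
                // Take the path from the event data, parse the CSV,
                // diff it against the previously stored version, and
                // update only the CFs referencing the changed cells.
            }
        }
    }
}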

SSIS Slowly changing dimension column

I'm using a Slowly Changing Dimension in SSIS, with a single column called active, of type BIT, to mark the latest records instead of start-date and end-date columns.
My problem is the following: I want to set the active value to 0 for records that are no longer present in the source file.
For example, imagine my DWH is empty and the source file contains the following data (salary is the historization attribute):
employee_ID|NAME|salary
117|a|100
125|b|150
378|c|200
Now, once I load those into my DWH, I get the following data:
employee_code|employee_ID|NAME|salary|active
1|117|a|100|1
2|125|b|150|1
3|378|c|200|1
Everything is good so far, but now imagine I get a new source file where the data looks like this:
employee_ID|NAME|salary
117|a|120
125|b|150
When I load this data into the data warehouse, I get the following:
employee_code|employee_ID|NAME|salary|active
1|117|a|100|0
2|125|b|150|1
3|378|c|200|1
4|117|a|120|1
Everything makes sense: employee a's salary has changed, so a new record is added to the DWH and the old record's active value is set to 0. Employee b's salary stayed the same, so there is no need to add a new record. However, employee c no longer exists in the source file (he quit or got fired). I want to know if there is a way to set the active value to 0 in such a situation.
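Since the SCD component only sees rows that arrive from the source, one common way to handle this is an Execute SQL Task after the data flow that deactivates dimension rows missing from the current file. A minimal sketch, assuming the file is first staged in a table; the names dim_employee and stg_employee are illustrative:
update dim_employee
set active = 0
where active = 1
  and not exists (
      select 1
      from stg_employee s
      where s.employee_ID = dim_employee.employee_ID);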

QlikView - Loading specific files from remote server

I've been trying to solve this problem for a long time, but now I have to ask for your help.
I have one QVD file on my local PC, named e.g. server001_CPU.qvd, and on a remote server I have a shared folder with many files of many types. Among them are files named server001_CPU_YYYYMMDD.csv (e.g. server001_CPU_20140806.csv) that are generated every day and have the same structure as the local QVD file. They have a column DATE. What I need (in the load script) is to check the last DATE in the local file, load the remote files from that day up to today, and then concatenate it all together. Something like this:
CPU:
LOAD * FROM server001_CPU.qvd
LET vMAX = Max(DATE) FROM CPU
DO WHILE vMAX <= Today()
CPU:
LOAD * FROM serverpath/server001_CPU_$(vMAX).csv
LOOP
I'm really trying, but I'm new to QV and its logic is still strange to me. Thanks in advance for any help.
You can try the script snippet below, which should do what you need.
What it does is first open your existing data set (the QVD), then find the maximum date and store it in the table MaxCPUDate. This maximum value is then read into a variable and the table is dropped.
This "Max Date" value is then subtracted from today's date to determine the number of loop iterations needed to load the individual files. The loop variable is added to the "Max Date" value to build the filename to load.
CPU:
LOAD
    *
FROM server001_CPU.qvd (qvd);

MaxCPUDate:
LOAD DISTINCT
    max(DATE) as MaxDate
RESIDENT CPU;

LET vMaxCPUDate = peek('MaxDate', -1, 'MaxCPUDate');
DROP TABLE MaxCPUDate;

FOR vFileNum = 0 TO (num(Today()) - $(vMaxCPUDate))
    LET Filename = 'serverpath/server001_CPU_' & date($(vMaxCPUDate) + $(vFileNum), 'YYYYMMDD') & '.csv';
    CONCATENATE (CPU)
    LOAD
        *
    FROM $(Filename) (txt, codepage is 1252, embedded labels, delimiter is ',', msq);
NEXT
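One small note on the snippet above: the FROM clause expands the variable without quotes, which works as long as the path contains no spaces. If that assumption may not hold, wrapping the expansion in square brackets is safer:
FROM [$(Filename)] (txt, codepage is 1252, embedded labels, delimiter is ',', msq);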

How to select Multiple CSV files based on date and load into table

I receive input files daily in a folder called INPUTFILES. These files have a filename that includes a datetime.
My package has been scheduled to run every day. If I receive 2 files for the day, I need to fetch those 2 files and load them into the table.
For example, I had these files in my folder:
test20120508_122334.csv
test20120608_122455.csv
test20120608_014455.csv
Now I need to process the files test20120608_122455.csv and test20120608_014455.csv for the same day.
I solved the issue. I took one variable which checks whether a file exists for that particular day.
If a file exists for that particular day, the variable's value is set to 1.
A Foreach Loop Container has been used, placed inside a For Loop Container driven by this file-exists variable.
For Loop properties:
EvalExpression: @fileexists == 1
If no file exists for that particular day, the expression is false and the loop does not run.
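For reference, here is a minimal sketch of an SSIS expression that builds a file mask for the current day, matching the naming pattern in the example above (testYYYYMMDD_HHMMSS.csv); it could be assigned, via a property expression, to the Foreach Loop enumerator's FileSpec so that only that day's files are enumerated:
"test" + (DT_WSTR, 4) YEAR(GETDATE())
       + RIGHT("0" + (DT_WSTR, 2) MONTH(GETDATE()), 2)
       + RIGHT("0" + (DT_WSTR, 2) DAY(GETDATE()), 2)
       + "_*.csv"
For 8 June 2012 this evaluates to test20120608_*.csv.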

How to specify the report pdf name at run time?

I have a report named "Debt Report". It runs every month, and a PDF is generated on the first of the month via the subscription option.
If I run the report for April, the name of the PDF should be "Debt Report for April"; likewise, if I run it for May, the name of the PDF should be "Debt Report for May".
How can I do this?
Assuming you are scheduling the report to a file share, you can set the file name to Debt Report for @timestamp; this will name the file in the format Debt Report for YYYY_MM_DD_HRMINSS.
If you only want the month name (not the entire timestamp) to appear in the filename, you will need to use a Data-Driven Subscription.
Another option, although a bit more technical, is to use the rs.exe utility to generate the report. This involves:
creating a script file that generates the report (this is where you can set the filename to your preference)
creating a batch file that calls rs.exe with the script file as a parameter
running the batch file on a schedule, e.g., with Windows Task Scheduler or SQL Server Agent
There is an example of how to do this here (it creates Excel files, but the principle is the same): http://skamie.wordpress.com/2010/08/11/using-rs-exe-to-render-ssrs-reports/
The solution for this problem is a Data-Driven Subscription:
http://msdn.microsoft.com/en-us/library/ms169972(v=sql.105).aspx
http://www.kodyaz.com/reporting-services/create-data-driven-subscription-in-sql-server.aspx
The following link helped me a lot, but the query given in it causes trouble: cast the datatype of getdate() and that will solve the problem.
http://social.msdn.microsoft.com/Forums/en/sqlreportingservices/thread/0f075d9b-52f5-4a92-8570-43bbdaf2b2b1
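As a minimal sketch, the query behind such a data-driven subscription only needs to return the desired file name; the column alias FileName here is illustrative, and you map it to the subscription's file name field in the wizard:
select 'Debt Report for ' + datename(month, getdate()) as FileName
For a run in April this returns Debt Report for April.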
I have had to do the same thing (well, almost).
I had to generate a weekly report to file and save it as REPORT-Week01.pdf, then REPORT-Week02.pdf, etc.
The mechanism I used was to change the parameter column in the Schedule table via a scheduled job, which computed the required file name and simply replaced it. Then, when the scheduled report runs, it writes to the file name set up when the schedule was created (except that the name was changed at 1 minute past midnight to what I wanted it to be).
I have since implemented another set of reports that write to a folder which changes each month to the next month's folder name (currently all reports write to a folder called 202103). Tonight the job will run, the output folder will change to 202104, and the scheduled jobs will never need changing.