How to select multiple CSV files based on date and load them into a table - SSIS

I receive input files daily in a folder called INPUTFILES. The filenames include a datetime stamp.
My package is scheduled to run every day. If I receive two files for a given day, I need to pick up both files and load them into the table.
For example, these files were in my folder:
test20120508_122334.csv
test20120608_122455.csv
test20120608_014455.csv
Today's run needs to pick up test20120608_122455.csv and test20120608_014455.csv, since both are for the same day.

I solved the issue. I created a variable that checks whether a file exists for that particular day. If a file exists for the day, the variable is set to 1.
I then placed the Foreach Loop Container inside a For Loop Container and used the file-exists variable in the For Loop properties:
EvalExpression: @fileexists == 1
If no file exists for that particular day, the loop body simply does not execute.
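As a sketch of the pattern (the property values here are illustrative, not taken from the original package), the For Loop Container needs only the EvalExpression above, and the Foreach File Enumerator's FileSpec can be pointed at just today's files with an expression that builds a mask such as test20120608*.csv:

    "test" + (DT_WSTR, 4) YEAR(GETDATE())
           + RIGHT("0" + (DT_WSTR, 2) MONTH(GETDATE()), 2)
           + RIGHT("0" + (DT_WSTR, 2) DAY(GETDATE()), 2)
           + "*.csv"

With that mask in place, the Foreach loop enumerates exactly the files for the current day, however many arrive.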


Data Factory copy based on last high water mark value (dynamic date)

I'm currently working on a project where I need the Data Factory pipeline to copy data based on the last run date.
The process breakdown:
Data is ingested into a storage account.
The ingested data lands in the directory format topic/yyyy/mm/dd, i.e., multiple files arrive under a single topic directory, so the files are partitioned by year, month, and day.
The process currently filters on the last high-water-mark date, which is updated each time the pipeline runs (it triggers daily at 4 AM). Once the copy succeeds, a Set Variable activity increases the high-water-mark value by one day. However, no files are delivered on weekends, and that is the problem.
The date value (HWM) does not increase if no files are brought over, so the pipeline keeps looping over the same date.
How do I get the pipeline to advance, or to look for the next file in that directory, given that I use the HWM as the directory path for the copy, and update the HWM value only when the copy completes?
(Screenshots in the original post showed the current update logic and the HWM lookup with the directory path used for the copy.)
Instead of adding 1 to the last high-water-mark value, we can update the watermark to the current UTC date. That way, even on days when the pipeline copies nothing, data is still copied to the correct destination folder on the next run. I have tried to reproduce this in my environment, and below is the approach.
A watermark table is created initially with the watermark value '1970-01-01'.
This table is referenced in a Lookup activity.
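For reference, a minimal sketch of such a watermark table, assuming the table and column names used in the queries later in this answer:

    -- One row per source table; the watermark starts at '1970-01-01'
    CREATE TABLE watermark_table (
        tab_name        varchar(100),
        watermark_value varchar(50)
    );

    INSERT INTO watermark_table (tab_name, watermark_value)
    VALUES ('tab1', '1970-01-01');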
A Copy Data activity is added, and its source query is:
select * from tab1 where lastmodified > '@{activity('Lookup1').output.firstRow.watermark_value}'
In the sink, Blob Storage is used. To get a year/month/day folder structure, the folder path is set to the expression:
@concat(formatDateTime(utcnow(),'yyyy'),'/',formatDateTime(utcnow(),'MM'),'/',formatDateTime(utcnow(),'dd'))
(Note that the month format specifier is 'MM'; lowercase 'mm' means minutes.)
The file is then copied to that path.
Once the file is copied, the watermark value is updated to the current UTC date:
update watermark_table
set watermark_value = '@{formatDateTime(utcnow(),'yyyy-MM-dd')}'
where tab_name = 'tab1'
When the pipeline is triggered the next day, data is copied starting from the watermark value, and once the copy completes, the current UTC date is again written back as the watermark.
I think, reading the post a couple of times, what I understood is:
You already have watermark logic.
On weekends, when there are no files in the folder, the current logic does not increment the watermark, and that is why you are facing issues.
If I understand the ask correctly, use the dayOfWeek() function. Add an If Condition activity and let the current logic execute only when the day of the week is Monday (2) through Friday (6).
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#dayofweek
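As a sketch, the If Condition expression might look like the following; note the assumption that the pipeline expression language's dayOfWeek() numbers Sunday as 0, so Monday through Friday is 1-5 (the data flow function linked above numbers them 2-6 instead):

    @and(greaterOrEquals(dayOfWeek(utcnow()), 1), lessOrEquals(dayOfWeek(utcnow()), 5))

The existing lookup/copy/update logic then goes in the True branch, so the watermark only advances on weekdays.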

Load multiple files one by one in SSIS

I'm using the Foreach Loop Container in my SSIS job, and I have a folder where we add new files every day, named with a name plus a date.
I need the package to process the oldest file first, then load the second one, and so on.
Can we change this bulk behaviour in SSIS jobs?
FYI: I used variables (FolderPath and FolderName) to load them, but the files are processed in no particular order; I'd prefer a solution other than using a script.
Your question is confusing, but I'll try to answer. If you don't want to use a Script Component, I suggest:
1. Using a Foreach Loop Container (folder as source), read the file names into a two-column table (FileName (full path), and the date part parsed from the file name).
2. Using another Foreach Loop Container (ADO object as source), select from that table in the desired order to fill the ADO object (see the sketch below).
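A minimal sketch of the staging table and the ordered query that feeds the second loop (the table and column names are illustrative, not from the original post):

    -- Filled by the first Foreach loop, one row per file
    CREATE TABLE dbo.FileQueue (
        FullPath nvarchar(500),   -- full path of the file
        FileDate datetime         -- date parsed from the file name
    );

    -- Source query for the second loop's ADO recordset: oldest file first
    SELECT FullPath
    FROM dbo.FileQueue
    ORDER BY FileDate ASC;

The second loop then shreds the ordered recordset and processes one file per iteration.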

How do you get the flat file name and perform a row count from multiple flat files with different columns in SSIS?

I'm trying to get all the file names from a folder, along with their row counts (and file size in bytes, if possible). I am using Microsoft Visual Studio 2010 Shell. Here's what I've done so far:
I created a Foreach Loop Container, set the enumerator to Foreach File Enumerator, and set an expression mapping a variable to the folder I want to loop over. I left the Files mask as *.* and chose to retrieve Name Only. In Variable Mappings I added a new variable called FullFilePath (container: Package, value type: String, value left blank).
I then added a Data Flow Task to the loop, with a Flat File Source, a Row Count transformation, and an OLE DB Destination. I set an expression on the Flat File Source's connection to the same folder variable used in the Foreach Loop Container's expression. I mapped the variable RecordCount (Int32, value 0) to the Row Count transformation. The OLE DB Destination creates a new table named OLE DB Destination.
The next step is an Execute SQL Task that runs INSERT INTO dbo.FileData (FileName, RowCount) VALUES (?, ?), with two parameter mappings: 1) the FullFilePath variable from the Foreach Loop Container, data type VarChar; 2) the RecordCount variable from the Row Count transformation, data type Long.
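For reference, a minimal sketch of the logging table and the parameterized statement, assuming the names given in the post:

    CREATE TABLE dbo.FileData (
        FileName varchar(500),
        RowCount bigint
    );

    -- Execute SQL Task statement; the ? placeholders map to
    -- User::FullFilePath and User::RecordCount, in that order
    INSERT INTO dbo.FileData (FileName, RowCount) VALUES (?, ?);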
I then have another Execute SQL Task that drops the table created by the Data Flow Task. The problem is that with all these steps, the package still does not complete. It gets hung up and fails during pre-execute with:
Warning: Access is denied. Error: Cannot open the datafile 'FullFilePath' Error: Flat File Source failed the pre-execute phase and returned error code 0xC020200E.
Is there anything you can see that I might be doing wrong? Let me know if pictures would help.
So I finally figured this out. To loop over all of the files with varying headers and column counts, I unticked the "file contains headers" option in the Flat File Source. That way every file exposes the same first column, which by default is Column 0 (the first column in all of my files is some sort of numeric field or ID). I was able to map this column through the Row Count transformation and insert it into a SQL table. Then I finished the Foreach Loop by writing the file name and row count into another SQL table to record the counts.
It is, however, taking a really long time: it has been running for over 14 hours and has only counted through 13 files. Granted, some files are 250K+ rows, but I wouldn't think it would take this long.

QlikView - Loading specific files from remote server

I've been trying to solve this problem for a long time, but now I have to ask for your help.
I have a QVD file on my local PC named e.g. server001_CPU.qvd, and on remote servers I have a shared folder with many files of many types. Among them are files named server001_CPU_YYYYMMDD.csv (e.g. server001_CPU_20140806.csv) that are generated every day and have the same structure as the local QVD file. They have a DATE column. What I need (in the load script) is to check the last DATE in the local file, load the remote files from that day through today, and concatenate them together. Something like this:
CPU:
LOAD * FROM server001_CPU.qvd;

LET vMAX = Max(DATE) FROM CPU;  // pseudocode: this line shows the intent, not valid syntax

DO WHILE vMAX <= Today()
    CPU:
    LOAD * FROM serverpath/server001_CPU_$(vMAX).csv;
LOOP
I'm really trying, but I'm new to QlikView and its logic still feels strange to me. Thanks in advance for any help.
You can try the script snippet below, which should do what you need.
It first opens your existing data set (the QVD) and finds the maximum date, storing it in the table MaxCPUDate. That maximum value is then read into a variable and the table is dropped.
The max-date value is then subtracted from today's date to determine the number of loop iterations needed to load the individual files. On each iteration, the loop counter is added to the max-date value to build the filename to load.
CPU:
LOAD
    *
FROM server001_CPU.qvd (qvd);

// Find the most recent DATE already present in the local data
MaxCPUDate:
LOAD DISTINCT
    max(DATE) as MaxDate
RESIDENT CPU;

// num() keeps the numeric date serial so the $() arithmetic below works
LET vMaxCPUDate = num(peek('MaxDate', -1, 'MaxCPUDate'));

DROP TABLE MaxCPUDate;

// One iteration per day, from the max date through today
FOR vFileNum = 0 TO (num(Today()) - $(vMaxCPUDate))

    LET Filename = 'serverpath/server001_CPU_' & date($(vMaxCPUDate) + $(vFileNum), 'YYYYMMDD') & '.csv';

    CONCATENATE (CPU)
    LOAD
        *
    FROM $(Filename) (txt, codepage is 1252, embedded labels, delimiter is ',', msq);

NEXT

How to create an SSIS package which creates three text files, using the same variables, where each file is only created when matching data is found?

There are only three files that can be created: "File_1", "File_2" and "File_3". The same variable names are used in each instance (User::FileDirectory and User::File_name), but because the actual value of the variable changes, a new file is created each time. However, a file is only created if there is data to go into it; i.e., if there are no records to populate the file, it is not created at all. When a file is created, the creation date should be added to the filename, e.g. File1_22102011.txt.
OK, if the above was a little confusing, here is how it works.
All the files use the same variable, but it is reset before each file is created.
• First it populates an in-memory result set with the first SQL selection (ID number, First_Name and Main_Name) and sets the file variable to "File_1". If there are records in the result set, it creates and writes to this filename.
• Then it creates a new result set with the second selection (Contract No) and sets the variable to "File_2". If there are records in this new result set, a new file is created from the variable (which now has a new value).
• Finally a third result set is created (Contract_no, ExperianNo, Entity_ID_Number, First_Name, Main_Name), and the file variable is set to "File_3". Again, if there are records in the result set, this file is created and written to.
I have worked on a few methods to achieve this, but they have all failed, so a little help would be greatly appreciated.
While what you have works, I think it'd be rather painful to maintain.
I would approach it as three Sequence Containers running in parallel. Each container would have a Data Flow Task and two file tasks hanging off it, driven by the success of the parent and the value of a row-count variable: if the row-count variable is 0, delete the file; if it's greater than 0, rename it to File_n.
For the first file, the data flow writes an output file a.txt. Based on the value of the variable @RowCount1, the package either deletes the empty file or renames it to File_1; the other two containers follow the same pattern.
Each data flow would consist of a source query, a Row Count transformation, and a file destination with a temporary name (a.txt, b.txt, c.txt). Since a file is always created, even when it's empty, we need to delete or rename it afterwards, which is what the file-operation tasks do. A sketch of the expressions involved follows below.
In my opinion, this approach is cleaner because it lets you test and debug each item independently rather than dealing with an in-memory dataset.
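A sketch of the expressions involved, assuming variable names such as User::RowCount1 and User::FileDirectory (illustrative, not taken from the original package):

Precedence constraint to the delete task (Expression and Constraint, Success):
    @[User::RowCount1] == 0

Precedence constraint to the rename task (Expression and Constraint, Success):
    @[User::RowCount1] > 0

Destination name for the rename, producing e.g. File1_22102011.txt:
    @[User::FileDirectory] + "File1_"
        + RIGHT("0" + (DT_WSTR, 2) DAY(GETDATE()), 2)
        + RIGHT("0" + (DT_WSTR, 2) MONTH(GETDATE()), 2)
        + (DT_WSTR, 4) YEAR(GETDATE()) + ".txt"

The same pattern repeats for File_2/b.txt and File_3/c.txt with their own row-count variables.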