NiFi - QueryDatabaseTable processor: how to query rows that have been modified? - mysql

I am working on a NiFi data flow where my use case is to fetch MySQL table data and put it into HDFS or the local file system.
I have built a pipeline: QueryDatabaseTable -> ConvertRecord -> PutFile.
My table schema: id, name, city, Created_date
Files arrive at the destination whenever I insert new records into the table.
However, when I update existing rows, the processor does not fetch those records. It looks like it has some limitation.
My question: how do I handle this scenario? Do I need a different processor, or do I need to change some property?
Could someone please help?
#Bryan Bende

The QueryDatabaseTable processor needs to be told which columns it can use to identify new data.
A serial id or a created timestamp is not sufficient, because neither value changes when an existing row is updated.
From the documentation:
Maximum-value Columns:
A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column's values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.
Judging by the table schema, there is no way to tell from SQL whether a row was updated.
There are many ways to solve this. In your case, the easiest might be to rename the Created_date column to a modified column and set it to now() on updates,
or to add a second timestamp column.
So, for instance,
| stamp_updated | timestamp | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
is the new column. In the processor, use the stamp_updated column to identify new data.
Don't forget to set Maximum-value Columns to that column.
So what I am basically saying is:
if you cannot tell from SQL that a record is new or changed, NiFi cannot either.
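To make the idea concrete, here is a minimal sketch of incremental fetching driven by a maximum-value column. It is not NiFi's implementation. It uses an in-memory SQLite table as a stand-in for MySQL, a hypothetical table name `people`, and plain integers in place of real timestamps, which the demo sets by hand because SQLite has no ON UPDATE CURRENT_TIMESTAMP.

```python
import sqlite3


def fetch_changed(conn, last_max):
    """Return rows whose stamp_updated exceeds last_max, plus the new maximum."""
    rows = conn.execute(
        "SELECT id, name, city, stamp_updated FROM people "
        "WHERE stamp_updated > ? ORDER BY stamp_updated",
        (last_max,),
    ).fetchall()
    # Remember the highest value seen so the next call skips these rows.
    new_max = rows[-1][3] if rows else last_max
    return rows, new_max


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people ("
        "id INTEGER PRIMARY KEY, name TEXT, city TEXT, stamp_updated INTEGER)"
    )
    # Assumption: in MySQL the column would maintain itself on every update;
    # here the demo assigns the values 100, 101, 102 manually.
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [(1, "Ann", "Pune", 100), (2, "Bob", "Oslo", 101)],
    )
    first, last_max = fetch_changed(conn, 0)

    # An update bumps stamp_updated, so the row is picked up again.
    conn.execute("UPDATE people SET city = 'Rome', stamp_updated = 102 WHERE id = 1")
    second, last_max = fetch_changed(conn, last_max)
    return first, second, last_max
```

With only an id or a creation date as the tracked column, the second call would return nothing, which is the behaviour described in the question.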

Related

SSIS Getting the most up to date record using datetime columns

I am building an SSIS package for SQL Server 2014 and am trying to get the most recent record from two different sources by comparing datetime columns between them. So far I am using a Lookup on thirdpartyid to match the records I eventually need to compare, and a Merge Join to bring them together, with the goal of a staging table that holds the most recent record. A previous data flow task (not shown) already inserts records that are not in AD1 into a staging table, so at this point the records are a one-to-one match. Both sources look like this, with exactly the same datetime columns but different dates, and some values are null where there is no history.
Sample output
This is my data flow task so far. I am really new to SSIS so any ideas or suggestions would be greatly appreciated.
Given that there is a 1:1 match between your two sources, I would structure this as a Source (V1) -> Lookup (AD1).
Define your lookup based on thirdpartyid and retrieve all the AD1 columns. You'll likely end up with data flow columns like name and name_ad1, etc.
I would then add a Derived Column that identifies if the dates are different (assuming in that situation, you need to take action)
IsNull(LastUpdated_AD1) || LastUpdated > LastUpdated_AD1
That expression would create a boolean if the column in AD1 is null or if the V1 last updated column is greater than the AD1 version.
You'd then likely add a Conditional Split to your Data Flow, base it on the value of the new column, and route the changed data into your mechanism for handling updates (an OLE DB Command or, preferably, an OLE DB Destination plus an Execute SQL Task after the Data Flow to perform a batch update).
The comment asks
should it all be one expression? like IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1 || IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
You can do it like that, but when you get a weird result and someone asks how you got that value, long expressions make it hard to debug. I'd likely have two Derived Column components in the data flow. The first would create a "has changed" column for each set of conditions:
HasChangedAssignment
(IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1)
HasChangedRoom
IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
etc
And then in the final derived column, you create the HasChanged column
HasChangedAssignment || HasChangedRoom || HasChangedAdNauseum
Using a pattern-based approach like this makes it much easier to build, troubleshoot, and make small changes that can have a big impact on the correctness, maintainability and performance of your packages.
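The two-stage pattern above can be sketched outside SSIS as follows. This is an illustration in Python, not SSIS expression syntax. The function names are hypothetical, and treating a null source timestamp as "no change" is an assumption of the sketch rather than something stated in the answer.

```python
from datetime import datetime
from typing import Dict, List, Optional


def has_changed(source: Optional[datetime], target: Optional[datetime]) -> bool:
    """Mirror IsNull(X_AD1) || X > X_AD1 for one pair of timestamps."""
    if target is None:
        return True
    # Assumption: a missing source timestamp counts as "no change".
    return source is not None and source > target


def row_has_changed(row: Dict[str, Optional[datetime]], fields: List[str]) -> Dict[str, bool]:
    """First stage: one flag per field. Second stage: OR them together."""
    flags = {
        f"HasChanged{name}": has_changed(
            row.get(f"{name}LastUpdated"), row.get(f"{name}LastUpdated_AD1")
        )
        for name in fields
    }
    flags["HasChanged"] = any(flags.values())
    return flags
```

Keeping the per-field flags in the result is what makes a surprising HasChanged value easy to trace back to the field that caused it.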

Deduplication of records without sorting in a mainframe sequential dataset with already sorted data

This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record of each key and drop the remaining 'duplicates'. so the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use SORT utility with SUM FIELDS=NONE or ICETOOL with SELECT - FIRST operand since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900 VB) and a job actually ran out of sort work space trying to sort it in one go.
My query is: Is there any option available in JCL based utilities to do this deduplication without resorting and using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested). Because the dataset is VB, positions 1-4 hold the RDW, so the key starts at position 5:
OPTION COPY
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)
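The intent of a copy-only pass like this is to read the sorted input once and keep the first record of each key, with no sort work space. The sketch below shows that single-pass logic in Python purely as an illustration. It is not a substitute for the utility at this data volume, and the function name and the fixed 4-character key are assumptions taken from the sample data.

```python
from typing import Iterable, Iterator


def first_per_key(records: Iterable[str], key_len: int = 4) -> Iterator[str]:
    """Yield the first record of each run of equal keys in already sorted input."""
    previous_key = None
    for record in records:
        key = record[:key_len]
        # A key change marks the start of a new group, so keep that record.
        if key != previous_key:
            yield record
            previous_key = key
```

Only the previous key is held in memory, so the cost is one sequential read regardless of how many records share a key.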

How to skip irregular header information of a Flat File in SSIS?

I have a file like the one below (just an example):
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip the header information and load the flat file into the DB,
like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary from file to file (it is not a fixed number of rows).
Any help is appreciated. Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this e.g. in an SSIS Script task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
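The pre-processing step described above could look roughly like this. An SSIS Script Task would normally be written in C# or VB.NET, so this Python version only illustrates the logic. The function names are hypothetical, and matching the header on its whitespace-separated tokens is an assumption about the file layout.

```python
from typing import Iterable, Iterator


def strip_preamble(lines: Iterable[str], header: str = "ID Name Address") -> Iterator[str]:
    """Drop every line before the header row; yield the header and what follows."""
    wanted = header.split()
    found = False
    for line in lines:
        # Compare tokens so that spacing differences do not hide the header.
        if not found and line.split() == wanted:
            found = True
        if found:
            yield line


def clean_file(src_path: str, dst_path: str, header: str = "ID Name Address") -> None:
    """Write the cleaned copy that the Flat File Source would then read."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_preamble(src, header))
```

If the header row never appears, the output file is empty, which makes a malformed input easy to detect before the load runs.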
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate it with numbers. This is not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading a few rows each time and they always start at 1, you could put the numbers 1 - 100 into a table and throw away any rows that don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have data you care about in (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.
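A small sketch of the staging-table idea follows, using an in-memory SQLite database as a stand-in for the real server. The table and column names are hypothetical, the whitespace split assumes the layout of the sample file, and the GLOB filter is SQLite syntax, so the numeric check would have to be written differently on SQL Server.

```python
import sqlite3
from typing import Iterable, List, Tuple


def load_via_staging(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Stage every line as text, then select only rows whose first column is numeric."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (col1 TEXT, col2 TEXT, col3 TEXT)")
    for line in lines:
        # Keep only the first three columns; pad short lines with empty strings.
        parts = (line.split() + ["", "", ""])[:3]
        conn.execute("INSERT INTO staging VALUES (?, ?, ?)", parts)
    # Rows such as the free-text preamble and the header fail the digits-only test.
    return conn.execute(
        "SELECT CAST(col1 AS INTEGER) AS id, col2 AS name, col3 AS address "
        "FROM staging "
        "WHERE col1 <> '' AND col1 NOT GLOB '*[^0-9]*' "
        "ORDER BY rowid"
    ).fetchall()
```

The conversion and the column renaming both happen in the final SELECT, which is the part the answer suggests doing in SQL rather than in the data flow.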

Adding Column Flags when Replacing Data Table

I have a base data table on SQL Server 2012 that is replaced every night with a new data set with identical columns. Once the table is replaced, how do I flag a column whose row values have changed since the last data upload? The data type of the columns in question is VARCHAR(255), and there can be multiple columns that require such a flag.
The benefit I see in this approach is that when I have to export data, I can reduce resource utilization by pulling only those rows whose column-change flag IS NOT NULL.
Thanks for your help!

Informatica: how to get the auto-generated primary key of a table in Informatica mapping?

My question is very similar to the one below, but in an Informatica environment:
Retrieving the index of an inserted row
Here is a brief summary of the issue: I'm trying to figure out how I can insert a row into a table and then find out what value the auto-incremented id column was set to, so that I can insert additional data into another table. Our target is SQL Server 2008. The table has to be populated by Informatica ETLs, and the application also uses the same table, so we can't use the Informatica sequence generator.
In the past, when I used an Oracle database, there was an Oracle sequence generator transformation available in Informatica, but for SQL Server I am not sure.
Any solutions please?
If the values to be populated are pure sequence values and have no other meaning, you can use two sequence generators simultaneously. Use an Informatica sequence generator that generates values from -1 towards negative infinity. At the same time, the SQL Server auto-increment field will contain values from 1 upwards. There would never be a collision.
You could use a sequence generator with the 'Reset' flag enabled so that it begins with 1 each session run, then use a Lookup to cache the current max sequence value from the target table. Then you can predict the sequence number that SQL Server will generate when a record is inserted by adding NEXTVAL to the max value.
Populate the first table (the table with auto incremented id) first. After that is done, run another mapping where you do a lookup to that table, get the ID value by some other identifiable key value in the table, and proceed to populate other tables (using the retrieved ID) as you wish.
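The last suggestion (load the parent table first, then look the generated IDs up by another identifying key) can be sketched as two passes. This uses an in-memory SQLite database rather than Informatica and SQL Server, and the table names, the `code` business key and the function name are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def load_parent_then_child(
    customers: List[Tuple[str, str]], orders: List[Tuple[str, str]]
) -> List[Tuple[int, str]]:
    """Insert parents, read back their generated ids by business key, then insert children."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, code TEXT UNIQUE, name TEXT)"
    )
    conn.execute("CREATE TABLE customer_order (customer_id INTEGER, item TEXT)")

    # Pass 1: the database assigns the auto-incremented ids.
    conn.executemany("INSERT INTO customer (code, name) VALUES (?, ?)", customers)

    # Pass 2: a lookup on the business key recovers each generated id.
    id_by_code = dict(conn.execute("SELECT code, id FROM customer"))
    conn.executemany(
        "INSERT INTO customer_order VALUES (?, ?)",
        [(id_by_code[code], item) for code, item in orders],
    )
    return conn.execute(
        "SELECT customer_id, item FROM customer_order ORDER BY rowid"
    ).fetchall()
```

This approach never needs to predict the next id, so it is unaffected by the application inserting rows into the same table, as long as the business key is unique.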