I have a base data table on SQL Server 2012 that is replaced every night with a new data set with identical columns. Once the data table is replaced, how do I flag a column whose row values have changed since the last data upload? The data type of the columns in question is VARCHAR(255), and there can be multiple columns that require such a flag.
The benefit I see in implementing this is that when I have to export data, I can reduce resource utilization by pulling only those rows whose change-flag column IS NOT NULL.
Thanks for your help!
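One way to sketch the compare-and-flag idea - shown here with SQLite through Python, and with hypothetical `base`/`staging` table names; the same join logic translates to T-SQL on SQL Server:

```python
import sqlite3

# Hypothetical tables: base (yesterday's data, with a flag column)
# and staging (tonight's fresh load).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base    (id INTEGER PRIMARY KEY, city TEXT, changed_flag TEXT);
CREATE TABLE staging (id INTEGER PRIMARY KEY, city TEXT);
INSERT INTO base    VALUES (1, 'London', NULL), (2, 'Paris', NULL);
INSERT INTO staging VALUES (1, 'London'), (2, 'Berlin');
""")

# Flag rows whose value differs from the previous load, then pull
# the new value across in the same statement.
con.execute("""
UPDATE base
SET changed_flag = 'Y',
    city = (SELECT s.city FROM staging s WHERE s.id = base.id)
WHERE city IS NOT (SELECT s.city FROM staging s WHERE s.id = base.id)
""")

flags = dict(con.execute("SELECT id, changed_flag FROM base"))
print(flags)  # only row 2 is flagged
```

The nightly export can then be restricted with `WHERE changed_flag IS NOT NULL`. Note this sketch only handles rows present in both tables; brand-new ids in staging would need a separate INSERT step.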
Related
Is it possible in MySQL in a single statement to both modify the data type of a column and update the existing data in that column so it's compatible with the new data type? Or would I need to, for example, create a new column with the new data type, then use the original column's data to update the values of the new column, and then finally drop the original column?
I want to change an int(11) to a time column, where the column's original int(11) data can range from 0 to 180, representing 5-minute timeslots from 8am-11pm. So, for example, the original int(11) data of say 0 would need to get updated to the new time data of 08:00:00, or 96 updated to 16:00:00, etc.
Obviously if I try modifying the column's data type without any "intervention" on the existing data with the following:
ALTER TABLE <table> MODIFY <column> time DEFAULT NULL
...I'm guessing MySQL will simply not allow it to happen since my old data is likely not compatible with the new data type, and it will obviously also not know how to parse the data the way I want as described above.
No, you'll have to do this in several steps as you describe.
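The conversion step in the middle of those stages is just arithmetic. A sketch of the slot-to-time mapping in Python - in MySQL, the update on the temporary column could be something like `SEC_TO_TIME(28800 + old_col * 300)` (28800 seconds = 08:00, 300 seconds per 5-minute slot):

```python
# Map a 5-minute timeslot index (0-180, starting at 08:00) to a TIME string.
def slot_to_time(slot: int) -> str:
    total_minutes = 8 * 60 + slot * 5          # 08:00 plus 5 minutes per slot
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}:00"

print(slot_to_time(0))    # 08:00:00
print(slot_to_time(96))   # 16:00:00
print(slot_to_time(180))  # 23:00:00
```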
When I copy data from one column in source table A (I only want to copy one column) to a new column in destination table B using INSERT INTO, the copied data is placed at the bottom of the destination column instead of at the top, where I want it to go. The destination table has six columns; the source table has three. I added an "empty" column to the destination table to receive the data from the source table's column, and I was forced to define it as a nullable column.
To illustrate: say the destination table has 5 columns of 1000 rows each, and the destination column is nominally "empty" before the copying. When the copy finishes, the first 1000 rows of the destination column are filled with NULL values and the copied data begins at row 1001. Now only the new destination column holds any data after row 1000.
Deleting the NULL values from the first 1000 rows doesn't move the copied data up to the top. I suspect the problem may be that the column definitions of the destination table were originally all set to allow NULL. They should probably have been set to NOT NULL, but I don't know if this is what is causing my problem.
Any help greatly appreciated.
Everything in SQL is organized by rows, not columns. INSERT always inserts new rows; it never modifies existing ones (unless you specify ON DUPLICATE KEY UPDATE ...). It sounds like you want to update existing rows: you use the UPDATE statement for that, not the INSERT statement, and you have to provide information to identify which rows to update.
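A minimal sketch of the difference, using SQLite via Python and made-up table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Destination table B with an existing keyed row; extra_col is the "empty" column.
con.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT, extra_col TEXT)")
con.execute("INSERT INTO b VALUES (1, 'alice', NULL)")
# Source table A holding the one column to copy across.
con.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute("INSERT INTO a VALUES (1, 'copied value')")

# UPDATE fills the new column on the EXISTING row, matched by key;
# an INSERT here would have appended a brand-new row at the bottom instead.
con.execute("""
UPDATE b
SET extra_col = (SELECT payload FROM a WHERE a.id = b.id)
WHERE id IN (SELECT id FROM a)
""")

rows = list(con.execute("SELECT * FROM b"))
print(rows)  # still one row, now with extra_col filled
```

The key point is the join condition (`a.id = b.id` here): without something identifying which existing row receives which value, the database has no way to line the two columns up.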
I am working on a NiFi data flow where my use case is to fetch MySQL table data and put it into HDFS or the local file system.
I have built a data flow pipeline where I used QueryDatabaseTable processor ------ ConvertRecord --- PutFile processor.
My Table Schema ---> id, name, city, Created_date
I am able to receive files in the destination even when I am inserting new records into the table.
But, but ....
When I am updating existing rows, the processor is not fetching those records; it looks like it has some limitation.
My question is: how do I handle this scenario? Either with some other processor, or do I need to update some property?
Please, someone help.
#Bryan Bende
QueryDatabaseTable Processor needs to be informed which columns it can use to identify new data.
A serial id or a created timestamp only increases on inserts, so on its own it is not sufficient to detect updates.
From the documentation:
Maximum-value Columns:
A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column's values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.
Judging by the table schema, there is no SQL way of telling whether a row was updated.
There are many ways to solve this. In your case, the easiest might be to rename the created column to modified and set it to now() on every update,
or to work with a second timestamp column.
So, for instance, the new column added is
| stamp_updated | timestamp | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
and in the processor you use the stamp_updated column to identify new data.
Don't forget to set Maximum-value Columns to those columns.
So what I am basically saying is:
If you cannot tell that it is a new record in sql yourself, nifi cannot either.
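The incremental fetch that QueryDatabaseTable performs can be sketched roughly like this (SQLite via Python, with a hypothetical stamp_updated column) - the processor stores the maximum value it has seen and only fetches rows beyond it on the next run:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, name TEXT, stamp_updated TEXT);
INSERT INTO t VALUES (1, 'old row',     '2024-01-01 10:00:00');
INSERT INTO t VALUES (2, 'updated row', '2024-01-02 09:30:00');
""")

last_max = '2024-01-01 10:00:00'   # state the processor keeps between runs

# Only rows whose max-value column moved past the stored state are fetched,
# which is why an UPDATE must also bump stamp_updated to be picked up.
rows = list(con.execute(
    "SELECT id, name FROM t WHERE stamp_updated > ?", (last_max,)))
print(rows)
```

This is only an illustration of the mechanism; in NiFi itself you just list the column in Maximum-value Columns and the processor manages the state for you.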
I have a file like as seen below: Just Ex:
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip header information and load flatfile to DB
Like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary (not fixed no. of rows) from file to file.
Any help appreciated. Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this e.g. in an SSIS Script task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
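The script's logic could look something like this - sketched in Python for brevity (an SSIS Script Task would be C# or VB.NET, but the approach is identical): scan until the known header row, then keep it and everything after it.

```python
import io

# Stand-in for the incoming flat file (junk preamble, then header, then data).
raw = io.StringIO(
    "kwqif h;wehf uhfeqi f ef\n"
    "gldw nbb ouyyu buyuy bjbuy\n"
    "ID Name Address\n"
    "1 Andrew UK\n"
    "2 John US\n"
)

# Skip everything until the known header row, then keep header + data rows.
lines = raw.read().splitlines()
start = next(i for i, line in enumerate(lines)
             if line.strip() == "ID Name Address")
cleaned = lines[start:]
print(cleaned)
```

In the real package you would write `cleaned` out to a new file and point the Flat File Source at that.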
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
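The keep/redirect decision that conditional split makes can be sketched as a simple numeric test on the first field (plain Python, made-up sample rows):

```python
rows = [
    "fekjfnkenfekfh ijferihfq",   # junk preamble line
    "ID Name Address",            # header line
    "1 Andrew UK",                # real data
    "2 John US",
]

def first_field_is_numeric(line: str) -> bool:
    fields = line.split()
    return bool(fields) and fields[0].isdigit()

# The "convert succeeded" branch keeps the row; failures are redirected
# (here: collected so they can be inspected, like rows sent to a row count
# or an error table in the data flow).
kept = [r for r in rows if first_field_is_numeric(r)]
rejected = [r for r in rows if not first_field_is_numeric(r)]
print(kept)
```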
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate it with numbers - not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 - 100 into a table and throw away any rows you load which don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have data you care about in (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.
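A rough sketch of the staging-table approach (SQLite via Python; on SQL Server 2012 the filter could instead be something like `TRY_CAST(col1 AS int) IS NOT NULL`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stage the whole file as wide string columns, junk rows and all.
con.execute("CREATE TABLE staging (col1 TEXT, col2 TEXT, col3 TEXT)")
con.executemany("INSERT INTO staging VALUES (?, ?, ?)", [
    ("kwqif", "h;wehf", "uhfeqi"),   # junk preamble
    ("ID", "Name", "Address"),       # header row
    ("1", "Andrew", "UK"),
    ("2", "John", "US"),
])

# Second step: carry forward only rows whose first column starts numeric.
data = list(con.execute(
    "SELECT col1, col2, col3 FROM staging WHERE col1 GLOB '[0-9]*'"))
print(data)
```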
I'm working on an SSIS package, adding update functionality (updating rows by using a staging table). To do this, I use a lookup and conditional split where I compare all the columns.
For some reason, some of the data throws false positives and marks rows as changed, when they have not. I've isolated this down to a single string column (zip code).
The column comes straight from a lookup. The source data column is varchar(9), the destination (i.e. source of second value) is char(9). In SSIS, both columns come through as DT_STR,9,1252
If I start with an empty table, and run the package twice, the second time about 20% of the rows come up as changed, even though they haven't. The following sql joins the existing rows to the "updated" rows in the staging table and compares their zips:
SELECT a.key_DestinationZip, b.key_DestinationZip,
CASE WHEN a.key_DestinationZip = b.key_DestinationZip then 1 else 0 end
FROM [dbo].[sta_Sales] as a
join [dbo].[fact_Sales] as b
on a.key_FullSalesNumber = b.key_FullSalesNumber
with results similar to
78735 78735 1
38138 38138 1
Your source data is varchar(9) and your lookup data is char(9). I believe, but have not tested, this is resulting in |65401| and |65401 | (4 spaces there and pipes for delineation only) in your data.
The data coming from your source system is going to be affected by the ANSI_PADDING setting when it was loaded. By default, SSIS isn't going to pad out the string.
Therefore, in your lookup, you will want to either pad the source data to 9 characters or trim the lookup's key.
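The mismatch and both fixes are easy to see in miniature (plain Python, assumed sample value):

```python
# char(9) pads with trailing spaces; varchar(9) does not.
zip_varchar = "65401"       # as loaded from the varchar(9) source
zip_char    = "65401    "   # as stored in the char(9) destination (4 spaces)

print(zip_varchar == zip_char)            # the false positive: not equal
print(zip_varchar == zip_char.rstrip())   # fix 1: trim the lookup side
print(zip_varchar.ljust(9) == zip_char)   # fix 2: pad the source to 9 chars
```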
And unrelated to this, but you might want to store the postal code separately from the zip+4 data. The latter is more likely to change than the former when/if you ever run your data through an address validation service.
It looks to me like the problem is that your data has two zip codes.