DataStage incremental loading - parameter passing

We would like to perform incremental loading in DataStage (in a parallel environment), i.e. load only the delta between the previous load and the new one (to create, update, and delete records in the DWH).
We would like to store the last key retrieved during the previous load so that the next load can restart the query from the following record.
We have already successfully used a parameter to filter the SQL load query at runtime. Unfortunately, we have not yet found a way to retrieve the last key (max(Key) - Aggregator?) and store it in this parameter.
Which stage should we use to output a single value in the same parallel job and then store it in a parameter?
Any ideas?
Thanks for your help.

Think about getting the max value from your target - it is most probably a database, and a max() is easy to do.
Check out my post about getting some data from the "stream" to a parameter.
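For example, a query against the target can seed the parameter before the main extract runs. A minimal sketch, assuming a key column LastRowId, a target table dwh.fact_table, and a job parameter LAST_KEY (all names hypothetical), using DataStage's #PARAM# substitution in the extract SQL:
-- read the high-water mark from the target
SELECT MAX(LastRowId) AS LAST_KEY
FROM dwh.fact_table;
-- the runtime-parameterized extract then resumes from the next record
SELECT *
FROM source_table
WHERE LastRowId > #LAST_KEY#;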

Thanks Michael,
I've found the Head stage to get the max(LastRowId) in the same job, with 'All rows (after skip)' = False and 'Number of Rows (Per partition)' = 1. And I run the job in sequential mode...
That worked fine.

Related

How to do updates on all rows at every button click

I have a python app that has an admin dashboard.
There I have a button called "Update DB".
(The app uses MySQL and SQLAlchemy)
Once it's clicked, it makes an API call, gets a list of data, and writes that to the DB; if the API call returns new records, it adds them without duplicating currently existing records.
However, if the API call returns fewer items, it does not delete the missing ones.
Since I don't even have a "starting to google" point, I need some guidance on what type of SQL query my app should be making.
Once the button is clicked, it needs to go through all the rows:
update the existing records that changed
add new ones if the API call returned any
delete the ones the API call did not return.
What is this operation called, and how can I accomplish it in MySQL?
Once I find out about this I'll see how I can do that in SQLAlchemy.
You may want to set a timestamp column to the time of the latest action on the table and have a background thread remove old rows as a separate action. I don't know of any atomic action that performs the desired data reformation. Another option that might be satisfactory is to write the replacement batch to a staging table, rename both versions (swap), and drop the old table. HTH
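In MySQL terms, the add-or-update half of what the question describes is an upsert (INSERT ... ON DUPLICATE KEY UPDATE). The staging-table swap suggested above could look roughly like this, with the items table and its columns assumed:
-- build the replacement batch off to the side (names hypothetical)
CREATE TABLE items_staging LIKE items;
INSERT INTO items_staging (id, name, price)
VALUES (1, 'Widget', 9.99), (2, 'Gadget', 4.50);  -- the rows the API returned
-- atomically swap the new batch in, then discard the old data
RENAME TABLE items TO items_old, items_staging TO items;
DROP TABLE items_old;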

SSIS OLE DB conditional "insert"

I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a derived column object due to some character set issues (there might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange... I took the Lookup object out to see whether it was causing it, and it seems to be the case. Any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no-match output, and then connect only the match-found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is the lookup would go against the existing destination, so the lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so a row that found a match would now get doubled. Since you are looking for existing rows, that usually implies you'd want to perform an update to the existing row. If that's the case, there is a specially designed transformation, the OLE DB Command; it is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table, write all of your changed rows (updates) into a second staging-type table, and then, after the data flow is complete, use an Execute SQL Task to perform a set-based update statement, as sketched below.
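A minimal sketch of that set-based update, assuming hypothetical dbo.Funds and dbo.Funds_Staging tables keyed on FundID:
-- apply the staged changes in one set-based statement (names assumed)
UPDATE d
SET    d.FundName  = s.FundName,
       d.FundValue = s.FundValue
FROM   dbo.Funds AS d
JOIN   dbo.Funds_Staging AS s
    ON s.FundID = d.FundID;
-- clear the staging table for the next run
TRUNCATE TABLE dbo.Funds_Staging;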
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.

Notify Creation of MySQL Records with AJAX

I am working on a project and would like a notification each time a new record is added to a specific table in a MySQL DB. I would like a small popup to be displayed each time a new record is added, but without refreshing the page. I heard AJAX was the way to go, but am not very familiar with it.
Nope, it won't. The database won't tell you when a record is inserted. You can use AJAX to send a request to the server; the server can then query for changes and send a response indicating whether there is a change, and the AJAX request's response handler can show a message accordingly.
But implementing this will put quite a load on both the web server and the database server, so if you do this, choose the timing wisely. Don't execute this procedure 10 times a second or you'll kill your server as soon as you hit 100 visitors.
To solve your problem, break it up in two pieces:
1. Get the actual AJAX request to work. Let the server return dummy values and try to handle them correctly. Hint: use jQuery.ajax (or even jQuery.get) to ease your life.
2. Get the server to query for changes. If you want to monitor a single table, this can easily be done. Add a timestamp column to the table if you don't already have one; you can configure it so that it is updated each time a row is updated. Then query for the highest timestamp, as sketched below. Don't forget to add an index to that column!
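A minimal sketch of that setup in MySQL, with a hypothetical messages table:
-- maintain a last-modified timestamp automatically (names assumed)
ALTER TABLE messages
  ADD COLUMN updated_at TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  ADD INDEX idx_messages_updated_at (updated_at);
-- the endpoint behind the AJAX call runs this and returns the value;
-- the client compares it with the timestamp from the previous poll
SELECT MAX(updated_at) AS last_change FROM messages;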
You can experiment with other solutions too. Add a trigger that updates a date/time in a different table; that way, the polling only needs to read that single value instead of running the 'max' query.
To handle the change correctly, I think it's best to let the JavaScript hold on to the last timestamp. Send the timestamp back in the response; the JavaScript can compare it to the last timestamp and show a message if needed. This way, you won't need to keep the timestamps in the session.

SQL Server: unique key for batch loads

I am working on a data warehousing project where several systems are loading data into a staging area for subsequent processing. Each table has a "loadId" column which is a foreign key against the "loads" table, which contains information such as the time of the load, the user account, etc.
Currently, the source system calls a stored procedure to get a new loadId, adds the loadId to each row that will be inserted, and then calls a third sproc to indicate that the load is finished.
My question is: is there any way to avoid having to pass the loadId back to the source system? For example, I was imagining that I could get some sort of connection id from SQL Server that I could use to look up the relevant loadId in the loads table. But I am not sure if SQL Server has a variable that is unique to a connection?
Does anyone know?
Thanks,
I assume the source systems are writing/committing the inserts into your source tables, and multiple loads are NOT running at the same time...
If so, have the source load call a stored proc, newLoadStarting(), prior to starting the load. This stored proc will update the load table (create a new row, record the start time).
Put a trigger on your loadID column that gets max(loadID) from this table and inserts it as the current load id.
For completeness you could add an endLoading() proc which sets an end date and deactivates that particular load.
If you are running multiple loads at the same time in the same tables...stop doing that...it's not very productive.
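A rough sketch of that flow, with every object name assumed:
-- start a new load (creates the row whose identity becomes the load id)
CREATE PROCEDURE dbo.newLoadStarting AS
    INSERT INTO dbo.loads (startTime) VALUES (GETDATE());
GO
-- stamp incoming staging rows with the latest load id
CREATE TRIGGER trg_stage_setLoadId ON dbo.stageTable
AFTER INSERT AS
    UPDATE s
    SET    loadId = (SELECT MAX(loadId) FROM dbo.loads)
    FROM   dbo.stageTable AS s
    JOIN   inserted AS i ON i.rowId = s.rowId;   -- assumes a key column rowId
GO
-- close the load out
CREATE PROCEDURE dbo.endLoading AS
    UPDATE dbo.loads SET endTime = GETDATE() WHERE endTime IS NULL;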
A local temp table (with one pound sign, #temp) is unique to the session; dump the ID in there, then select from it.
BTW, this will only work if you use the same connection.
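For example, assuming a hypothetical proc that SELECTs the new id:
-- session-local scratch space: visible only to this connection
CREATE TABLE #load (loadId INT NOT NULL);
-- capture the proc's result set
INSERT INTO #load (loadId)
EXEC dbo.sp_GetNewLoadId;
-- any later statement on the same connection can read it back
SELECT loadId FROM #load;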
In the end, I went for the following solution "pattern", pretty similar to what Markus was suggesting:
I created a table with a loadId column, default null (plus some other audit info like createdDate and createdByUser);
I created a view on the table that hides the loadId and audit columns, and only shows rows where loadId is null;
The source systems load data through the view, not the table;
When they are done, the source system calls a "sp__loadFinished" procedure, which puts the right value in the loadId column and does some other logging (number of rows received, date called, etc). I generate this from a template as it is repetitive.
Because loadId now has a value for all those rows, they are no longer visible to the source system, and it can start another load if required.
I also arrange for each source system to have its own schema, which is the only thing it can see and is its default on logon. The view and the sproc are in this schema, but the underlying table is in a "staging" schema containing data across all the sources. I ensure there are no collisions through a naming convention.
Works like a charm, including the one case where a load can only be complete if two tables have been updated.
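A condensed sketch of the pattern, with all names assumed (the loads table is presumed to have an identity loadId column):
-- hidden base table, shared across sources
CREATE TABLE staging.Customer (
    CustomerId  INT          NOT NULL,
    Name        VARCHAR(100) NOT NULL,
    loadId      INT          NULL,                      -- NULL while in flight
    createdDate DATETIME     NOT NULL DEFAULT GETDATE()
);
GO
-- what the source system sees (in its default schema)
CREATE VIEW src1.Customer AS
    SELECT CustomerId, Name
    FROM   staging.Customer
    WHERE  loadId IS NULL;
GO
-- stamps the in-flight rows when the source says it is done
CREATE PROCEDURE src1.sp__loadFinished AS
BEGIN
    DECLARE @loadId INT;
    INSERT INTO staging.loads (loadDate) VALUES (GETDATE());
    SET @loadId = SCOPE_IDENTITY();
    UPDATE staging.Customer SET loadId = @loadId WHERE loadId IS NULL;
END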

SSIS - user variable used in derived column transform is not available - in some cases

Unfortunately I don't have a repro for my issue, but I thought I would try to describe it in case it sounds familiar to someone... I am using SSIS 2005, SP2.
My package has a package-scope user variable - let's call it user_var.
The first step in the control flow is an Execute SQL task which runs a stored procedure. All that SP does is insert a record in a SQL table (with an identity column) and then go back and get the max ID value. The Execute SQL task saves this output into user_var.
The control flow then has a Data Flow Task - it goes and gets some source data, has a derived column which sets a column called run_id to user_var, and saves the data to a SQL destination.
In most cases (this template is used for many packages, running every day) this all works great. All of the destination records created get set with a correct run_id.
However, in some cases, there is a set of the destination data that does not get run_id equal to user_var, but instead gets a value of 0 (0 is the default value for user_var).
I have 2 instances where this has happened, but I can't reproduce it. In both cases, it was just less than 10,000 records that have run_id = 0. Since SSIS writes data out in 10,000-record blocks, this really makes me think that, for the first block of data written out, user_var was not yet set; then, after that first block, run_id is set to a correct value for the rest of the data.
But control passed on to my data flow from the Execute SQL task - it would have seemed reasonable to me that it wouldn't go on until the SP has completed and user_var is set. Maybe it just runs the SP, but doesn't wait for it to complete?
In both cases where this has happened there seemed to be a few packages hitting the table to get a new user_var at about the same time. And in both cases lots of data was written (40 million rows, 60 million rows) - my thinking is that that means the writes were happening for a while.
Sorry to be both long-winded AND vague. A winning combination! Does this sound familiar to anyone? Thanks.
Updating to show the SP I use to get the user_var:
CREATE PROCEDURE [dbo].[sp_GetRunIDForPackage] (@pkg varchar(50)) AS
-- add a new entry for this run of this package - the RUN_ID is an IDENTITY column and so
-- will get created for us
INSERT INTO shared.STAGE_LOAD_JOB( EFFECTIVE_TS, EXECUTED_BY )
VALUES( getdate(), @pkg )
-- now go back into the table and get the new RUN_ID for this package
SELECT MAX( RUN_ID )
FROM shared.STAGE_LOAD_JOB
WHERE EXECUTED_BY = @pkg
Is this variable being accessed lots of times, from lots of places? Do you have a bunch of parallel data flows using the same variable?
We've encountered a bug in both SQL 2005 and 2008 whereby a "race condition" causes the variable to be inaccessible from some threads, so the default value is used. In our case, the variable was our "base folder" location for packages, causing our overall execution control package to not find its sub-packages.
More detail here: SSIS Intermittent variable error: The system cannot find the file specified
Unfortunately, the work-around is to hard-code into the variable a default value that will work when the race condition happens. Easy for us (set the base folder to be correct for our prod environment), but it looks a lot harder for your issue.
Perhaps you could use multiple variables (one for each data flow), and a bunch of Execute SQL tasks to populate those variables? REALLY ugly, but it should help.
Did you check the value of user_var before getting to the Derived Column component? It sounds like user_var may be 0, so you are effectively doing run_id = user_var = 0. I may be naive to think it is that simple, but that's the first thing I would check.
Given the procedure code, you might want to replace this:
SELECT MAX( RUN_ID )
FROM shared.STAGE_LOAD_JOB
WHERE EXECUTED_BY = @pkg
with this:
select scope_identity()
The scope_identity() function returns the identity value generated in the current scope, which here is the procedure, so it cannot pick up a row inserted by a concurrent run of the same package the way MAX(RUN_ID) can. Not sure if this will solve the problem, but I find it best to work through all such issues as they might have unrelated consequences.
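Putting the two together, the procedure would look something like this sketch:
CREATE PROCEDURE [dbo].[sp_GetRunIDForPackage] (@pkg varchar(50)) AS
-- the RUN_ID IDENTITY value is created for us on insert
INSERT INTO shared.STAGE_LOAD_JOB( EFFECTIVE_TS, EXECUTED_BY )
VALUES( getdate(), @pkg )
-- return the identity generated in this scope, immune to concurrent inserts
SELECT SCOPE_IDENTITY() AS RUN_ID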