Good day!
I have an SSIS package that retrieves data from a database and exports it to a flat file (a simple process). The issue I am having is that the data my package retrieves each morning depends on a separate process loading it into a table before my package runs.
Now, the process that initially loads the data also inserts metadata into a table recording the start and end date/time. I would like to set up something in my package that checks the metadata table for an end date/time for the current date. If the current date exists, then the process continues... if no date/time exists then the process stops (here is the kicker) BUT the package re-triggers itself automatically an hour later to check if the initial data load is complete.
I have done research on checkpoints, etc., but all that seems to cover is the package picking up where it left off when it is restarted after a failure. I don't want to manually re-trigger the process; I'd like it to check the metadata and restart itself if possible. I could even add logic so that after checking the metadata 3 times it stops completely.
Thanks so much for your help
What you want isn't possible exactly the way you describe it. When a package finishes running, it's inert. It can't re-trigger itself; something has to re-trigger it.
That doesn't mean you have to do it manually. The way I would handle this is to have an agent job scheduled to run every hour for X number of hours a day. The job would call the package every time, and the metadata would tell the package whether it actually needs to do anything or should simply do nothing.
There would be a couple of ways to handle this.
They all start by setting up the initial check, just as you've outlined above. See if the data you need exists. Based on that, set a boolean variable (I'll call it DataExists) to TRUE if your data is there or FALSE if it isn't. Or 1 or 0, or what have you.
Create two Precedence Constraints coming off that task; one that requires that DataExists==TRUE and, obviously enough, another that requires that DataExists==FALSE.
The TRUE path is your happy path. The package executes your code.
On the FALSE path, you have options.
Personally, I'd have the FALSE path lead to a forced failure of the package. From there, I'd set up the job scheduler to wait an hour, then try again. BUT I'd also set a limit on the retries. After X retries, go ahead and actually raise an error. That way, you'll get a heads up if your data never actually lands in your table.
If you don't want to (or can't) get that level of assistance from your scheduler, you could mimic the functionality in SSIS, but it's not without risk.
On your FALSE path, trigger an Execute SQL Task with a simple WAITFOR DELAY '01:00:00.00' command in it, then have that task call the initial check again when it's done waiting. This will consume a thread on your SQL Server and could end up getting dropped by the SQL Engine if it gets thread starved.
Going the second route, I'd set up another Iteration variable, increment it with each try, and set a limit in the precedence constraint to, again, raise an actual error if your data doesn't show up within a reasonable number of attempts.
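For what it's worth, here is a rough sketch of that second route's control flow, written in Python with pyodbc purely to illustrate the check / wait an hour / give-up-after-3-tries logic; in SSIS this would be a package variable, precedence constraints, an Execute SQL Task, and the WAITFOR DELAY. The metadata table, column names, connection string, and export call are assumptions, not your actual objects.

    # Illustrative only: the check / wait / retry-with-a-limit pattern.
    # dbo.LoadMetadata, EndDateTime, the DSN and the export call are all
    # hypothetical placeholders.
    import time
    import pyodbc

    MAX_ATTEMPTS = 3
    CHECK_SQL = """
        SELECT COUNT(*) FROM dbo.LoadMetadata
        WHERE CAST(EndDateTime AS date) = CAST(GETDATE() AS date)
    """

    conn = pyodbc.connect("DSN=MyWarehouse")

    for attempt in range(1, MAX_ATTEMPTS + 1):
        data_exists = conn.execute(CHECK_SQL).fetchone()[0] > 0
        if data_exists:
            run_export_to_flat_file()      # hypothetical happy-path work
            break
        if attempt == MAX_ATTEMPTS:
            raise RuntimeError("Data never arrived; giving up after 3 checks.")
        time.sleep(60 * 60)                # wait an hour, then check again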
Thanks so much for your help! With some additional research I found the following article, which I was able to reference to create a solution for my needs. Although my process doesn't rely on a failure to trigger a retry, I set the process to force-fail after 3 attempts.
http://microsoft-ssis.blogspot.com/2014/06/retry-task-on-failure.html
Much appreciated
Best wishes
I have an SSIS package that reads from and writes to a table using OLE DB source and target. Recently we've added two new columns to the table. My SSIS package doesn't use these columns, but now I'm getting "The external columns are out of synchronization with the data source" warnings when I open the package.
I tried to run the package and it finished successfully, but I can see these warnings in the execution results. I could refresh the metadata of course, but there are many packages that are running in Production and using this table, so I don't think it's a good idea to refresh all of them and redeploy...
Is it a good idea to set ValidateExternalMetadata to false when I create a package so I won't get these warnings in the future? Any other suggestions for this?
The ValidateExternalMetadata setting is a case of "you can pay me now or pay me later."
A data flow must ensure that the metadata it was built against remains true whenever the task runs. The SSIS designer also validates the metadata whenever the package is opened for editing.
Flipping the default setting from True to False can save you cycles during development if a participant (source/destination) is extremely complex/busy/latency-filled.
Setting this to False can also improve the start time of an SSIS package, as the task will only be validated if it runs. Say you have a Foreach file enumerator and it only finds a file once a quarter (but runs every day because Accounting can't quite tell you when they'll have the final numbers ready, HYPOTHETICALLY SPEAKING). Since the Data Flow task will only get validated 4 out of 365.25 days, that could be a worthwhile performance saving. Probably not much, but if you're trying to eke out every last bit of performance, that's a knob you can flip.
Another coding aphorism is that "warnings are just errors waiting to grow up." Adding columns is unlikely to grow into an error, but you are now spending CPU cycles on every package execution raising and handling the OnWarning event due to mismatched metadata. If you have an ops team, they may grow complacent about warnings from SSIS packages and miss a more critical warning.
A way to avoid this in the future is to write explicit queries in your source. Currently, the SELECT * (or the underlying table used directly as the source) is reporting back the new columns, which is where the impedance mismatch comes into play. If you only ever bring in the columns you explicitly ask for, adding columns won't cause this warning to surface.
Removing a column, of course, will flat-out cause the package to fail (under either the explicit or the implicit column selection approach).
I have a long running task that updates some SQLAlchemy objects. A session is opened at the start of the task, updates are made along the way, and the transaction is committed at the end. The problem is that the task runs very long, so the connection will have closed (timed out, "gone away", whatever you want to call it) before the commit is even able to happen. This will cause the commit to fail and the whole task to fail.
This seems like absolutely the correct way to write to a DB for short tasks or non-Celery-related things. But it is certainly a problem if the tasks take too long.
Is there some other recommended pattern? Should the Celery task not even utilize the SQLAlchemy objects and instead use some sort of static class whose data can be used to update the actual SQLAlchemy objects, only at the end of the task, maybe? That is the only possible solution I have come up with. I would like to know if there are others or if my idea has other problematic implications.
Open the transaction only when it actually needs to happen. In other words, keep the session closed while the long task is running, but keep the objects around and edit them in the task. At the end, open a new session, use merge() to reattach the objects, and commit. Make sure expire_on_commit is set to False so the objects' attributes aren't expired (and don't require a new database round trip) after a commit.
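A minimal sketch of that pattern; the mapped class, table, connection URL, and do_expensive_work helper below are all made up for illustration:

    # Sketch: load with a short-lived session, work on detached objects,
    # then merge() them back into a fresh session and commit at the end.
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class MyModel(Base):                 # hypothetical mapped class
        __tablename__ = "my_table"
        id = Column(Integer, primary_key=True)
        status = Column(String(50))

    engine = create_engine("mysql+pymysql://user:secret@localhost/mydb")
    # expire_on_commit=False keeps attribute values usable after commit/close
    Session = sessionmaker(bind=engine, expire_on_commit=False)

    def long_running_task(object_ids):
        # Short-lived session #1: load the objects, then release the connection.
        session = Session()
        objects = session.query(MyModel).filter(MyModel.id.in_(object_ids)).all()
        session.close()

        # The long work happens on plain, detached objects; no connection is held.
        for obj in objects:
            obj.status = do_expensive_work(obj)   # hypothetical helper

        # Short-lived session #2: reattach with merge() and commit at the end.
        session = Session()
        for obj in objects:
            session.merge(obj)
        session.commit()
        session.close()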
So, I have to read an Excel file in which each row contains some data that I want to write to my database. I pass the whole file to Laravel, it reads the file and formats it into an array, and then I make a new insertion (or update) in my database.
The thing is, the input Excel file can contain thousands of rows, and it's taking a while to complete, causing a timeout error in some cases.
When I run this locally I use the set_time_limit(0); function so the timeout doesn't occur, and it works pretty well. But on the remote server this function is disabled for security reasons and my code crashes because of the timeout.
Can somebody help me solve this problem? Or maybe suggest a better way to approach it?
A nice way to handle tasks that take a long time is to make use of so-called jobs.
You can make a job called ImportExcel and dispatch it when someone sends you a file.
Take a good look at the docs; they have some great examples of how to do this.
You can take care of this using the following steps:
1. Take the CSV file and store it temporarily in storage:
You can store the large CSV when the user uploads it. If it's something that isn't uploaded from the frontend, just make sure it is saved somewhere so it can be processed in the next step.
2. Then dispatch a job which can be queued:
You can create a job that handles this asynchronously. You can use Supervisor to manage the queue workers, timeouts, etc.
3. Use a package like thephpleague/csv:
Using this package (or a similar one), you can chunk the records or read them one at a time. It is really helpful for keeping your memory usage under the limit. Plus it has different methods available for reading the data from files. (The general chunking idea is sketched just after this list.)
4. Once the file is processed, you can delete it from temporary storage:
Just some teardown cleanup activity.
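The steps above are Laravel-specific, but the chunked-reading idea in step 3 is language-agnostic. Purely to illustrate the memory pattern, here is a minimal sketch in Python using the standard csv module; the file path, table, columns, and batch size are made-up examples, not anything from your project:

    # Read a handful of rows at a time and insert them in batches, so the whole
    # file never sits in memory. Paths, table and column names are illustrative.
    import csv
    import sqlite3          # stand-in for whatever database you actually use

    BATCH_SIZE = 500

    def import_csv(path="uploads/import.csv"):
        conn = sqlite3.connect("app.db")
        insert = "INSERT INTO contacts (name, email) VALUES (?, ?)"
        batch = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):          # yields one row at a time
                batch.append((row["name"], row["email"]))
                if len(batch) >= BATCH_SIZE:
                    conn.executemany(insert, batch)
                    conn.commit()
                    batch.clear()
        if batch:                                   # flush the final partial batch
            conn.executemany(insert, batch)
            conn.commit()
        conn.close()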
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executing once. The thing is that I want to get the new data from the MySQL server into Datastore in near real-time, but JdbcIO gives bounded data as a source (since it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redo it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
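The classes mentioned above are from the Beam Java SDK. Purely as an illustration of the same periodic-polling idea in the Beam Python SDK (where PeriodicImpulse plays roughly the role of GenerateSequence.withRate()), a sketch might look like the following; the table, bookkeeping column, and connection details are assumptions:

    # Sketch only: fire an impulse every 30 seconds and have a DoFn poll MySQL
    # for rows newer than the last sync. Table/column names and credentials are
    # hypothetical placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.periodicsequence import PeriodicImpulse

    class SyncNewRows(beam.DoFn):
        def process(self, _):
            import pymysql                      # any DB-API driver would do
            conn = pymysql.connect(host="mysql-host", user="user",
                                   password="secret", database="mydb")
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT id, payload FROM source_table "
                                "WHERE inserted_at > "
                                "(SELECT last_synced FROM sync_state)")
                    for row in cur.fetchall():
                        yield row
            finally:
                conn.close()

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "Every30s" >> PeriodicImpulse(fire_interval=30)
         | "PollMySQL" >> beam.ParDo(SyncNewRows())
         # ... convert the rows and write them to Datastore here
        )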
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redo it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
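For illustration, here is roughly what that looks like in the Beam Python SDK (the question uses the Java SDK, where the shape is analogous); the topic name and message format are assumptions:

    # Sketch: Pub/Sub is an unbounded source, so the streaming pipeline keeps
    # running and processes each published MySQL record as it arrives.
    # The topic name and JSON payload format are illustrative assumptions.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadNewRecords" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/mysql-inserts")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         # ... map each dict to a Datastore entity and write it here
        )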
Before converting a project to use MySQL, I have questions regarding the best way to avoid losing a simple record update due to either a server crash or the program being shut down for exceeding a/the CGI run-time limit.
My project is public and therefore has to work on any of many hosts where high-level server-side management isn't an option.
I wish to open a list file (or table) and acquire a list of records to parse one at a time.
While parsing each acquired record, the program/script would perform a task with it and update a counter (a simple table) upon successful completion of each task (or, alternatively, update each record with a success flag).
Do MySQL tables get automatically written to the hard drive when they are updated or added to, thus avoiding the loss of all table changes made up to the point of a crash if/when the program/script is violently terminated as described?
To have any chance of doing the same with simple text files, the counter file has to be opened and closed for each update (since on most operating systems the contents of open files get clobbered in a crash).
Any outline of the MySQL commands/processes to follow, if needed to avoid the losses described, would also be very much appreciated.
Also, do any suggestions apply to both InnoDB and MyISAM?
A simple answer comes to mind: SQL TRANSACTIONS. A transaction is a batch of SQL statements that 1. has to be explicitly committed and 2. takes effect only if every statement in it executes successfully.
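(Note that transactions require the InnoDB engine; MyISAM tables don't support them.) As a rough sketch of the commit-per-record idea from a script, using the mysql-connector-python driver; the table and column names and the do_task helper are made up for illustration:

    # Each loop iteration is one transaction: the per-record flag and the
    # counter are committed together, so a crash mid-run loses at most the
    # record currently being worked on. Names below are hypothetical.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="mydb")
    cur = conn.cursor()

    cur.execute("SELECT id, payload FROM work_list WHERE done = 0")
    for record_id, payload in cur.fetchall():
        try:
            do_task(payload)                   # hypothetical per-record work
            cur.execute("UPDATE work_list SET done = 1 WHERE id = %s",
                        (record_id,))
            cur.execute("UPDATE progress SET counter = counter + 1")
            conn.commit()                      # both updates become durable here
        except Exception:
            conn.rollback()                    # neither update sticks on failure
            raise

    cur.close()
    conn.close()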
I think this would help:
http://www.sqlteam.com/article/introduction-to-transactions
If my answer wasn't correct, please let me know if I misunderstood your intentions.