A way to execute a pipeline periodically from a bounded source in Apache Beam - MySQL

I have a pipeline that takes data from a MySQL server and inserts it into Datastore using the Dataflow Runner.
It works fine as a batch job executed once. The thing is that I want to get the new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (since it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redo it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!

This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)), which will give you a PCollection<Long> into which one element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it takes a PCollection of query parameters and so can be triggered every time a new element appears in that PCollection.
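For concreteness, here is a rough sketch of the shape this could take (the driver, connection string, query, and column names are placeholders, not taken from the question, and the Datastore write is left as a comment):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class PeriodicJdbcSync {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // One Long is emitted every 30 seconds; this makes the pipeline unbounded.
        PCollection<Long> ticks =
            p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)));

        // Each tick triggers a fresh query through JdbcIO.readAll().
        PCollection<KV<String, String>> rows = ticks.apply(
            JdbcIO.<Long, KV<String, String>>readAll()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/mydb") // placeholders
                        .withUsername("user")
                        .withPassword("password"))
                // Placeholder query: fetch only rows inserted since the previous tick.
                .withQuery("SELECT id, payload FROM my_table WHERE created_at > NOW() - INTERVAL 30 SECOND")
                .withParameterSetter((tick, statement) -> { /* no query parameters in this sketch */ })
                .withRowMapper(rs -> KV.of(rs.getString("id"), rs.getString("payload")))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

        // ... convert each KV to a Datastore Entity and write it with DatastoreIO here ...

        p.run();
      }
    }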
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit the new elements.
That said, what Andrew suggested (additionally publishing the records to Pub/Sub) is also a very valid approach.

Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
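If you go that route, the streaming side of the pipeline could look roughly like the sketch below (project and subscription names are placeholders, and the parsing/Datastore write steps are left as comments):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.values.PCollection;

    public class PubsubToDatastore {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        // Pub/Sub is an unbounded source, so the job keeps running and pulling messages.
        PCollection<String> messages = p.apply(
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/mysql-inserts")); // placeholders

        // ... parse each message, build a Datastore Entity, and write with DatastoreIO.v1().write() ...

        p.run();
      }
    }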

Related

SSIS restart/re-trigger package automatically

Good day!
I have an SSIS package that retrieves data from a database and exports it to a flat file (simple process). The issue I am having is that the data my package retrieves each morning depends on a separate process loading the data into a table before my package retrieves it.
Now, the process which initially loads the data does insert metadata into a table showing the start and end date/time. I would like to set up something in my package that checks the metadata table for an end date/time for the current date. If the current date exists, then the process continues... If no date/time exists then the process stops (here is the kicker) BUT the package re-triggers itself automatically an hour later to check whether the initial data load is complete.
I have done research on checkpoints, etc., but all that seems to cover is the package picking up where it left off when it is restarted after a failure. I don't want to manually re-trigger the process; I'd like it to check the metadata and restart itself if possible. I could even put in logic so that if it checks the metadata 3 times it stops completely.
Thanks so much for your help
What you want isn't possible exactly the way you describe it. When a package finishes running, it's inert. It can't re-trigger itself; something has to re-trigger it.
That doesn't mean you have to do it manually. The way I would handle this is to have a SQL Server Agent job scheduled to run every hour for X number of hours a day. The job would call the package every time, and the metadata would tell the package whether it needs to do anything or just do nothing.
There would be a couple of ways to handle this.
They all start by setting up the initial check, just as you've outlined above. See if the data you need exists. Based on that, set a boolean variable (I'll call it DataExists) to TRUE if your data is there or FALSE if it isn't. Or 1 or 0, or what have you.
Create two Precedence Constraints coming off that task: one that requires DataExists==TRUE and, obviously enough, another that requires DataExists==FALSE.
The TRUE path is your happy path. The package executes your code.
On the FALSE path, you have options.
Personally, I'd have the FALSE path lead to a forced failure of the package. From there, I'd set up the job scheduler to wait an hour, then try again. BUT I'd also set a limit on the retries. After X retries, go ahead and actually raise an error. That way, you'll get a heads up if your data never actually lands in your table.
If you don't want to (or can't) get that level of assistance from your scheduler, you could mimic the functionality in SSIS, but it's not without risk.
On your FALSE path, trigger an Execute SQL Task with a simple WAITFOR DELAY '01:00:00.00' command in it, then have that task loop back to the initial check when it's done waiting. This will consume a thread on your SQL Server and could end up getting dropped by the SQL engine if it gets thread-starved.
Going the second route, I'd set up another Iteration variable, increment it with each try, and set a limit in the precedence constraint to, again, raise an actual error if your data doesn't show up within a reasonable number of attempts.
Thanks so much for your help! With some additional research I found the following article, which I was able to reference to create a solution for my needs. Although my process doesn't rely on a failure to attempt a retry, I set the process to force-fail after 3 attempts.
http://microsoft-ssis.blogspot.com/2014/06/retry-task-on-failure.html
Much appreciated
Best wishes

SSIS Transfer Object Task Timeout

I can see I'm not the only person who's experienced an issue with the SSIS Transfer Database Objects Task and timeouts; however, using it for the extract phase of an ETL must be fairly common, so I'm trying to establish the usual/accepted way to do this.
I have a web application that uses Entity Framework to generate ~250 tables, some of which occasionally have schema updates.
The bulk of the transform and load portion of our ETL is handled by a series of stored procedures, however, these read from a copy of the application's tables that are initially loaded in the Transfer Database Objects task.
Initially, we set up an SSIS package that simply ran the Transfer Database Objects task, and then kicked off the stored proc. That meant that the job was fairly resilient to change, and the only changes required were changes to the stored proc, if and when a schema update affected the tables that were used therein.
Unfortunately, as one of our application instances has grown over time, the Transfer Database Objects task is reaching the point where I'm regularly seeing Timeout errors. Those don't appear to be connection timeouts, or anything I can control on the server side, and from what I can see, I can't amend the CommandTimeout on the underlying SMO stuff within that Task.
I can see that some people manually craft their extract, running a separate Data Flow task to pull the information from each table, which has the obvious bonus that these can run in parallel. In my case, however, that means an initial chunk of work to craft 250-ish of these, plus a maintenance task whenever the schema changes on the source database, no matter how minor.
I've come across Biml, which looked like a possible way to at least ease that overhead; however, it doesn't appear it can run on VS2017 yet.
Does anyone have any particular patterns they follow for this, or if I do need individual data flow tasks, is there some way to automate the schema update, perhaps using some kind of SSIS automation and something from the entity framework?
It turns out the easiest way around this is to write a clone of the Transfer task, but with appropriate additions to allow more control over batching and timeouts etc. Details are available in this article: https://blogs.msdn.microsoft.com/mattm/2007/04/18/roll-your-own-transfer-sql-server-objects-task/

Azure - Trigger which copies multiple blobs

I am currently working on a Service Bus trigger (using C#) which copies and moves related blobs to another blob storage account and Azure Data Lake. After copying, the function has to emit a notification to trigger further processing tasks. Therefore, I need to know when the copy/move task has finished.
My first approach was to use an Azure Function which copies all these files. However, Azure Functions have a processing time limit of 10 minutes (when manually set), and therefore it seems not to be the right solution. I was considering calling AzCopy or StartCopyAsync() to perform an asynchronous copy, but as far as I understand, the processing time of the function will be as long as AzCopy takes. To solve the time limit problem, I could use WebJobs instead, but there are also other technologies like Logic Apps, Durable Functions, Batch jobs, etc., which makes it hard to choose the right technology for this problem. The function won't be called every second but might copy large amounts of data. Does anybody have an idea?
I just found out that Azure Functions only have a time limit when using the Consumption plan. If there is no better solution for blob copy tasks, I'll go for Azure Functions.

Backing up DynamoDB tables via Data Pipeline vs. manually creating a JSON for DynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses/works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a Data Pipeline, which I'm guessing you can schedule to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus creating something like a PowerShell script scheduled to run on an EC2 instance (which already exists) that would manually create a JSON mapping file and upload it to S3.
Also, I guess another question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set the termination protection.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup - and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample showing how to export data from DynamoDB.
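If you do decide to go the manual route instead, the export itself is only a few dozen lines. A rough sketch with the AWS SDK for Java (shown in Java rather than PowerShell purely for illustration; the table name, bucket, and key are placeholders) could look like this:

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.document.ItemUtils;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.ScanRequest;
    import com.amazonaws.services.dynamodbv2.model.ScanResult;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import java.util.Map;

    public class DynamoTableToS3 {
      public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        StringBuilder dump = new StringBuilder();
        Map<String, AttributeValue> lastKey = null;
        do {
          // Scan one page at a time; LastEvaluatedKey drives pagination for bigger tables.
          ScanResult page = dynamo.scan(new ScanRequest()
              .withTableName("my-table")            // placeholder table name
              .withExclusiveStartKey(lastKey));
          for (Map<String, AttributeValue> item : page.getItems()) {
            // One JSON document per line, converted via the SDK's document helpers.
            dump.append(ItemUtils.toItem(item).toJSON()).append('\n');
          }
          lastKey = page.getLastEvaluatedKey();
        } while (lastKey != null);

        // Upload the whole dump as a single S3 object (fine for small tables).
        s3.putObject("my-backup-bucket", "backups/my-table.json", dump.toString()); // placeholders
      }
    }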
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.

How to make this SSIS scenario more parallel

I have a million rows in a database table. For each row I have to run a custom EXE, parse the output, and update another database table.
How can I process multiple rows in parallel?
I now have a simple data flow task -> GetData -> Run Script (Run Process, Parse Output) -> Store Data.
For 6,000 rows it took 3 hours. Way too much.
There is a single bottleneck here: running the process for each row. Increasing "EngineThreads" would not help at all, as there will be only one thread running this particular script transform anyway. The time spent in other transforms probably does not matter at all. Processes are heavyweight objects, and running thousands of them will never be cheap.
I can think of the following ideas to make it better:
1) The best way to fix it is to convert your custom EXE into an assembly and call it from the script transform - to avoid the overhead of creating processes, parsing the output etc.
2) If you have to use the separate processes, you can try to run these processes in parallel. It will help if the process mostly waits for some input/output (i.e. it is I/O bound). If the processes are memory bound or CPU bound, you would not win much by running them in parallel.
2A) Complex script, simple package.
To run them in parallel, modify the ProcessInput method in your script to start the process asynchronously, and don't wait for the process to complete - move to the next row and create the next process. Subscribe to the process output and the process's Exited event, so you know when it has finished. Limit the number of processes running in parallel - otherwise you'll run out of memory. Wait until all the processes are done before returning from the ProcessInput call.
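To illustrate that bounded-parallelism pattern, here is a rough sketch (in Java rather than the C# you'd actually use in an SSIS script component; "custom.exe" and the row list are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Semaphore;

    public class BoundedParallelProcesses {
      private static final int MAX_PARALLEL = 8;            // cap concurrency to avoid exhausting memory
      private static final Semaphore slots = new Semaphore(MAX_PARALLEL);

      public static void main(String[] args) throws Exception {
        List<String> rows = List.of("row1", "row2", "row3"); // stand-in for rows arriving in ProcessInput
        List<CompletableFuture<Void>> pending = new ArrayList<>();

        for (String row : rows) {
          slots.acquire();                                      // block when MAX_PARALLEL processes are running
          Process proc = new ProcessBuilder("custom.exe", row)  // placeholder executable and argument
              .redirectErrorStream(true)
              .start();
          // onExit() completes when the process finishes, so the loop never waits on a single row.
          pending.add(proc.onExit().thenAccept(finished -> {
            // Parse finished.getInputStream() here and stage the row update
            // (for chatty executables, consume the stream while the process runs instead).
            slots.release();
          }));
        }

        // Equivalent of the final wait before leaving ProcessInput: let every outstanding process finish.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
      }
    }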
2B) Simple script, complex package.
Keep the current sequential script, but partition the data using SSIS. Add a Conditional Split transform and split the input stream into multiple streams, based on some hash expression - something that will make each output receive approximately the same amount of data. The number of streams equals the number of process instances you want to run in parallel. Add your script transform to each output of the Conditional Split. Now you should also increase the "EngineThreads" property :) and these transforms will run in parallel. (Note: based on the tag, I assume you use SSIS 2008. You'll need to insert additional Union All transforms to make it work in SSIS 2005.)
This should make it perform better, but millions of processes is a lot. You'll hardly get really good performance here.
If you are executing this process using the Data Flow container, then there is a property on it called "EngineThreads" which defaults to a value of 5. You can set it to a higher number like 20, which will devote more threads to processing those rows.
That is just a performance tweak or optimisation; if your SSIS package is still running really slowly then I would perhaps address the architecture and design of your package.