Spring Batch deployed on OpenShift using several pods

I deploy an application on Openshift and I use at least 2 pods.
My war contains a Spring Batch application, scheduled by a Spring cron.
Of course, each pod starts the same batch at the same time, and that is my problem/question.
Is there a way to avoid this behaviour? I would like to start only one batch instance (or is there a way to configure Spring Batch to check whether a batch is already running?).
Thanks in advance.

Assuming you use Deployment, it's not trivial, but here are some ideas that can help you.
Use ScheduledJobs/CronJobs from Kubernetes. This means you would ditch controlling the batch launch from your app completely and have a dedicated pod launched to perform the batch job and die
Use master elector sidecar for establishing the right to exec batch (https://github.com/kubernetes/contrib/tree/master/election)
Implement some locking mechanism on your own
Use a StatefulSet and bind the batch to run only on a particular hostname (i.e. by a config var passed to the Pods like BATCH_HOSTNAME). StatefulSets have deterministic names, so you could say that the batch should run only on my-pods-0 (see the sketch below)
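A rough sketch of that last idea, assuming a StatefulSet, the HOSTNAME variable that Kubernetes injects into pods, and a hypothetical BATCH_HOSTNAME config var (the class name and cron expression are illustrative only; scheduling is assumed to be enabled already, since the app uses a Spring cron):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class HostnameGatedBatchTrigger {

    // Pod name, typically injected by Kubernetes/OpenShift (StatefulSet pods get deterministic names like my-pods-0).
    private final String podName = System.getenv("HOSTNAME");

    // The pod allowed to run the batch, passed as configuration, e.g. BATCH_HOSTNAME=my-pods-0 (hypothetical variable).
    private final String batchHostname = System.getenv("BATCH_HOSTNAME");

    @Scheduled(cron = "0 0 2 * * *") // hypothetical schedule
    public void runBatchOnlyOnDesignatedPod() {
        if (podName == null || !podName.equals(batchHostname)) {
            return; // this pod is not the designated batch pod, skip the cron tick
        }
        // launch the Spring Batch job here (e.g. via JobLauncher)
    }
}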

It sounds like you need leader election in your situation. Spring Integration provides leader election functionality you can use to determine who is the master. That master would be the one that actually launches the jobs; the others would just ignore the scheduled event. You can read more about Spring Integration's leader election in the documentation here: https://docs.spring.io/spring-integration/api/org/springframework/integration/support/leader/LockRegistryLeaderInitiator.html
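A minimal sketch of that approach (not the poster's code), assuming a JDBC-backed LockRegistry shared by all pods via the existing DataSource; the bean wiring, cron expression and job are illustrative, and the Spring Integration JDBC schema (INT_LOCK table) is assumed to be in place:

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.jdbc.lock.DefaultLockRepository;
import org.springframework.integration.jdbc.lock.JdbcLockRegistry;
import org.springframework.integration.support.leader.LockRegistryLeaderInitiator;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;

@Configuration
@EnableScheduling
public class LeaderOnlyBatchConfig {

    // Lock repository/registry stored in the shared database, so all pods compete for the same lock.
    @Bean
    public DefaultLockRepository lockRepository(DataSource dataSource) {
        return new DefaultLockRepository(dataSource);
    }

    @Bean
    public JdbcLockRegistry lockRegistry(DefaultLockRepository lockRepository) {
        return new JdbcLockRegistry(lockRepository);
    }

    // Whichever pod currently holds the lock is the leader; leadership moves if that pod dies.
    @Bean
    public LockRegistryLeaderInitiator leaderInitiator(JdbcLockRegistry lockRegistry) {
        return new LockRegistryLeaderInitiator(lockRegistry);
    }

    @Bean
    public ScheduledLauncher scheduledLauncher(LockRegistryLeaderInitiator leaderInitiator,
                                               JobLauncher jobLauncher, Job job) {
        return new ScheduledLauncher(leaderInitiator, jobLauncher, job);
    }

    public static class ScheduledLauncher {

        private final LockRegistryLeaderInitiator leaderInitiator;
        private final JobLauncher jobLauncher;
        private final Job job;

        public ScheduledLauncher(LockRegistryLeaderInitiator leaderInitiator,
                                 JobLauncher jobLauncher, Job job) {
            this.leaderInitiator = leaderInitiator;
            this.jobLauncher = jobLauncher;
            this.job = job;
        }

        @Scheduled(cron = "0 0 2 * * *") // hypothetical schedule
        public void launchIfLeader() throws Exception {
            if (!leaderInitiator.getContext().isLeader()) {
                return; // not the leader: ignore the scheduled event
            }
            // unique parameter so each scheduled run creates a new job instance
            jobLauncher.run(job, new JobParametersBuilder()
                    .addLong("run.ts", System.currentTimeMillis())
                    .toJobParameters());
        }
    }
}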

Related

How to call an Informatica workflow which is running in a different integration service

I have 2 workflows: workflow 1 in integration service 1 and workflow 2 in integration service 2.
How do I call workflow 2 from workflow 1? I am currently trying to call it using the command prompt, but it didn't work.
Just to let you know, integration service 1 is Informatica 9.2 and integration service 2 is Informatica version 10.2.
PowerCenter does not provide support for cross-workflow dependencies, regardless of whether these are configured to use the same or a different Integration Service.
The best way to solve this kind of challenge is to use a separate scheduling tool, such as Airflow, Control-M, Autosys - or any other.
It is also possible to expose the workflow as a web service and call it from different workflows, if needed. Not really convenient, but possible.
Lastly, it's possible to use the command-line interface pmcmd startworkflow in a Command task of one workflow to have the other one started.
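For reference, such a pmcmd invocation from a Command task might look roughly like the following (the service, domain, user, password, folder and workflow names are placeholders):
pmcmd startworkflow -sv IS_SERVICE2 -d Domain2 -u admin -p password -f Folder2 -wait wf_workflow2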
I have done something similar this way:
The other WF is a web service one, or is executed along with a web service.
Add an application connection.
The Web Services Hub (WSH) where your WF runs should be the endpoint of that connection.
Add this WF inside the mapping of the other one as a Web Service transformation.

Should I use k8s statefulsets directly or mysql-operator to deploy master-slave mysql cluster?

So I want to deploy a master-slave MySQL cluster in k8s. I found 2 ways that seem popular:
The first one is to use StatefulSets directly, following the official k8s document: https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/
The second one is to use an operator, e.g. https://github.com/oracle/mysql-operator
Which way is most commonly used?
Also, with StatefulSets, if my MySQL master dies, will k8s automatically promote a slave to be the master?
Lastly, when my backend app performs an operation (CRUD) against the MySQL cluster, how does k8s know which pod to route it to, i.e. write operations can only be sent to the master while reads can be sent to all?
Users can deploy and maintain a set of highly available MySQL services in k8s based on StatefulSets, but the process is relatively complex. It requires users to familiarize themselves with various k8s resource objects, learn many MySQL operational details, and maintain a set of complex management scripts. Kubernetes Operators are designed to reduce the threshold for deploying complex applications on k8s.
An Operator hides the orchestration details of complex applications and greatly reduces the threshold for using them in k8s. If you need to deploy other complex applications, we recommend that you use an Operator.
Speaking about master election while using a StatefulSet:
Electing a potential slave to be the master is not an automatic process - you have to configure this manually using XtraBackup - here is more information - setting_up_replication.
Take a look: cloning-existing-data, starting-replication, mysql-statefulset-operator.
Useful tools: vitess for better MySQL networking management and percona-xtradb-cluster that provides superior performance, scalability and instrumentation.

Cloud SQL instance 2nd Generation ALTERNATIVE activation policy "ON DEMAND"

I have a problem with Cloud SQL billing.
My Cloud SQL instance has used all 720 hours of machine running time (db-g1-small; recently changed from db-n1-standard-1).
I've found, according to the Cloud SQL documentation, that
For Second Generation instances, the activation policy is used only to start or stop the instance.
So without the ON_DEMAND policy of the First Generation, how can I reduce these costs on my Cloud SQL instance?
PS. It looks like my cloud server does not automatically shut down because it keeps 4 sleeping connections.
Indeed, for Second Generation instances of Cloud SQL, the only activation policies available are ALWAYS and NEVER, so it's no longer possible to leave that kind of instance handling entirely in Cloud SQL's hands.
However, you can work around this by running a cron job that turns the instance on/off on a fixed schedule. E.g.: you can run a cron job that shuts the instance down on Friday night and turns it back on on Monday morning.
You can use the following command to do so:
gcloud sql instances patch [INSTANCE_NAME] --activation-policy [ACTIVATION_POLICY_VALUE]
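For example, a crontab along these lines (the instance name and schedule are placeholders) would stop the instance on Friday night and start it again on Monday morning:
# Stop the instance on Friday at 22:00
0 22 * * 5 gcloud sql instances patch my-instance --activation-policy NEVER
# Start it again on Monday at 06:00
0 6 * * 1 gcloud sql instances patch my-instance --activation-policy ALWAYS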
Moreover, you can create a feature request on Google Cloud's Public Issue Tracker to re-include that functionality in Cloud SQL in the future, but there are no guarantees that this will happen.

Google Cloud Composer - Create Environment - with a few compute engine instances - That is expensive

I am new to Google Cloud Composer and am following the quickstart instructions: create the environment, load the DAGs, check Airflow, and delete the environment.
But in a (real-life) production use case, after we finish loading the DAG files and running them in the environment, should we delete the Google Cloud Composer environment? Because there might be several compute instances in that Composer environment doing nothing now, and that is expensive.
But if I delete the environment, then I would lose access to its Airflow web portal, and I could not check the logs of the runs processed on the deleted environment.
So what should I do? In a real-life production case, should I delete the environment or not after the processing is done?
Apache Airflow (and therefore Cloud Composer) is for orchestrating workflows, not for ETL batch jobs that only require transient compute resources. Similarly to how you wouldn't turn a server off just because a scheduled cron task isn't running, Composer environments are meant to be long-running compute resources that are always online, such that you can schedule repeating workflows whenever necessary (whether that be per second, daily, etc.).
In a real production case, a Composer environment should always be left running, or no DAGs will be scheduled when it is down. If you have a development environment and wish to save money, then you can resize the Composer environment's attached GKE cluster to 0 nodes so you won't be billed for them. Similarly, if you don't think you're running enough DAGs to justify the cost, consider smaller worker machine sizes.
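For instance, scaling the environment's attached GKE node pool down to zero and back up again can be done with something along these lines (the cluster name, zone and node count are placeholders; look up the actual cluster name in the environment details):
gcloud container clusters resize my-composer-gke-cluster --num-nodes=0 --zone us-central1-a
gcloud container clusters resize my-composer-gke-cluster --num-nodes=3 --zone us-central1-a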

rsync mechanism in wso2 all in one active-active

I am deploying an active-active all-in-one setup on 2 separate servers with wso2-am 2.6.0 and wso2 analytics 2.6.0. I am configuring my servers based on this link. Regarding parts 4 and 5 about the rsync mechanism, I have some questions:
1. How can I figure out whether my server is working with rsync or sync?
2. What will happen in the future if I don't use rsync now and also don't use the configuration from parts 4 and 5?
1. How can I figure out whether my server is working with rsync or sync?
It is not really clear what you are asking for. rsync is just a command to synchronize files between folders.
What is rsync used for: when deploying an API, the gateway creates or updates a few Synapse sequences or APIs in the filesystem (repository/deployment/server), and these file updates need to be synchronized to all gateway nodes.
I personally don't advise using rsync; the whole issue is that you need to invoke the rsync command regularly to synchronize the files created by the master node. That creates a certain delay for service availability and, most importantly, if something goes wrong and you want to use another node as the master, you need to switch the rsync direction, which is not really an automated process.
We usually keep it simple by using a shared filesystem (NFS, Gluster, ...) and then we have a fully active-active setup (OK, setting up HA NFS or GlusterFS is not particularly simple, but that's usually the job of the infra guys).
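For completeness, if you do go the rsync route, the synchronization is typically a cron entry on the current master gateway pushing the deployment directory to the other node, roughly like this (host, user, paths and schedule are placeholders):
*/1 * * * * rsync -avz --delete /opt/wso2am-2.6.0/repository/deployment/server/ wso2user@gw-node2:/opt/wso2am-2.6.0/repository/deployment/server/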
2. What will happen in the future if I don't use rsync now and also don't use the configuration from parts 4 and 5?
If the filesystems between the gateways are not synced or shared, you deploy an API from the Publisher to a single gateway node, but the other gateway nodes won't create the Synapse sequences and API artefacts. As a result, the other nodes won't pass client requests to the backend.