Is there a 'max-retries' for Kubernetes Jobs?

I have batch jobs that I want to run on Kubernetes. The way I understand Jobs:
If I choose restartPolicy: Never, it means that if the Job fails, it will destroy the Pod and reschedule onto (potentially) another node. If I choose restartPolicy: OnFailure, it will restart the container in the existing Pod. I'd consider a certain number of failures unrecoverable. Is there a way I can prevent it from rescheduling or restarting after a certain period of time and clean up the unrecoverable Jobs?
My current thought for a workaround to this is to have some watchdog process that looks at retryTimes and cleans up Jobs after a specified number of retries.

Summary of slack discussion:
No, there is no retry limit. However, you can set a deadline on the job as of v1.2 with activeDeadlineSeconds. The system should back off restarts and then terminate the job when it hits the deadline.

FYI, this has now been added as .spec.backoffLimit.
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
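As a rough illustration of the two settings mentioned above, here is a minimal Python sketch that builds a Job manifest with both .spec.backoffLimit and .spec.activeDeadlineSeconds set. The helper name, the Job name, the busybox image, and the numeric values are all placeholders chosen for the example, not anything from the original discussion.

```python
import json


def build_job_manifest(name, image, command,
                       backoff_limit=4, active_deadline_seconds=600):
    """Return a batch/v1 Job manifest as a plain dict.

    backoff_limit: number of retries before the Job is marked failed.
    active_deadline_seconds: wall-clock limit for the whole Job.
    All argument values used below are illustrative assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": backoff_limit,
            "activeDeadlineSeconds": active_deadline_seconds,
            "template": {
                "spec": {
                    # Never: a failed Pod is not restarted in place;
                    # the Job controller creates a replacement Pod.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job that always fails, to exercise the retry limit.
    manifest = build_job_manifest("batch-example", "busybox",
                                  ["sh", "-c", "exit 1"])
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the output could be piped straight in, for example: python job.py | kubectl apply -f -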

Related

Can't make kubernetes example for wordpress & mysql with persistent data work

I followed this kubernetes example to create a wordpress and mysql with persistent data
I followed everything from the tutorial from creation of the disk to deployment and on the first try deletion as well
1st try
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-25-33.png
Problem: the persistent volumes do not bind to the persistent volume claims. The pod and the volume claim both remain in Pending status, and the volumes remain in the Released state.
I had to delete everything as described in the example and try again. This time I mounted the created volumes to an instance in the cluster, formatted the disks with an ext4 filesystem, then unmounted the disks.
2nd try
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-26-21.png
Problem: After formatting the volumes, they are now bound to the claims, yay! Unfortunately the mysql pod doesn't run and shows the status CrashLoopBackOff. Eventually the wordpress pod crashed as well.
https://s3-ap-southeast-2.amazonaws.com/dorward/2017/04/git-cmd_2017-04-03_08-27-22.png
Did anyone else experience this? I'm wondering if I did something wrong or if something has changed between the write-up of the example and now that made the example break. How do I go about fixing it?
Any help is appreciated.

Get logs for pods:
kubectl logs pod-name
If the logs indicate that the pods are not even starting (CrashLoopBackOff), investigate the events in k8s:
kubectl get events
The event log indicates the node running out of memory (OOM):
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1m 7d 1555 gke-hostgeniuscom-au-default-pool-xxxh Node Warning SystemOOM {kubelet gke-hostgeniuscom-au-default-pool-xxxxxf-qmjh} System OOM encountered
Trying a larger instance size should solve the issue.

Rolling Update with Kubernetes Deployment without increasing the cluster size

I have a cluster that can only run one Pod per node due to our configuration (sometimes Kubernetes will randomly run two on one node but w/e). Any time I have to update my Deployment, which causes a Rolling Update, Kubernetes will simply never finish the update.
The reason for this appears to be that there isn't enough room in the nodes to deploy the new pods from the rolling update.
Now, some of you may say that I could simply increase the cluster size every time I want to perform an update. The problem with that approach is that I have enabled autoscaling on the cluster and the Deployment replicas is set high so that Kubernetes automatically scales with the cluster. This means I can't change the cluster size to accommodate the Rolling Update.
How can I perform a Rolling Update with this configuration?

Can you set maxSurge to 0 and maxUnavailable to some positive value?
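To make that suggestion concrete, here is a small Python sketch that builds a patch body for the Deployment's update strategy with maxSurge set to 0. The helper name and the default of 1 for maxUnavailable are assumptions for illustration. With maxSurge at 0, the rollout removes an old Pod before creating its replacement, so no extra node capacity is needed, at the cost of temporarily running fewer replicas.

```python
import json


def rolling_update_patch(max_unavailable=1):
    """Build a patch that sets maxSurge=0 and a positive maxUnavailable.

    max_unavailable may be an int (Pod count) or a percentage string
    such as "25%". Kubernetes rejects a strategy where both values are
    zero, so that case is refused here as well.
    """
    if max_unavailable in (0, "0", "0%"):
        raise ValueError("maxUnavailable must be positive when maxSurge is 0")
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": 0,
                    "maxUnavailable": max_unavailable,
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(rolling_update_patch(1)))
```

The printed JSON could be applied with something like kubectl patch deployment my-deployment -p "$(python patch.py)", where my-deployment and patch.py are hypothetical names.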

Gearman and lost workers

I want to use gearman as a queue system for a web project. I therefore tried gearman and GearmanManager, which work great. But now I ask myself what happens if a worker, job or server goes away (for example because of a PHP error, or whatever the reason might be).
What happens if I run a synchronous job where the client (browser) waits for a callback and the worker/server crashes?
What happens if an asynchronous job crashes while it is being processed? Will gearman try it again (because it is stored in a queue in MySQL) or will gearman ignore it? And what happens to it if it is not ignored and the worker crashes over and over again?
I would also like to find out whether the system is scalable (for example via NFS). Do you have examples or experiences that could help me move forward?
Thank you all and have a very nice weekend...
Phil

If the server has crashed, your client won't have anything to connect to. How you handle that is up to your application. If the worker has crashed, and you're attempting to run a synchronous task, the client will wait until a worker becomes available, unless you've provided a timeout value.
If the async task is being performed and the worker crashes, gearman will requeue the task at hand. If it crashes each worker that grabs it, you might be left without any active workers. The C-version / gearmand has an option to tweak this behavior:
-j [ --job-retries ] arg (=0) Number of attempts to run the job
before the job server removes it. This
is helpful to ensure a bad job does not
crash all available workers. Default is
no limit.
If you deploy broken or fragile code you will have issues, regardless of which system you're using. Bad code breaks applications. :-)
A Gearman setup is scalable, but remember that if you use the persistent queue support, each server needs to have its own backend. You can't share the same database and tables across server instances, since the persistent queue is only maintained as a backup in case gearmand dies and has to be restarted. It might be more useful to allow your application to easily queue tasks again if needed.

Reconfigure and reboot a Hudson/Jenkins slave as part of a build

I have a Jenkins (Hudson) server setup that runs tests on a variety of slave machines. What I want to do is reconfigure the slave (using remote APIs), reboot the slave so that the changes take effect, then continue with the rest of the test. There are two hurdles that I've encountered so far:
1) Once a Jenkins job begins to run on the slave, the slave cannot go down or break the network connection to the server, otherwise Jenkins immediately fails the test. Normally, I would say this is completely desirable behavior. But in this case, I would like Jenkins to accept the disruption until the slave comes back online and Jenkins can reconnect to it - or the slave reconnects to Jenkins.
2) In a job that has been attached to the slave, I need to run some build tasks on the Jenkins master - not on the slave.
Is this possible? So far, I haven't found a way to do this using Jenkins or any of its plugins.
EDIT - Further Explanation
I really, really like the Jenkins slave architecture. Combined with the plugins already available, it makes it very easy to get jobs to a slave, run, and the results pulled back. And the ability to pick any matching slave allows for automatic job/test distribution.
In our situation, we use virtualized (VMware) slave machines. It was easy enough to write a script that would cause Jenkins to use VMware PowerCLI to start the VM up when it needed to run on a slave, then ship the job to it and pull the results back. All good.
EXCEPT part of the setup of each test is to slightly reconfigure the virtual machine in some fashion. Disable UAC, log on as a different user, have a different driver installed, etc - each of these changes requires that the test VM/slave be rebooted before the changes take effect. Although I can write slave on-demand scripts (Launch Method=Launch slave via execution of command on the master) that handle this reconfig and restart, it has to be done BEFORE the job is run. That's where the problem occurs - I cannot configure the slave that early because the type of configuration changes depends on the job being run, which occurs only after the slave is started.
Possible Solutions
1) Use multiple slave instances on a single VM. This wouldn't work - several of the configurations are mutually exclusive, but Jenkins doesn't know that. So it would try to start one slave configuration for one job, another slave for a different job - and both slaves would be on the same VM. Locks on the jobs don't prevent this since slave starting isn't part of the job.
2) (Optimal) A build step that allows a job to know that its slave connection MIGHT be disrupted. The build step may have to include some options so that Jenkins knows how to reconnect the slave (will the slave reconnect automatically, will Jenkins have to run a script, will simple SSH suffice). The build step would handle the disconnect of the slave, ignore the usually job-failing disconnect, then perform the reconnect. Once the slave is back up and running, the next build step can occur. Perhaps a timeout to fail the job if the slave isn't reconnectable in a certain amount of time.
** Current Solution ** - less than optimal
Right now, I can't use the slave function of Jenkins. Instead, I use a series of build steps - run on the master - that use Windows and PowerShell scripts to power on the VM, make the configuration changes, and restart it. The VM has an SSH server running on it and I use that to upload test files to the test VM, then execute them remotely. Then I download the results back to Jenkins for handling by the job. This solution is functional - but a lot more work than the typical Jenkins slave approach. Also, the scripts are targeted towards a single VM; I can't easily use a pool of slaves.

Not sure if this will work for you, but you might try making the Jenkins agent node programmatically tell the master node that it's offline.
I had a situation where I needed to make a Jenkins job that performs these steps (all while running on the master node):
1) revert the Jenkins agent node VM to a powered-off snapshot
2) tell the master that the agent node is disconnected (since the master does not seem to automatically notice the agent is down, whenever I revert or hard power off my VMs)
3) power the agent node VM back on
4) as a "Post-build action", launch a separate job restricted to run on the agent node VM
I perform the agent disconnect step with a curl POST request, but there might be a cleaner way to do it:
curl -d "offlineMessage=&json=%7B%22offlineMessage%22%3A+%22%22%7D&Submit=Yes" http://JENKINS_HOST/computer/THE_NODE_TO_DISCONNECT/doDisconnect
Then when I boot the agent node, the agent launches and automatically connects, and the master notices the agent is back online (and will then send it jobs).
I was also able to toggle a node's availability on and off with this command (using 'toggleOffline' instead of 'doDisconnect'):
curl -d "offlineMessage=back_in_a_moment&json=%7B%22offlineMessage%22%3A+%22back_in_a_moment%22%7D&Submit=Mark+this+node+temporarily+offline" http://JENKINS_HOST/computer/NODE_TO_DISCONNECT/toggleOffline
(Running the same command again puts the node status back to normal.)
The above may not apply to you since it sounds like you want to do everything from one jenkins job running on the agent node. And I'm not sure what happens if an agent node disconnects or marks itself offline in the middle of running a job. :)
Still, you might poke around in this Remote Access API doc a bit to see what else is possible with this kind of approach.
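For anyone who would rather script the doDisconnect call than shell out to curl, here is a rough Python equivalent that only builds the request. The function name and the example host and node names are made up for illustration. It assumes the same unauthenticated form POST as the curl command above; a Jenkins instance with security enabled would typically also need credentials and a CSRF crumb, which this sketch does not handle.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_disconnect_request(jenkins_base_url, node_name, message=""):
    """Build (but do not send) the POST that disconnects an agent node.

    Mirrors the form fields of the curl command: offlineMessage,
    a JSON copy of the same message, and Submit=Yes.
    """
    form = urlencode({
        "offlineMessage": message,
        "json": json.dumps({"offlineMessage": message}),
        "Submit": "Yes",
    })
    url = "{}/computer/{}/doDisconnect".format(
        jenkins_base_url.rstrip("/"), quote(node_name, safe=""))
    return Request(url, data=form.encode("ascii"), method="POST")


if __name__ == "__main__":
    # Hypothetical host and node name.
    req = build_disconnect_request("http://jenkins.example", "test-vm-1")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
```

Passing the returned object to urllib.request.urlopen would actually send it.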

Very easy: create a master job that runs on the master, and from that job call the client job as a build step (it's a new kind of build step and I love it). Make sure the option for the master job to wait for the client job to finish is checked. Then you can run your script to reconfigure your client and run the second test on the client.
An even better strategy is to have two nodes running on each of your slave machines, which means configuring two nodes in Jenkins. I used that strategy successfully with a Unix slave. The reason was that I needed different environment variables to be set up and I didn't want to push that into the jobs. I used SSH clients, so I don't know if it is possible with different client types. Then you might be able to run both tests at the same time, chain the jobs, or use the master strategy mentioned above.

How can I guarantee the w3wp process exists before turning on perfmon logging?

I have a batch script I run before our performance tests that does some pre-test setup on our server; it clears log files, starts the proper services, restores the database, sets some app settings and turns on perfmon logging.
My problem: the w3wp process we need to monitor is not always present at the time we turn on perfmon logging. It's pretty much hit-or-miss whether this process is in the log. The test takes anywhere from 4 to 18 hours to complete, and I don't know until the test is done whether or not w3wp was monitored (it doesn't seem that perfmon detects new processes even though my log file is configured to monitor Process(*)), which ends up wasting a lot of time.
Is there a way to force w3wp to get loaded? Is there some command I can call just prior to starting the perfmon logs?
Or, is it possible to configure the perfmon log to monitor processes that may not exist at the time the log is started?

If you install the IIS Admin tools, you can call a command line app called TinyGet. You can pass in any page on your webserver to initialize it. This would start up the process so you can capture it.
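The same warm-up idea works with any HTTP client, not only TinyGet. Below is a small Python sketch that requests a page until it gets a successful response, so it could run just before the perfmon logs are started. The function name, the retry count, the delay, and the localhost URL are illustrative assumptions; the fetch parameter exists only so the logic can be exercised without a real server.

```python
import time
from urllib.request import urlopen


def warm_up(url, attempts=5, delay_seconds=2.0, fetch=None):
    """Request url until it answers successfully.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed. fetch(url) must return an HTTP status code; by
    default it performs a real request with urlopen.
    """
    if fetch is None:
        fetch = lambda u: urlopen(u, timeout=30).status
    for attempt in range(1, attempts + 1):
        try:
            status = fetch(url)
            if 200 <= status < 400:
                return attempt
        except OSError:
            # URLError and HTTPError are OSError subclasses: the site is
            # not answering yet (or returned an error), so try again.
            pass
        if attempt < attempts:
            time.sleep(delay_seconds)
    return None


if __name__ == "__main__":
    # Hypothetical page on the server under test.
    print(warm_up("http://localhost/default.aspx"))
```

In the batch script this would be called right before the step that turns on perfmon logging, and the script could abort if it returns None.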
