OpenShift Online issue: pod with persistent volume failed scheduling

I have a small webapp that ran fine on OpenShift Online for 9 months. It consists of a Python service and a PostgreSQL database (with, of course, a persistent volume).
All of a sudden, last Tuesday, the PostgreSQL pod stopped working, so I tried to redeploy the service. For almost 2 days now, pod scheduling has constantly failed. I have the following entry in the events log:
Failed Scheduling 0/110 nodes are available: 1 node(s) had disk pressure, 5 node(s) had taints that the pod didn't tolerate, 6 node(s) didn't match node selector, 98 node(s) exceed max volume count.
37 times in the last 13 minutes
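For reference, the same scheduling information can be pulled with the oc CLI; a minimal sketch (the pod name below is a placeholder for the stuck postgresql pod):
# Recent events in the current project, oldest first
oc get events --sort-by=.metadata.creationTimestamp
# Scheduling events and conditions for the stuck pod (placeholder name)
oc describe pod postgresql-1-xxxxx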
So it looks like a "disk full" issue at RH's datacenters, which should be easy to fix, but I don't see any notification of it on the status page (https://status.starter.openshift.com/).
My problem looks a lot like the one described for starter-us-west-1:
Investigating - Currently Openshift SRE team trying to resolve this incident. There are high chances that you will face difficulties having pods with attached volumes scheduled.
We're sorry for the inconvenience.
Yet I'm on starter-ca-central-1, which should not be affected. Since it's been going on for so long, I'm wondering if anyone at RH is aware of the issue, but I cannot find a way for users on a starter plan to contact them.
Is anybody facing the same issue on ca-central-1?

As mentioned by Graham in the comment, https://help.openshift.com/forms/community-contact.html is the way to go.
A few hours (12, actually) after posting my issue via this link, I got feedback from someone at RH saying that my request had been taken into account.
This morning, my app is finally up, and the trouble notice is on the status page:
Investigating - Currently Openshift SRE team trying to resolve this incident. There are high chances that you will face difficulties having pods with attached volumes scheduled.
We're sorry for the inconvenience.
Not sure what would have happened if I hadn't contacted them...

After at least 4 months of working normally, my app running on Starter US West 1 suddenly started getting the following error message during deployment:
0/106 nodes are available: 1 node(s) had disk pressure, 29 node(s) exceed max volume count, 3 node(s) were unschedulable, 4 node(s) had taints that the pod didn't tolerate, 6 node(s) didn't match node selector, 63 Insufficient cpu.
Nothing had changed in the settings before the failures started. I've realized the problem only occurs on deployments with a persistent volume, like PostgreSQL Persistent in my case.
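A quick way to confirm that only the volume-backed deployments are affected (a sketch; it assumes a logged-in oc client pointed at the project with the app):
# Pods that the scheduler has not been able to place
oc get pods --field-selector=status.phase=Pending
# Claims and the volumes they are bound to; a Pending or Lost claim points at the storage side
oc get pvc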
I've just submitted this issue via the URL mentioned above. When I get a response or a solution, I'll post it here.

Related

Elastic Beanstalk checking health too soon after launching instance

I'm having problems with the configuration of an Elastic Beanstalk environment. Almost immediately, within maybe 20 seconds of launching a new instance, it starts showing warning status and reporting that the health checks are failing with 500 errors.
I don't want it to even attempt a health check on the instance until it's been running for at least a couple of minutes. It's a Spring Boot application and needs more time to start.
I have an .ebextensions/autoscaling.config declared like so...
Resources:
  AWSEBAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 200
      DefaultInstanceWarmup: 200
      NewInstancesProtectedFromScaleIn: false
      TerminationPolicies:
        - OldestInstance
I thought the HealthCheckGracePeriod should do what I need, but it doesn't seem to help. EB immediately starts trying to get a healthy response from the instance.
Is there something else I need to do, to get EB to back off and leave the instance alone for a while until it's ready?
The HealthCheckGracePeriod is the correct approach. The service will not be considered unhealthy during the grace period. However, this does not stop the ELB from sending health checks at the defined (or default) health check interval. So you will still see failing health checks, but they won't cause the service to be considered "unhealthy".
There is no setting to prevent the health check requests from being sent at all during an initial period, but there should be no harm in the checks failing during the grace period.
You can make the HealthCheckIntervalSeconds longer, but that will apply to all health check intervals, not just during startup.
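If you do decide to stretch the interval, one way to inspect and change it is directly on the target group behind the environment's load balancer. This is only a sketch, assuming an ALB-backed environment; the ARN is a placeholder, and changes made outside the EB configuration may be reverted on the next environment update:
# List target groups with their current health check interval
aws elbv2 describe-target-groups --query "TargetGroups[].{Name:TargetGroupName,Arn:TargetGroupArn,Interval:HealthCheckIntervalSeconds}"
# Raise the interval and lower the healthy threshold on the environment's target group (placeholder ARN)
aws elbv2 modify-target-group \
    --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/PLACEHOLDER \
    --health-check-interval-seconds 60 \
    --healthy-threshold-count 2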

The zone does not have enough resources available to fulfill the request / the resource is not ready

I failed to start my instance (through the web console); it gave me the error:
"The zone 'projects/XXXXX/zones/europe-west2-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
At first I thought it might be a quota problem, but after checking my quota everything looked fine. I listed the available zones and europe-west2-c was listed, but I still gave moving the zone a shot. I tried "gcloud compute instances move XXXX --zone europe-west2-c --destination-zone europe-west2-c", however it still failed and popped up the error:
"ERROR: (gcloud.compute.instances.move) Instance cannot be moved while in state: TERMINATED"
Okay, terminated... then I tried to restart it with "gcloud compute instances reset XXX", and the error showed up as:
ERROR: (gcloud.compute.instances.reset) Could not fetch resource: - The resource 'projects/XXXXX/zones/europe-west2-c/instances/XXX' is not ready
I searched for the error; some people solved this problem by deleting the disk. Since I don't want to wipe the data on the disk, how can I solve this problem?
BTW, I only have one instance, with one persistent disk attached.
It's recommended to deploy and balance your workload across multiple zones or regions to reduce the likelihood of an outage, by building resilient and scalable architectures.
If you want an immediate solution, create a snapshot, then create an instance from the snapshot in a different zone or region.
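A rough sketch of that snapshot route with gcloud (the disk, snapshot and instance names are placeholders; pick whatever zone currently has capacity):
# Snapshot the boot disk that is stuck in the exhausted zone
gcloud compute disks snapshot my-disk --zone=europe-west2-c --snapshot-names=my-disk-snap
# Recreate the disk from the snapshot in another zone of the same region
gcloud compute disks create my-disk-b --source-snapshot=my-disk-snap --zone=europe-west2-b
# Boot a new instance from the restored disk
gcloud compute instances create my-instance-b --zone=europe-west2-b --disk=name=my-disk-b,boot=yes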
If you are still experiencing the same issue after migrating, I suggest contacting GCP support.

Geth stuck syncing on the last 80 blocks

On Windows 10, in my command prompt, I run
> geth --rinkeby
which starts syncing my node with the network.
In another command prompt, I run
> geth --rinkeby attach ipc:\\.\pipe\geth.ipc
and then
> eth.syncing
which gives
{
  currentBlock: 3500871,
  highestBlock: 3500955,
  knownStates: 25708160,
  pulledStates: 25680474,
  startingBlock: 3500738
}
As you can see, I am always about 80 blocks behind the highest block. I've heard this is normal for the testnet. I created an account on Rinkeby and requested ether via the faucet: https://faucet.rinkeby.io/. I also tried https://faucet.ropsten.be/ but couldn't get ether.
On the geth console, I can show my account which gives
> eth.accounts
["0x7bf0a466e7087c4d40211c0fa8aaf3011176b6c6"]
and viewing the balance I get:
eth.getBalance(eth.accounts[0])
I don't know if this is because my node is 80 blocks behind the highest block...?
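For what it's worth, eth.getBalance returns the balance in wei; it can be converted to ether in the same attach console (a sketch, using the same IPC pipe as above):
> geth --rinkeby attach ipc:\\.\pipe\geth.ipc
> web3.fromWei(eth.getBalance(eth.accounts[0]), "ether")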
Edit: It may be worth adding that I created a symbolic link from my AppData/Roaming/Ethereum folder on my C drive to another folder on my D drive, as I was running out of space. (I don't know if that affects my sync.)
I guess you are facing the problem known as "not syncing the last 65 blocks":
Q: I'm stuck at 64 blocks behind mainnet?!
A: As explained above, you are not stuck, just finished with the block download phase, waiting for the state download phase to complete too. This latter phase nowadays takes a lot longer than just getting the blocks.
For more information, see https://github.com/ethereum/mist/issues/3760#issuecomment-390892894
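You can watch that state download phase from the same attach console; eth.syncing keeps returning an object while it runs and switches to false once the node has fully caught up (a sketch, with explanatory comments added):
> eth.syncing                                          // still an object while the block/state download is running
> eth.syncing.knownStates - eth.syncing.pulledStates   // gap between discovered and downloaded state entries (knownStates itself keeps growing)
> eth.syncing                                          // returns false once the node is fully synced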
Stop geth and start it again. It's pretty normal to be behind the highest block. For the ether, check on Etherscan whether you actually received the ether from the faucet or not; that way you will know at what block height you received your ether, and you can wait until your geth has synced up to that block. The best option, though, would be to use something like QuickNode, where you don't need to keep your machine running or wait for hours before continuing with development work. Yes, they have a small nominal fee, but for the service they provide it's pretty worth it.

NullPointerExceptions while executing LoadTest on WSO2BPS

While performing load tests on WSO2 BPS 3.2.0 we've run into a problem.
Let me tell you more about our project and our actions.
Our BPS process is designed to manage some interactions with 3 systems. Basically it is split into two parts: the first one to CREATE INSTANCE in one of the systems, then wait a bit, and then SELECT OFFER in the instance context.
In real life it looks like this: a user wants to get a product, the application asks a system for offers, and then the user selects an offer from the available ones.
In BPS the first part is a straightforward process; the second part is split into two flows, one to refresh the information with new offers, and another to wait for the user to choose one of them.
Our aim is to sustain about 1000-1500 simultaneous threads in the load test. The external systems are simulated by mockups executed by LoadUI.
We can achieve our goal if we disable "Process-Level Monitoring Events" in our process's deployment descriptor (set it to "none"). Everything goes well and smoothly for hours.
But if we enable this feature (and we need to), everything fails with an error very soon (at about the 100th-200th run):
[2015-07-28 17:47:02,573] ERROR {org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy} - Error processing response for MEX null
java.lang.NullPointerException
at org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy.onResponse(BPELProcessProxy.java:402)
at org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy.onAxisServiceInvoke(BPELProcessProxy.java:187)
at [... et cetera ...]
After the first appearance of this error, another type appears: other threads just fail after a timeout.
The database seems to be OK (by the way, it is MySQL 5.6.25). The dashboard shows no extreme levels of input or output.
So I think BPS itself is the bottleneck. We have given it an 8 GB heap, and its configuration options are set for extreme numbers of threads (negative values where possible, and otherwise just ridiculously big ones like 100000).
Has anyone ever faced this problem? I'd appreciate any help very much.
Solved in BPS version 3.5.0; refer to the release notes.

Can't delete disks on Google Cloud: not in ready state

I have a "standard persistent disk" of size 10GB on Google Cloud using Ubutu 12.04. Whenever, I try to remove this, I encounter following error
The resource 'projects/XXX/zones/us-central1-f/disks/tahir-run-master-340fbaced6a5-d2' is not ready
Does anybody know what's going on? How can I get rid of this disk?
This happened to me recently as well. I deleted an instance but the disk didn't get deleted (despite the auto-delete option being active). Any attempt to manually delete the disk resource via the dev console resulted in the mentioned error.
Additionally, the progress of the associated "Delete disk 'disk-name'" operation was stuck on 0%. (You can review the list of operations for your project by selecting compute -> compute engine -> operations from the navigation console).
I figured the disk resource was "not ready" because it was locked by the stuck operation, so I tried deleting the operation itself via the Google Compute Engine API (the dev console doesn't currently let you invoke the delete method on operation resources). It goes without saying that trying to delete the operation proved to be impossible as well.
At the end of the day, I just waited for the problem to fix itself. The following morning I tried deleting the disk again, and the operation was successful; it looks like the lock had been lifted in the meantime.
As for the cause of the problem, I'm still left clueless. It looks like the delete operation was stuck for whatever reason (probably related to some issue or race condition in the data center's hardware/software infrastructure).
I think this probably isn't considered a valid answer by SO's standards, but I felt like sharing my experience anyway, as I had a really hard time finding any info about this kind of Google Compute Engine problem.
If you happen to ever hit the same or similar issue, you can try waiting it out, as any stuck operation will (most likely) eventually be canceled after it has been in PENDING state for too long, releasing any locked resources in the process.
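The same checks can be done from the CLI: list the zone's unfinished operations, and once the stuck one has cleared, retry the delete. A sketch, using the zone and disk name from the question:
# Operations in the affected zone that are still pending or running
gcloud compute operations list --zones=us-central1-f --filter="status!=DONE"
# Retry the delete once the stuck operation is gone
gcloud compute disks delete tahir-run-master-340fbaced6a5-d2 --zone=us-central1-f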
Alternatively, if you need to solve the issue ASAP (which is often the case if the issue is affecting a resource that is critical to your production environment), you can try:
Contacting Google Support directly (only available to paid support customers)
Posting in the Google Compute Engine discussion group
Sending an email to gc-team(at)google.com to report a production issue
I believe your issue is the same as the one that was solved a few days ago.
If your issue wasn't resolved by those steps, you can follow Andrea's suggestion or create a new issue.
Regards,
Adrián.