Can't delete disks on Google Cloud: not in ready state - google-compute-engine

I have a "standard persistent disk" of size 10GB on Google Cloud using Ubutu 12.04. Whenever, I try to remove this, I encounter following error
The resource 'projects/XXX/zones/us-central1-f/disks/tahir-run-master-340fbaced6a5-d2' is not ready
Does anybody know what's going on? How can I get rid of this disk?

This happened to me recently as well. I deleted an instance but the disk didn't get deleted (despite the auto-delete option being active). Any attempt to manually delete the disk resource via the dev console resulted in the mentioned error.
Additionally, the progress of the associated "Delete disk 'disk-name'" operation was stuck at 0%. (You can review the list of operations for your project by selecting Compute -> Compute Engine -> Operations from the navigation console.)
I figured the disk resource was "not ready" because it was locked by the stuck operation, so I tried deleting the operation itself via the Google Compute Engine API (the dev console doesn't currently let you invoke the delete method on operation resources). Needless to say, deleting the operation proved impossible as well.
In the end, I just waited for the problem to fix itself. The following morning I tried deleting the disk again and the operation succeeded, so it looks like the lock had been lifted in the meantime.
As for the cause of the problem, I'm still clueless. It looks like the delete operation got stuck for whatever reason (probably some issue or race condition in the data center's hardware/software infrastructure).
I think this probably isn't considered a valid answer by SO's standards, but I felt like sharing my experience anyway, as I had a really hard time finding any info about this kind of Google Compute Engine problem.
If you ever hit the same or a similar issue, you can try waiting it out: a stuck operation will (most likely) eventually be canceled after it has been in the PENDING state for too long, releasing any locked resources in the process.
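If you'd rather check than guess, the Compute Engine API lets you inspect the zone's operations directly before retrying. Here's a minimal sketch using the googleapis Node client (the project name is a hypothetical placeholder, and error handling is omitted):

    // Sketch: list the zone's operations to spot one stuck in PENDING/RUNNING,
    // then retry the disk delete once nothing is holding the lock.
    import { google } from 'googleapis';

    const project = 'my-project'; // placeholder project ID
    const zone = 'us-central1-f';
    const disk = 'tahir-run-master-340fbaced6a5-d2';

    async function retryDiskDelete(): Promise<void> {
      const auth = new google.auth.GoogleAuth({
        scopes: ['https://www.googleapis.com/auth/compute'],
      });
      const compute = google.compute({ version: 'v1', auth });

      // A stuck delete shows up here as PENDING or RUNNING with progress 0.
      const ops = await compute.zoneOperations.list({ project, zone });
      const unfinished = (ops.data.items ?? []).filter((op) => op.status !== 'DONE');
      unfinished.forEach((op) =>
        console.log(`${op.name}: ${op.operationType} ${op.status} (${op.progress}%)`)
      );

      if (unfinished.length === 0) {
        // Nothing is holding the lock anymore, so the delete should go through.
        const res = await compute.disks.delete({ project, zone, disk });
        console.log('delete started, operation:', res.data.name);
      }
    }

    retryDiskDelete().catch(console.error);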
Alternatively, if you need to solve the issue ASAP (which is often the case if it is affecting a resource that is critical to your production environment), you can try:
Contacting Google Support directly (only available to paid support customers)
Posting in the Google Compute Engine discussion group
Sending an email to gc-team(at)google.com to report a production issue

I believe your issue is the same as the one that was solved a few days ago.
If performing those steps doesn't resolve your issue, you can follow Andrea's suggestion or create a new issue.
Regards,
Adrián.

Related

Memorystore Session and Cloud Functions

I have a few Cloud Functions and I share the user session among them using Cloud Memorystore. I used the connect-redis package and modified it to work with Memorystore.
It mostly works without issues. However, I have found that at times the functions are unable to access the session. It doesn't happen frequently; I have faced this issue maybe three or four times in the last month and a half. There are no errors in the functions, and I have checked my code rigorously.
I have found that redeploying the functions, even without any changes to the code, fixes the issue. I have only been working with GCP products for about two months now, and I am not sure whether these two products are incompatible or whether some edge case is being triggered that results in this issue.
Due to the sudden nature of the error, I am also not sure I can replicate the events leading up to it. What can I do to debug this error and get a more concrete understanding of what's happening?
According to this, using Cloud Functions with Memorystore should work normally without any issues.
It could be caused by many factors: possibly a connection timeout, a cold start of the function, or perhaps misuse of Memorystore leading to an issue on that side that prevents it from working as expected.
What I suggest is that you add logging before and after every part of the code that completes a significant step.
Basically, try to locate which parts of the code cause the issue or fail to show the expected results when it occurs, then split that part into smaller pieces to narrow down the cause. If even with logging everything seems okay on the Cloud Function side, most likely something is going on on the Memorystore side.
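For illustration, the checkpoint logging could look roughly like this in a Node function talking to Memorystore (a sketch only: the handler shape, the Redis host, and the sess: key scheme are assumptions, and it uses the plain node-redis v4 client rather than your modified connect-redis):

    // Sketch: timestamped checkpoints around the session lookup, so the logs
    // show exactly which step stalls when the issue recurs.
    import { createClient } from 'redis';
    import type { Request, Response } from 'express';

    const client = createClient({
      socket: { host: '10.0.0.3', port: 6379, connectTimeout: 2000 }, // hypothetical Memorystore endpoint
    });
    client.on('error', (err) => console.error('redis client error:', err));

    export async function handler(req: Request, res: Response): Promise<void> {
      const t0 = Date.now();
      console.log('checkpoint: request received');

      // Cold starts create a fresh instance, so the connection may not exist yet.
      if (!client.isOpen) {
        await client.connect();
        console.log(`checkpoint: redis connected (+${Date.now() - t0}ms)`);
      }

      const sessionId = req.get('x-session-id'); // hypothetical session key source
      const session = sessionId ? await client.get(`sess:${sessionId}`) : null;
      console.log(`checkpoint: session ${session ? 'found' : 'MISSING'} (+${Date.now() - t0}ms)`);

      res.status(200).send(session ?? 'no session');
    }

If the "MISSING" line shows up while the connect checkpoint is slow or absent, the problem is on the connection path (cold start, VPC connector) rather than in your session logic.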
It might then also be worth opening a public issue for further investigation if the problem seems unrelated to your code or configuration: Issue Tracker

Safe mode required for future system updates

My PS4 is currently set to update automatically, but since about a week ago it has required that I boot up in safe mode in order to do so, giving me error CE-30002-5 (needing safe mode to update). It is going to get really tedious turning on the system, finding out it needs a system update, shutting it down, booting back up into safe mode, updating, and then letting it boot back up in standard mode. Is there a way to fix this so it doesn't need safe mode to update?
From a quick Google search:
https://www.playstation.com/en-ie/get-help/help-library/error-codes/ce-30002-5/
In my opinion, you are focusing on the wrong aspect of this. You are focusing on the fact that this task will become tedious, and not on what you should be focusing on, which is that your system is throwing an error (i.e., something is not right or is broken).
You've got an error code, and Google is your friend. You'll get better help there than from someone else doing your research for you.

Corrupt chrome extensions and user retention [closed]

I have a decently popular Chrome extension, and yesterday I accidentally released a corrupted version of it and didn't catch it for 10 hours. Within those 10 hours, the extension was updated for most users, and based on my Google Analytics reports I lost about half of them overnight (I had about 600 pageviews every 30 minutes, and now I only have 285). When I found out about my mistake I quickly reverted to an older version that works, but now, about 30 hours after the update that fixes the corruption, my pageviews are still the same.
My questions are:
Have I lost all those users or has it simply not updated for them yet?
If an extension is corrupt does it still check for updates or do the users have to press repair?
Any insight would be fantastic. As you can imagine, losing half your users overnight because of one line of code is difficult to process.
What could have happened?
Well, this isn't a really clear situation, but based on your information there are a few possible scenarios:
Your users have uninstalled your extension because of the corrupt version. In this case (the worst one), it's pretty much impossible to bring your users back, unfortunately.
Your "corrupted" version had issues with the update handling. For example, issues with chrome.runtime.onInstalled being used to detect the update and add new features, or issues with the Analytics part. This would mean that:
Your extension worked fine before the update.
It has been updated with a broken update handling function/method.
The new update (the rollback to a working version) didn't solve anything, because your already-corrupted extension is now unable to apply updates and/or send pageviews to Analytics.
Your users disabled your extension in an attempt to narrow down the issue (that's very uncommon, an edge case).
Your users haven't gotten the new update yet (which after thirty hours is also pretty uncommon).
What could you do?
Again, let's split the situation up:
In the first case, you cannot really do anything, unfortunately. That was a bad mistake! Learn from it and always test a thousand times before pushing updates.
For the second case, you should test your corrupted version on your machine, maybe using Chrome Canary to make things faster. This obviously means you should have your previous versions stored somewhere; if you don't, it gets pretty hard, so for the future: always keep backups of your previous versions. Installing the old version, then manually updating to the corrupted one, and finally to the latest one, can really help you understand what's going on. You should meticulously check the update method and see if there's something wrong.
Note: if you're not listening to chrome.runtime.onUpdateAvailable and manually calling chrome.runtime.reload() to update your extension immediately, a Chrome restart may be necessary for it to update (see the sketch after this list).
For the third case, just wait; it's uncommon for this situation to happen, but waiting is also the only thing you can do.
Well, same as case #3.
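Regarding the note above, the immediate-update pattern looks roughly like this in the background script (a sketch of the standard chrome.runtime API, not your code):

    // Sketch: apply a pending update as soon as Chrome reports it; without
    // this listener, the update may wait until the next browser restart.
    chrome.runtime.onUpdateAvailable.addListener((details) => {
      console.log(`Version ${details.version} is available; reloading to apply it.`);
      chrome.runtime.reload(); // restarts the extension on the new version
    });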
If an extension is corrupt does it still check for updates or do the users have to press repair?
Well, there's no such thing as a "corrupted extension". Chrome will always check for updates (at least if the user didn't disable them in chrome://flags), even if your extension is just a bunch of SyntaxErrors. Don't worry about this.
Extreme fix
If you're not sure what to do, a redesign of your extension and a drastic purge of all the to-dos and bad practices is always a good thing. Just back up your previous version first, and start working on the 2.0 beta! Updating and improving the extension may well bring more users to it than just fixing an existing issue. Personally, I have almost always seen an increase in installations after performing drastic restyles and rewriting better code from scratch.
I hope you find the problem and bring your users back ASAP.
As you can imagine, losing half your users overnight because of one line of code is difficult to process.
Yes, I can imagine. As an extension developer, I really suffered from similar mistakes when I was learning update handling. So... well, break a leg! Wishing you the best.

How can I investigate these mystery Django crashes?

A Django site (hosted on Webfaction) that serves around 950k pageviews a month is experiencing crashes that I haven't been able to figure out how to debug. At unpredictable intervals (averaging about once per day, but not at the same time each day), all requests to the site start to hang/timeout, making the site totally inaccessible until we restart Apache. These requests appear in the frontend access logs as 499s, but do not appear in our application's logs at all.
In poring over the server logs (including those generated by django-timelog) I can't seem to find any pattern in which pages are hit right before the site goes down. For the most recent crash, all the pages hit right before the site went down are standard render-to-response operations using templates that look straightforward and work well the rest of the time. The requests right before the crash do not take longer according to timelog, and I haven't been able to replicate the crashes intentionally via load testing.
Webfaction says it isn't a case of overrunning our allowed memory usage, or else they would notify us. One thing to note is that the database is not restarted (just the app/Apache) when we bring the site back up.
How would you go about investigating this type of recurring issue? It seems like there must be a line of code somewhere that's hanging - do you have any suggestions about a process for finding it?
I once had some issues like this, and it basically boiled down to my misunderstanding of thread safety within Django middleware. Django middleware is, I believe, a singleton shared among all threads, and my threads were thrashing over the values set on a custom middleware class I had. My solution was to rewrite my middleware not to use instance or class attributes that changed, and to switch the critical parts of my application away from threads entirely in my uwsgi server, as they seemed to be an overall performance downside for my app. Threaded uwsgi setups seem to work best when you have views that complete at different intervals (some long-running views and some fast ones).
Since you can't really describe what the failure conditions are until you can replicate the crash, you may need to force the situation with ab (Apache Bench). If you don't want to do this against your production site, you might replicate the site on a subdomain. Warning: ab can beat the ever-loving crap out of a server, so RTM. You might also want to give the WF admins a heads-up about what you are going to do.
Update for comment:
I was suggesting using the exact same machine so that the subdomain name was the only difference. Given that you used a different machine, there are a large number of subtle (and not-so-subtle) environmental differences that could keep the error from manifesting. If the new machine is OK, and if you are willing to walk away from the problem without actually solving it, you might simply make it your production machine and be happy. Personally I tend to obsess about stuff like this, but then again I'm also retired and have plenty of time to play with my toes. :-)

Is there any way to monitor the number of CAS stackwalks that are occurring?

I'm working with a time-sensitive desktop application that uses P/Invoke extensively, and I want to make sure the code is not wasting a lot of time on CAS stackwalks.
I have used the SuppressUnmanagedCodeSecurity attribute where I think it is necessary, but I might have missed a few places. Does anyone know if there is a way to monitor the number of CAS stackwalks that are occurring and, better yet, pinpoint the source of the security demands?
You can use the Process Explorer tool (from Sysinternals) to monitor your process.
Bring up Process Explorer, select your process and right click to show "Properties". Then, on the .NET tab, select the .NET CLR Security object to monitor. Process Explorer will show counters for
Total Runtime Checks
Link Time Checks
% Time in RT Checks
Stack Walk Depth
These are the standard security performance counters described here:
http://msdn.microsoft.com/en-us/library/adcbwb64.aspx
You could also use Perfmon or write your own code to monitor these counters.
As far as I can tell, the only one that is really useful is the first, Total Runtime Checks. You could keep an eye on it while you are debugging to see if it is increasing substantially. If so, you need to examine what is causing the security demands.
I don't know of any other tools that will tell you when a stackwalk is being triggered.