Memorystore Session and Cloud Functions - google-cloud-functions

I have a few functions and I share user session among the functions using Cloud Memorystore. I used connect-redis package and I modified it to work with Memorystore.
It mostly works without issues. However, I have found that at times the Cloud Functions are unable to access the session. It doesn't happen frequently; I have faced this issue maybe three or four times in the last month and a half. There are no errors in the functions, and I have checked my code rigorously.
I have always found that redeploying the functions, even without any changes to the code, fixes the issue. I have only been working with GCP products for about two months now, and I am not sure whether these two products are incompatible or whether some edge case is being triggered that results in this issue.
Due to the sudden nature of the error, I am also not sure I can replicate the events leading up to it. What can I do to debug this error and get a more concrete understanding of what's happening?

According to this, using Cloud Functions with Memorystore should work normally without any issues.
The problem could be caused by many factors: possibly a connection timeout, a cold start of the function, or perhaps a misuse of Memorystore that leaves it in a state where it doesn't behave as expected.
What I suggest is that you add logging before and after each major step of the code.
Basically, try to locate which parts of the code are misbehaving or not showing the expected results when the issue occurs, then split that part into smaller pieces to narrow down the cause. If, even with logging, everything seems fine in the Cloud Function, most likely something is going on on the Memorystore side.
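As a rough illustration of that logging approach, here is a minimal sketch. It is written in Python with the redis client purely to show the pattern (your functions use Node and connect-redis, so the equivalent would be timestamped console.log calls around the store's get/set); the host, key format, and step names are placeholders.

    import logging
    import time

    import redis  # pip install redis

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("session-debug")

    # Placeholder Memorystore host/port; socket_timeout keeps a bad connection
    # from hanging the whole function invocation.
    redis_client = redis.Redis(host="10.0.0.3", port=6379, socket_timeout=2)

    class timed:
        """Log the start, duration, and any failure of one named step."""
        def __init__(self, step):
            self.step = step
        def __enter__(self):
            self.start = time.monotonic()
            log.info("START %s", self.step)
        def __exit__(self, exc_type, exc, tb):
            elapsed = time.monotonic() - self.start
            if exc:
                log.error("FAIL %s after %.3fs: %r", self.step, elapsed, exc)
            else:
                log.info("DONE %s in %.3fs", self.step, elapsed)
            return False  # never swallow the exception

    def handle_request(session_id):
        # Hypothetical handler: each major step is wrapped so the logs show
        # exactly where things stall when the problem reappears.
        with timed("redis GET session"):
            raw = redis_client.get("sess:" + session_id)
        with timed("deserialize session"):
            session = raw.decode("utf-8") if raw else None
        with timed("business logic"):
            pass  # ... the rest of the function ...
        return session

If every step inside the function logs normally and quickly while the session data still comes back empty, that points at the Memorystore side (or the connection between the function and the Redis instance) rather than at your code.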
It might also then be worth opening a public issue for further investigation if the problem seems to be unrelated to your code or configuration: Issue Tracker

When running the DAML sandbox an error occurs

The following error occurs when running the sandbox:
io.grpc.netty.NettyServerHandler onStreamError
WARNING: Stream Error
io.netty.handler.codec.http2.Http2Exception$HeaderListSizeException: Header size exceeded max allowed size (8192)
What could the cause of this be?
I have seen this error numerous times, and it is a consequence of having a transaction failure in a complex DAML model/transaction when running on the Sandbox. When you experience a transaction failure (fetch/exercise an inactive contract, lookupByKey returned a stale cid, head [], divide-by-zero, etc) the sandbox helpfully tries to provide transaction trace information in the error result.
This is normally fine for relatively simple models. With more complex models this trace can exceed the maximum header size producing the error you see. When this happens I have found the trace in the sandbox.log file, sometimes along with other errors that help explain what is going on.
The trace is an unformatted dump, so it can take a bit of effort to decode manually, but I have done it many times myself and the information I needed to identify the issue has always been there. To be honest, generally just knowing the choice I was exercising plus the specific class of error is enough to point me in the right direction.
I believe there is some tooling being built to help with this sort of diagnosis; however, I don't know how advanced the work on that is.
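For what it's worth, the 8192 in the error is the maximum header/trailer (metadata) size the client advertises it will accept, so the oversized trace is being rejected at the gRPC layer. Purely to illustrate that mechanism (this is not DAML-specific advice, and the port below is just the usual sandbox default used as a placeholder), a Python grpcio client can raise its limit with the standard grpc.max_metadata_size channel option; Java's channel builders have an equivalent maxInboundMetadataSize setting, if I recall correctly:

    import grpc

    # Allow up to 1 MiB of metadata so a large error trace is not rejected
    # with a header-list-size failure. The address is a placeholder.
    channel = grpc.insecure_channel(
        "localhost:6865",
        options=[("grpc.max_metadata_size", 1024 * 1024)],
    )

Even so, reading sandbox.log as described above is usually the quicker way to get at the trace.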

Case Study in Bugzilla

I wish to conduct a case study in Bugzilla, where I would ideally like to find out some statistics such as:
The number of memory leaks
The percentage of bugs which are performance bugs
The percentage of semantic bugs
How can I search through the Bugzilla database for software such as the Apache HTTP Server or the MySQL database server to generate such statistics? How do I get started?
I finally figured it out and am going to show my approach here. It's more or less a manual process. Hopefully this helps others who might be doing case studies as well:
Selecting the bugs:
I went to bugs.mysql.com and searched for all bugs which were marked as resolved and fixed. Unfortunately, for MySQL you cannot select specific components. I filtered by a random time range (2013-2014) and saved all of these to a CSV file.
Classifying and filtering the bugs:
I manually went through the bugs, skipping the ones which clearly belonged to the documentation component, were about installation or compilation failures, or required restarts.
Then I read each report and checked whether it actually suggested a bug fix, and whether the fix was semantic (e.g. changing limits, adding a condition check, making sure some if condition correctly handles an edge case, etc. - most were along similar lines). I followed a similar process for performance, resource-leak (both CPU resources and memory leaks fall in this category), and concurrency bugs.
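If the tracker you study is an actual Bugzilla instance (Apache's is, at bz.apache.org/bugzilla; bugs.mysql.com is its own system), part of the selection step can be automated through Bugzilla's standard REST API. The sketch below is only a starting point and everything in it is illustrative: the product name and keyword lists are made up, and a keyword match is at best a first pass before the manual classification described above.

    # Sketch: pull RESOLVED/FIXED bugs from a Bugzilla 5.x instance through its
    # documented REST API and do a crude keyword-based first pass. The product
    # name and keyword lists below are placeholders for illustration.
    from collections import Counter

    import requests

    BASE = "https://bz.apache.org/bugzilla/rest/bug"

    def fetch_bugs(product, limit=500):
        params = {
            "product": product,
            "status": "RESOLVED",
            "resolution": "FIXED",
            "limit": limit,
            "include_fields": "id,summary,component,creation_time",
        }
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()["bugs"]

    KEYWORDS = {
        "memory/resource leak": ("leak", "out of memory", "oom"),
        "performance": ("slow", "performance", "regression", "cpu"),
        "concurrency": ("race", "deadlock", "thread"),
    }

    def rough_classify(bugs):
        counts = Counter()
        for bug in bugs:
            summary = bug["summary"].lower()
            for label, words in KEYWORDS.items():
                if any(word in summary for word in words):
                    counts[label] += 1
        return counts

    if __name__ == "__main__":
        bugs = fetch_bugs("Apache httpd-2")  # placeholder product name
        print(len(bugs), "bugs fetched")
        print(rough_classify(bugs))

Even with something like this, I would still spot-check a sample by hand, since summaries alone often misclassify bugs.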

Chrome error: "Inspected target disconnected. Once it reloads we will attach to it automatically."

I receive a Chrome (43.0.2357.124) "Aw, snap!" error that renders "Inspected target disconnected. Once it reloads we will attach to it automatically." in the developer console.
Without being too specific to my project, and trying to make the question more generally applicable, it appears to occur during a Promise that features a ~5 second delay.
This function (can be seen in context on the repo https://github.com/mitTransportAnalyst/CoAXs/blob/master/public/scripts/main/services/analystService.js#L35-L44) performs fine on Firefox 38.0.5. It is receiving a large GeoJSON array - perhaps that could somehow be related to the issue, though I do not know for sure.
At this point, any advice on next steps for how to debug this would be appreciated, even googling this specific issue doesn't come up with any results (5 irrelevant results as of Wed 6:00, June 17: https://www.google.com/search?sclient=psy-ab&biw=1280&bih=678&q=%22inspected%20target%20disconnected%22%20chrome&oq=%22inspected%20target%20disconnected%22%20chrome&gs_l=serp.3...805885.806603.1.806844.2.2.0.0.0.0.72.122.2.2.0....0...1c.1.64.serp..2.0.0.O7y1WqVbj0c&pbx=1&psj=1&ion=1&cad=cbv&sei=LvKBVfarHcyw-AHVioHYBg&rct=j#q=%22Inspected+target+disconnected%22+chrome).
Added this as a comment but interested to see if anyone knows why this happened:
Issue ended up being related to the delayed receipt of > 3 MB files (assembled piecemeal). There is some (limited) documentation of this error occurring here: code.google.com/p/v8/issues/detail?id=3968 (the results of which are, unfortunately, inconclusive). Ended up working with the data provider and reducing the file size substantially, which resolved the issue. Curiously (if anyone can posit as to why this was occurring), adding console.log calls where the data was concatenated seemed to avoid the issue. Without them, the tab would suddenly exceed ~1.3 GB and crash.
You can see the point where I was adding those console.log calls here: https://github.com/mitTransportAnalyst/CoAXs/blob/master/public/scripts/analyst.js#L10343
Turn off your extensions. I had a Knockoutjs context debugger plugin and it caused the very same behaviour with the same version of Chrome.
I just had the same problem. In my case the code had an infinite loop that kept adding the same object again and again, which used more and more memory. Once the RAM was full, the page became unresponsive. When I checked it in Mozilla Firefox, my antivirus showed a RAM-full alert; Chrome could handle it, but Firefox couldn't and kept looping for as long as it could. So don't blame Chrome, it is handling the situation. Check your code, and if it's not your page, leave it.
Finally, check the loops.

Can't delete disks on Google Cloud: not in ready state

I have a "standard persistent disk" of size 10GB on Google Cloud using Ubuntu 12.04. Whenever I try to remove it, I encounter the following error:
The resource 'projects/XXX/zones/us-central1-f/disks/tahir-run-master-340fbaced6a5-d2' is not ready
Does anybody know what's going on? How can I get rid of this disk?
This happened to me recently as well. I deleted an instance but the disk didn't get deleted (despite the auto-delete option being active). Any attempt to manually delete the disk resource via the dev console resulted in the mentioned error.
Additionally, the progress of the associated "Delete disk 'disk-name'" operation was stuck on 0%. (You can review the list of operations for your project by selecting compute -> compute engine -> operations from the navigation console).
I figured the disk resource was "not ready" because it was locked by the stuck operation, so I tried deleting the operation itself via the Google Compute Engine API (the dev console doesn't currently let you invoke the delete method on operation resources). Needless to say, trying to delete the operation proved impossible as well.
At the end of the day, I just waited for the problem to fix itself. The following morning I tried deleting the disk again and the operation succeeded, so it looks like the lock had been lifted in the meantime.
As for the cause of the problem, I'm still clueless. It looks like the delete operation was stuck for whatever reason (probably related to some issue or race condition in the data center's hardware/software infrastructure).
I think this probably isn't considered a valid answer by SO's standards, but I felt like sharing my experience anyway, as I had a really hard time finding any info about this kind of Google Compute Engine problem.
If you ever happen to hit the same or a similar issue, you can try waiting it out, as any stuck operation will (most likely) eventually be cancelled after it has been in PENDING state for too long, releasing any locked resources in the process.
Alternatively, if you need to solve the issue ASAP (which is often the case if it is affecting a resource that is critical to your production environment), you can try:
Contacting Google Support directly (only available to paid support customers)
Posting in the Google Compute Engine discussion group
Sending an email to gc-team(at)google.com to report a production issue
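If you want to confirm that a stuck operation is what is holding the disk lock (before deciding to wait it out), you can also list the zone operations programmatically rather than through the console. A minimal sketch with the google-api-python-client Compute Engine client; the project is a placeholder and it assumes Application Default Credentials are configured:

    # Sketch: list recent operations in one zone and print their status, so a
    # PENDING/RUNNING delete operation holding the disk lock is easy to spot.
    # PROJECT is a placeholder; credentials come from Application Default
    # Credentials (e.g. after `gcloud auth application-default login`).
    from googleapiclient import discovery

    PROJECT = "my-project"
    ZONE = "us-central1-f"

    compute = discovery.build("compute", "v1")

    request = compute.zoneOperations().list(project=PROJECT, zone=ZONE)
    while request is not None:
        response = request.execute()
        for op in response.get("items", []):
            print(op["name"], op["operationType"], op["status"],
                  str(op.get("progress", "")), op.get("targetLink", ""))
        request = compute.zoneOperations().list_next(request, response)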
I believe your issue is the same as the one that was solved a few days ago.
If your issue still happens after performing those steps, you can follow Andrea's suggestion or create a new issue.
Regards,
Adrián.

How can I investigate these mystery Django crashes?

A Django site (hosted on Webfaction) that serves around 950k pageviews a month is experiencing crashes that I haven't been able to figure out how to debug. At unpredictable intervals (averaging about once per day, but not at the same time each day), all requests to the site start to hang/timeout, making the site totally inaccessible until we restart Apache. These requests appear in the frontend access logs as 499s, but do not appear in our application's logs at all.
In poring over the server logs (including those generated by django-timelog) I can't seem to find any pattern in which pages are hit right before the site goes down. For the most recent crash, all the pages that are hit right before the site went down seem to be standard render-to-response operations using templates that seem pretty straightforward and work well the rest of the time. The requests right before the crash do not seem to take longer according to timelog, and I haven't been able to replicate the crashes intentionally via load testing.
Webfaction says it isn't a case of overrunning our allowed memory usage, or else they would notify us. One thing to note is that the database is not being restarted (just the app/Apache) when we bring the site back up.
How would you go about investigating this type of recurring issue? It seems like there must be a line of code somewhere that's hanging - do you have any suggestions about a process for finding it?
I once had some issues like this, and it basically boiled down to my misunderstanding of thread safety within Django middleware. Django middleware is, I believe, a singleton that is shared among all threads, and those threads were thrashing the values set on a custom middleware class I had. My solution was to rewrite my middleware so that it didn't use instance or class attributes that changed, and to switch the critical parts of my application away from threads entirely with my uWSGI server, as threads seemed to be an overall performance downside for my app. Threaded uWSGI setups seem to work best when you have views that may complete at different intervals (some long-running views and some fast ones).
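To illustrate the kind of bug I mean (the class and attribute names below are invented for the example, not taken from my actual code), the first middleware stores per-request state on the shared instance, so two concurrent requests can overwrite each other's values; the second keeps the state on the request object, which is never shared between threads:

    # Hedged illustration only: a shared middleware instance with mutable
    # per-request state (unsafe under threading) and a version that keeps the
    # state local to each request. Old-style Django middleware shown.

    class BrokenUserHeaderMiddleware(object):
        def process_request(self, request):
            # One instance serves every thread, so this attribute is shared:
            # another request can overwrite it before process_response runs.
            self.current_user = getattr(request, "user", None)

        def process_response(self, request, response):
            response["X-Debug-User"] = str(self.current_user)
            return response

    class SaferUserHeaderMiddleware(object):
        def process_request(self, request):
            # Stash per-request state on the request object itself, which is
            # never shared between threads.
            request._debug_user = getattr(request, "user", None)

        def process_response(self, request, response):
            response["X-Debug-User"] = str(getattr(request, "_debug_user", None))
            return response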
Since you can't really describe what the failure conditions are until you can replicate the crash, you may need to force the situation with ab (apache benchmark). If you don't want to do this with your production site you might replicate the site in a subdomain. Warning: ab can beat the ever loving crap out of a server, so RTM. You might also want to give the WF admins a heads up about what you are going to do.
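If you'd rather script the load test than drive ab by hand, a very small Python sketch along these lines does the same job of hammering one URL from many threads and tallying the outcomes. The URL, concurrency, and request counts are placeholders, and the same warning applies: point it at a staging copy, not at production.

    # Tiny load-generation sketch (an alternative to ab): hit one URL from many
    # threads and count the response codes / errors.
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "https://staging.example.com/some-view/"  # placeholder
    CONCURRENCY = 20
    REQUESTS = 2000

    def hit(_):
        try:
            return requests.get(URL, timeout=10).status_code
        except requests.RequestException as exc:
            return type(exc).__name__

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = Counter(pool.map(hit, range(REQUESTS)))

    print(results)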
Update for comment:
I was suggesting using the exact same machine so that the subdomain name was the only difference. Given that you used a different machine there are a large number of subtle (and not so subtle) environmental things that could tweak you away from getting the error to manifest. If the new machine is OK, and if you are willing to walk away from the problem without actually solving it, you might simply make it your production machine and be happy. Personally I tend to obsess about stuff like this, but then again I'm also retired and have plenty of time to play with my toes. :-)