Google Cloud SQL No Response - mysql

We are running a Sails.js API on Google Container Engine with a Cloud SQL database, and recently some of our endpoints have been stalling and never sending a response.
I had a health check monitoring /v1/status, and it registered 100% uptime when the endpoint returned the following simple response:
status: function( req, res ){
  res.ok('Welcome to the API');
}
As soon as we added a database query, the endpoint started timing out. It doesn't happen all the time, but seemingly at random intervals, sometimes for hours on end. This is what we changed the endpoint to:
status: function( req, res ){
  Email.findOne({ value: "someone@example.com" }).then(function( email ){
    res.ok('Welcome to the API');
  }).fail(function(err){
    res.serverError(err);
  });
}
Rather suspiciously, this all works fine in our staging and development environments. The timeout only occurs when the code is deployed in production, and even then only some of the time. The only things that change between staging and production are the database we are connecting to and the load on the server.
As I mentioned earlier, we are using Google Cloud SQL and the Sails-MySQL adapter. We have the following error stacks from the production server:
AdapterError: Invalid connection name specified
at getConnectionObject (/app/node_modules/sails-mysql/lib/adapter.js:1182:35)
at spawnConnection (/app/node_modules/sails-mysql/lib/adapter.js:1097:7)
at Object.module.exports.adapter.find (/app/node_modules/sails-mysql/lib/adapter.js:801:16)
at module.exports.find (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:120:13)
at module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/adapter/dql.js:163:10)
at _runOperation (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:408:29)
at run (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/operations.js:69:8)
at bound.module.exports.findOne (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/finders/basic.js:78:16)
at bound [as findOne] (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at Deferred.exec (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:501:16)
at tryCatcher (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/util.js:26:23)
at ret (eval at <anonymous> (/app/node_modules/sails/node_modules/waterline/node_modules/bluebird/js/main/promisify.js:163:12), <anonymous>:13:39)
at Deferred.toPromise (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:510:61)
at Deferred.then (/app/node_modules/sails/node_modules/waterline/lib/waterline/query/deferred.js:521:15)
at Strategy._verify (/app/api/services/passport.js:31:7)
at Strategy.authenticate (/app/node_modules/passport-local/lib/strategy.js:90:12)
at attempt (/app/node_modules/passport/lib/middleware/authenticate.js:341:16)
at authenticate (/app/node_modules/passport/lib/middleware/authenticate.js:342:7)
at Object.AuthController.login (/app/api/controllers/AuthController.js:119:5)
at bound (/app/node_modules/sails/node_modules/lodash/dist/lodash.js:729:21)
at routeTargetFnWrapper (/app/node_modules/sails/lib/router/bind.js:179:5)
at callbacks (/app/node_modules/sails/node_modules/express/lib/router/index.js:164:37)
Error (E_UNKNOWN) :: Encountered an unexpected error :
Could not connect to MySQL: Error: Pool is closed.
at afterwards (/app/node_modules/sails-mysql/lib/connections/spawn.js:72:13)
at /app/node_modules/sails-mysql/lib/connections/spawn.js:40:7
at process._tickDomainCallback (node.js:381:11)
Looking at the errors alone, I'd be tempted to say that we have something misconfigured. But the fact that it works some of the time (and had previously been working fine!) leads me to believe that there's some other black magic at work here. Our Cloud SQL instance is D0 (though we've tried upping the size to D4) and our activation policy is "Always On".
EDIT: I had seen others complain about Google Cloud SQL (e.g. this SO post) and I was suspicious, but we have since moved our database to Amazon RDS and we are still seeing the same issues, so it must be a problem with Sails and the MySQL adapter.
This issue is leading to hours of downtime a day and we need it resolved. Any help is much appreciated!

This appears to be a sails issue, and not necessarily related to Cloud SQL.
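While the root cause is being tracked down, one way to keep the health check from hanging is to race the query against a timer, so the endpoint always sends some response. The following is only a minimal sketch using plain Promises: withTimeout and the 5000 ms limit are hypothetical choices, not part of Sails or sails-mysql, while Email, res.ok and res.serverError are the names used in the question.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise(function (resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error('Timed out after ' + ms + ' ms'));
    }, ms);
  });
  // Whichever settles first wins; the timer is cleared in both cases.
  return Promise.race([promise, timeout]).then(
    function (value) { clearTimeout(timer); return value; },
    function (err) { clearTimeout(timer); throw err; }
  );
}

// Sketch of the status action from the question, wrapped so it always answers.
// The 5000 ms limit is an arbitrary assumption.
function status(req, res) {
  // Promise.resolve adopts the thenable returned by findOne.
  withTimeout(Promise.resolve(Email.findOne({ value: 'someone@example.com' })), 5000)
    .then(function () { res.ok('Welcome to the API'); })
    .catch(function (err) { res.serverError(err); });
}
```

This does not fix the underlying connection problem; it only turns a silent stall into an explicit error that a health check can register.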

Is there any way the QPS limit for Google Cloud SQL is being reached? See here: https://cloud.google.com/sql/faq#sizeqps

Why is my database instance sometimes slow to respond?
In order to minimize the amount you are charged for instances on per use billing plans, by default your instance becomes passive if it is not accessed for 15 minutes. The next time it is accessed there will be a short delay while it is activated. You can change this behavior by configuring the activation policy of the instance. For an example, see Editing an Instance Using the Cloud SDK.
It might be related to your activation policy setting. If you set it to ON_DEMAND, the instance goes to sleep to save on cost, so the first query, which has to wake the instance up, is slow. This might cause the timeout.
https://cloud.google.com/sql/faq?hl=en

Related

Internal server error on Data Management API (GET search)

Today we started getting an internal server error from the search method.
Before today almost everything was fine (except that this method returned "Too many requests" very often).
We call the Data Management API to get all versions with the "rvt" extension:
GET projects/:project_id/folders/:folder_id/search
We get:
{"jsonapi":{"version":"1.0"},"errors":[{"id":"1b2ec532-d1ec-4ed2-a174-02fb6a097195","status":"500","detail":"Internal Server Error"}]}
What is the root cause of this issue?
What can we use to filter versions without getting an error?

Couchbase Java SDK times out with BUCKET_NOT_AVAILABLE

I am doing a lookup operation with the Couchbase Java SDK 3.0.9, which looks like this:
// Set up
bucket = cluster.bucket("my_bucket")
collection = bucket.defaultCollection()
// Look up operation
val specs = listOf(LookupInSpecStandard.get("hash"))
collection.lookupIn(id, specs)
The error I get is BUCKET_NOT_AVAILABLE. Here is the full message:
com.couchbase.client.core.error.UnambiguousTimeoutException: SubdocGetRequest, Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0xdb7f8e4800000003","idempotent":true,"reason":"TIMEOUT","requestId":608806,"requestType":"SubdocGetRequest","retried":39,"retryReasons":["BUCKET_NOT_AVAILABLE"],"service":{"bucket":"export","collection":"_default","documentId":"export:main","opaque":"0xcfefb","scope":"_default","type":"kv"},"timeoutMs":15000,"timings":{"totalMicros":15008977}}
The strange part is that this code hasn't been touched for months and the lookup broke all of a sudden. The CB cluster is working fine. Its version is
Enterprise Edition 6.5.1 build 6299.
Do you have any ideas what might have gone wrong?
Note that in Couchbase Java SDK 3.x, the Cluster::bucket method returns instantly, and continues opening a bucket in the background. So the first operation you perform - a lookupIn here - needs to wait for that resource opening to complete before it can proceed. It looks like it took a little longer to access the Couchbase bucket than usual and you got a timeout.
I recommend using the Bucket::waitUntilReady method after opening a bucket, to block until the resource opening is complete:
bucket = cluster.bucket("my_bucket")
bucket.waitUntilReady(Duration.ofMinutes(1));
This problem can occur because of a firewall. You need to allow these ports:
Client-to-node
Unencrypted: 8091-8097, 9140 [3], 11210
Encrypted: 11207, 18091-18095, 18096, 18097
You can find more details at the link below:
https://docs.couchbase.com/server/current/install/install-ports.html#_footnotedef_2

How to handle "Unexpected EOF at target" error from API calls?

I'm creating a Forge application which needs to get version information from a BIM 360 hub. Sometimes it works, but sometimes (usually after the code has already been run once this session) I get the following error:
Exception thrown: 'Autodesk.Forge.Client.ApiException' in mscorlib.dll
Additional information: Error calling GetItem: {
"fault":{
"faultstring":"Unexpected EOF at target",
"detail": {
"errorcode":"messaging.adaptors.http.flow.UnexpectedEOFAtTarget"
}
}
}
The above error will be thrown from a call to an api, such as one of these:
dynamic item = await itemApi.GetItemAsync(projectId, itemId);
dynamic folder = await folderApi.GetFolderAsync(projectId, folderId);
var folders = await projectApi.GetProjectTopFoldersAsync(hubId, projectId);
Where the apis are initialized as follows:
ItemsApi itemApi = new ItemsApi();
itemApi.Configuration.AccessToken = Credentials.TokenInternal;
The Ids (such as 'projectId', 'itemId', etc.) don't seem to be any different when this error is thrown and when it isn't, so I'm not sure what is causing the error.
I based my application on the .Net version of this tutorial: http://learnforge.autodesk.io/#/datamanagement/hubs/net
But I adapted it so I can retrieve multiple nodes asynchronously (for example, all of the nodes a user has access to) without changing the jstree. I did this to allow extracting information in the background without disrupting the user's workflow. The main change I made was to add another Route on the server side that calls "GetTreeNodeAsync" (from the tutorial) asynchronously on the root of the tree and then calls it on each of the returned children, then each of their children, and so on. The function waits until all of the nodes are processed using Task.WhenAll, then returns data from each of the nodes to the client.
This means that there could be many API calls running asynchronously, and there might be duplicate API calls if a node was already opened in the jstree and then its information is requested for the background extraction, or if the background extraction happens more than once. This seems to be when the error is most likely to happen.
I was wondering if anyone else has encountered this error, and if you know what I can do to avoid it, or how to recover when it is caught. Currently, after this error occurs, it seems that every other API call will throw this error as well, and the only way I've found to fix it is to rerun the code (I use Visual Studio, so I just rerun the server and client, and my browser launches automatically).
Those are sporadic errors from our Apigee router, caused by latency issues in the authorization process that we are currently looking into internally.
When they occur, please stop sending further requests, wait a few minutes, and then retry. Take a look at stuff like this or this to help you out.
Our existing reports of similar errors seem to point to concurrency as one of the factors leading up to the issue, so you might also want to limit your concurrent requests and see if that mitigates the issue.
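The two suggestions above (limit concurrency, back off and retry) can be sketched generically. The question's code is .NET; this sketch is in JavaScript purely to illustrate the pattern, and mapWithLimit and retryWithBackoff are hypothetical helper names, not part of any Forge SDK. The limit, attempt count, and delays are assumptions to be tuned.

```javascript
// Runs fn(item) for every item with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unprocessed index until none are left.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = [];
  for (let w = 0; w < Math.min(limit, items.length); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Retries fn() up to `attempts` times, doubling the wait after each failure.
async function retryWithBackoff(fn, attempts, baseDelayMs) {
  let delay = baseDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts) throw err; // give up after the last attempt
      await new Promise(function (resolve) { setTimeout(resolve, delay); });
      delay *= 2;
    }
  }
}
```

In use, each node lookup would be wrapped in retryWithBackoff and the whole traversal passed through mapWithLimit, so that a burst of duplicate background requests never exceeds the chosen limit.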

Google Cloud Function: lazy loading not working

I deployed a Google Cloud Function with lazy loading that loads data from Google Datastore. The function was last updated on 7/25/18, 11:35 PM. It worked well last week.
Normally, if the function is called less than about 30 minutes after the previous call, it does not need to load the data from Google Datastore again. But I found that lazy loading has not been working since yesterday, even when the time between two calls is less than 1 minute.
Has anyone else run into the same problem? Thanks!
Cloud Functions can fail for several reasons, such as uncaught exceptions and internal process crashes. Check the log files and the HTTP response error messages to verify the root cause and to determine whether the function is being restarted or hitting function execution timeouts, which could explain why your function is not working.
I suggest you take a look at the Reporting Errors documentation, which explains how to return an error from a function, so that you can validate the exact error message thrown by the service and return the error in the recommended way. Keep in mind that when errors are returned correctly, the function instance that returned the error is labelled as behaving normally, which avoids cold starts that lead to higher latency and keeps the instance available to serve future requests if need be.
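For reference, lazy loading in a function usually relies on a module-scope variable that only lives as long as the instance stays warm, which is why a crash or restart makes the data load again. A minimal sketch of that pattern follows; getData and loadFn are hypothetical names, and loadFn stands in for whatever Datastore query the function performs.

```javascript
// Module-scope variable: it survives between invocations only while the same
// function instance stays warm. A new or restarted instance starts empty.
let cachedData = null;

// loadFn is a hypothetical loader (for example a Datastore query) that
// returns a promise. The promise itself is cached, so concurrent callers
// share a single load.
function getData(loadFn) {
  if (cachedData === null) {
    cachedData = loadFn().catch(function (err) {
      cachedData = null; // do not cache failures; the next call retries
      throw err;
    });
  }
  return cachedData;
}
```

If the logs show the load running on every call, that suggests each request is landing on a fresh instance, for example because the previous one crashed.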

Fiware CEP server stops responding

While developing on Fi-Cloud's CEP, I've repeatedly run into the following issue. As I'm trying to develop a definition to perform a task, CEP's server and Authoring Tool stop responding, although ssh is still responsive.
This issue happens as I develop. I'm using the AuthoringTool to alter the definition bit by bit and then I re-upload it to the server through the authoring tool's export feature.
To reinitiate the proton with the new definition each time I alter it, I use Google's Postman with this single operation:
-PUT (url:http://{ip}:8080/ProtonOnWebServerAdmin/resources/instances/ProtonOnWebServer)
header: 'Content-Type' : 'application/json'; body : {"action": "ChangeDefinitions","definitions-url" : "/ProtonOnWebServerAdmin/resources/definitions/Definition_Name"}
At the same time, I'm logged in with three ssh instances: one to monitor the files being created in /opt/tomcat10/sample/ and other things, and the other two to 'tail -f ' the log files the definition writes to as events are processed (one log for events received and another for events detected by the EPAgent).
I'm iterating through these procedures over and over as I'm developing, and eventually the CEP server and the Authoring Tool stop responding.
By "tailing" tomcat's log file (# tail -f /opt/tomcat10/logs/catalina.out) I can see that, under these circumstances, if I attempt a:
-GET (url: http://{ip}:8080/ProtonOnWebServerAdmin/resources/instances/ProtonOnWebServer)
I get no response back and tomcat logs the following response:
11452100 [http-bio-8080-exec-167] ERROR org.apache.wink.server.internal.RequestProcessor - An unhandled exception occurred which will be propagated to the container.
java.lang.OutOfMemoryError: PermGen space
Exception in thread "http-bio-8080-exec-167" java.lang.OutOfMemoryError: PermGen space
Ssh is still responsive and I can look at tomcat's log this way.
To get over this and continue, I exit ssh connections and restart CEP's instance in the Fi-Cloud.
Is the procedure I'm using to re-upload and re-run the definition inappropriate? Should I take a different approach to developing?
When you update a definition that the CEP is already working with, and you want the CEP engine to work with the updated definition, you need to:
1. Export the definition using the authoring tool export (as you did).
2. Stop the engine run, using REST PUT:
PUT //host:8080/ProtonOnWebServerAdmin/resources/instances/ProtonOnWebServer
{"action":"ChangeState","state":"stop"}
3. Start the engine, using REST PUT:
PUT //host:8080/ProtonOnWebServerAdmin/resources/instances/ProtonOnWebServer
{"action":"ChangeState","state":"start"}
You don't need to activate the "ChangeDefinitions" action, since it is the same definition name that the engine is already working with.
Activating "ChangeDefinitions" action, only influences the next run of the CEP, and has no influence on the current run.
This answers your question about how you should update a CEP definition.
Hope it will solve your issue.
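The stop and start requests described above can be scripted instead of being sent by hand. This is only a sketch that builds the two requests: buildStateChange and restartRequests are hypothetical names, the http scheme and the Content-Type header are taken from the question, and the host value is a placeholder. Sending them is left to whatever HTTP client you use.

```javascript
// Builds the PUT request that changes the engine state ('stop' or 'start').
// The path and JSON body are the ones quoted in the answer above.
function buildStateChange(host, state) {
  return {
    method: 'PUT',
    url: 'http://' + host + ':8080/ProtonOnWebServerAdmin/resources/instances/ProtonOnWebServer',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ action: 'ChangeState', state: state })
  };
}

// Returns the two requests in the order they should be sent: stop, then start.
function restartRequests(host) {
  return [buildStateChange(host, 'stop'), buildStateChange(host, 'start')];
}
```

Each request should be sent only after the previous one has returned, so the engine is fully stopped before it is started again.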