Google Cloud Function: Expose Secret as Environment Variable? - google-cloud-functions

I have set up a few Google Cloud Functions that access various APIs in their implementation. Naturally, these APIs require tokens or username/passwords to work. I have created these secrets in Google Cloud Secret Manager and can successfully access them via the Cloud Function using the Google Cloud Console UI.
My question is not about implementation but about the difference between the two reference methods:
Mounting the secret as a volume?
Exposing the secret as an environment variable?
All my functions use the second option. Is this a bad practice and/or does it create a security leak? I did a search and couldn't find anything definitive, and Google's documentation doesn't say anything about the differences. The word "expose" has me worried that my secrets would be accessible by others. I would love a list of pros and cons of each that I and future users could reference.
Thank you!

Using Secret Manager is a good practice.
The primary difference between mounting a secret as a volume versus as an environment variable is the access method and when the secret is read from Secret Manager.
Mounting a secret as a volume reads the secret each time the volume/file is read. If you are referencing the latest tag, updates to secrets will update the secret in Functions the next time you read the volume/file.
Exposing a secret as an environment variable reads the secret at instance cold start. That means if you update the secret, the Function instance will continue to use the last value even if you specify latest. Only on instance cold start is the new secret read from Secret Manager. If you have multiple function instances running, some might use the previous value and some might use the current value. That depends on when each Function instance was started.
Mounting a secret as a volume can be more expensive because the secret might be read more often.
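To make the difference concrete, here is a minimal Python sketch of the two access patterns (the env var name API_TOKEN and the mount path /secrets/api_token are placeholders, not anything your deployment defines):

import os

# Environment variable: the value was injected at instance cold start and
# stays the same for the life of that instance.
api_token_from_env = os.environ.get("API_TOKEN")

# Volume mount: the file is read on every access, so a new secret version
# (when referencing "latest") is picked up on the next read.
def read_mounted_secret(path="/secrets/api_token"):
    with open(path) as f:
        return f.read().strip()

api_token_from_file = read_mounted_secret()

In both cases the value comes from Secret Manager at runtime; "expose" here only means the value is made available inside your function's environment, not that it is published anywhere.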


Kubernetes :: web interface to start a pod

Background:
As a backoffice service for our insurance mathematicians, a daily cronjob runs a pod.
Inside the pod, fairly complex future simulations take place.
The pod has two containers, an application server and a db server.
The process has a few variables which are fed into the pod.
This is done via ConfigMaps and container env variables.
When the pod is ready after approx. 10 hours, it copies the resulting database to another database
and then it's done. It runs daily because market data changes daily. And we also daily check our new codebase.
Great value, high degree of standardisation, fully automated.
So far so good.
But it uses the same configuration every time it runs.
Now what?
Our mathematicians would like to be able to start the pod themselves, feeding their own configuration data into it.
For example on a webpage with configurable input data.
Question:
Is there an existing Kubernetes framework implementing this?
"Provide a webpage with configurable input fields which are transformed into configmaps and env variables starting the pod"?
Sure, not too difficult to write.
But we do cloud native computing also because we want to reuse solutions of general problems and not write it ourselves if possible.
Thanks for any hints in advance.
They can start a Kubernetes Job for one-time tasks. Apart from the Google Cloud Console UI, I'm not aware of a UI where you can configure fields for a ConfigMap. Maybe you can write a custom Python script that launches these jobs.
https://kubernetes.io/docs/concepts/workloads/controllers/job/
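If you do end up scripting it yourself, a rough sketch with the official Kubernetes Python client could look like the following (the namespace, resource names and image are placeholders; treat this as an illustration of the Job-plus-ConfigMap idea, not a finished solution):

from kubernetes import client, config

def launch_simulation(params: dict):
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

    core = client.CoreV1Api()
    batch = client.BatchV1Api()

    # Turn the user-supplied fields into a ConfigMap.
    core.create_namespaced_config_map(
        namespace="default",
        body=client.V1ConfigMap(
            metadata=client.V1ObjectMeta(name="sim-config"),
            data={k: str(v) for k, v in params.items()},
        ),
    )

    # One-time Job whose container receives the ConfigMap as env variables.
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="simulation-job"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="simulation",
                        image="registry.example.com/simulation:latest",
                        env_from=[client.V1EnvFromSource(
                            config_map_ref=client.V1ConfigMapEnvSource(name="sim-config"),
                        )],
                    )],
                ),
            ),
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)

A small web frontend (for example a simple form handler) could collect the input fields and call launch_simulation with them.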

AWS Cognito to authenticate App users and retrieve settings from MySQL database

I am doing some research for a mobile app I want to develop, and was wondering whether I could get feedback on the following architecture. Within my future app users should be able to authenticate and register themselves via the mobile app and retrieve and use their settings after a successful authentication.
What I am looking for is an architecture in which user accounts are managed by AWS Cognito, but all application related information is stored in a MySQL database hosted somewhere else.
Why host the database outside of AWS? Because of high costs / vendor lock-in / for the sake of learning about architecture rather than going all-in on AWS or Azure
Why not build the identity management myself? Because in the end I want to focus on the app and not spend a lot of energy on something that AWS can already provide me with (yeah, I know, not quite in line with my argument above, but otherwise all my time goes into the database AND IAM).
One of my assumptions in this design (please correct me if I am wrong) is that it is only possible to retrieve data from a MySQL database with 'fixed credentials'. Therefore, I don't want the app (the user's device) to make these queries (but do this on the server instead) as the credentials to the database would otherwise be stored on the device.
Also, to make it (nearly) impossible for users to run queries on the database with a fake identity, I want the server to retrieve the user ID from AWS Cognito (rather than using the ID token from the device) and use this in the SQL query. This should protect the service from a fake user ID being injected by the device/user.
Are there functionalities I have missed in any of these components that could make my design less complicated or which could improve the flow?
Is that API (the one in step 3) managed by AWS API Gateway? If so, your Cognito user pool can be set as an authorizer in API Gateway, and the gateway will then take care of the token verification automatically (authorizers enable you to control access to your APIs using Amazon Cognito user pools or a Lambda function).
You can also do the token verification in a Lambda if you need to verify something else in the token.
Regarding the connection between Node.js (assuming that is an AWS Lambda) and the database, that will work fine, but keep security in mind as your customers' data will travel outside AWS, and try to use tools like AWS Secrets Manager to keep your database passwords safe and rotate them from time to time in your Lambda.
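To illustrate the last two points, here is a hedged Python sketch of a Lambda behind an API Gateway Cognito authorizer (with Lambda proxy integration): the verified user ID is taken from the request context rather than from the request body, and the database credentials come from Secrets Manager. The secret name, table and column names are placeholders.

import json
import boto3
import pymysql

secrets = boto3.client("secretsmanager")

def handler(event, context):
    # The Cognito authorizer has already validated the token, so these claims
    # are trustworthy; never use a user ID sent in the request body.
    user_id = event["requestContext"]["authorizer"]["claims"]["sub"]

    creds = json.loads(secrets.get_secret_value(SecretId="prod/mysql")["SecretString"])
    conn = pymysql.connect(host=creds["host"], user=creds["username"],
                           password=creds["password"], database=creds["dbname"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT settings FROM user_settings WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
    finally:
        conn.close()

    return {"statusCode": 200,
            "body": json.dumps({"settings": row[0] if row else None})}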

Where to store key-value pairs for runtime retrieval from within Cloud Function?

In a Cloud Function I need to retrieve a bunch of key-value pairs to process. Right now I'm storing them as a JSON file in Cloud Storage.
Is there any better way?
Environment variables don't suit this, as (a) there are too many key-value pairs, (b) the same function may need different sets of pairs depending on the incoming params, and (c) those pairs can change over time.
BigQuery seems to be overkill, also given that some values have a few levels of nesting.
Thanks!
You can use Memorystore, but it's not persistent; see the FAQ.
Cloud Memorystore for Redis provides a fully managed in-memory data store service built on scalable, secure, and highly available infrastructure managed by Google. Use Cloud Memorystore to build application caches that provide sub-millisecond data access. Cloud Memorystore is compatible with the Redis protocol, allowing easy migration with zero code changes.
Serverless VPC Access enables you to connect from the Cloud Functions environment directly to your Memorystore instances.
Note: Some resources, such as Memorystore instances, require connections to come from the same region as the resource.
Update
For persistent storage you could use Firestore.
See the tutorial Use Cloud Firestore with Cloud Functions.
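For example, a minimal sketch of reading a set of key-value pairs from Firestore inside a Python Cloud Function (the collection name kv_sets and the request field kv_set are placeholders):

from google.cloud import firestore

db = firestore.Client()  # uses the function's service account

def get_kv_set(set_name: str) -> dict:
    # One document per KV set; nested maps come back as nested dicts,
    # which covers the few levels of nesting you mention.
    snapshot = db.collection("kv_sets").document(set_name).get()
    return snapshot.to_dict() or {}

def handler(request):
    params = request.get_json(silent=True) or {}
    kv = get_kv_set(params.get("kv_set", "default"))
    # ... process using kv ...
    return kv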

designing an agnostic configuration service

Just for fun, I'm designing a few web applications using a microservices architecture. I'm trying to determine the best way to do configuration management, and I'm worried that my approach for configuration may have some enormous pitfalls and/or something better exists.
To frame the problem, let's say I have an authentication service written in C++, an identity service written in Rust, an analytics service written in Haskell, some middle tier written in Scala, and a frontend written in JavaScript. There would also be the corresponding identity DB, auth DB, analytics DB (maybe a Redis cache for sessions), etc. I'm deploying all of these apps using Docker Swarm.
Whenever one of these apps is deployed, it necessarily has to discover all the other applications. Since I use Docker Swarm, discovery isn't an issue as long as all the nodes share the requisite overlay network.
However, each application still needs the upstream service's host_addr, maybe a port, the credentials for some DB or sealed service, etc.
I know Docker has secrets, which enable apps to read configuration from within the container, but I would then need to write a configuration parser in each language for each service. That seems messy.
What I would rather do is have a configuration service, which maintains knowledge about how to configure all other services. So, each application would start with some RPC call designed to get the configuration for the application at runtime. Something like
int main() {
    AppConfig cfg = configClient.getConfiguration("APP_NAME");
    // do application things... and pass around cfg
    return 0;
}
The AppConfig would be defined in an IDL, so the class would be instantly available and language agnostic.
This seems like a good solution, but maybe I'm really missing the point here. Even at scale, tens of thousands of nodes can be served easily by a few configuration services, so I don't foresee any scaling issues. Again, it's just a hobby project, but I like thinking about the "what-if" scenarios :)
How are configuration schemes handled in microservices architecture? Does this seem like a reasonable approach? What do the major players like Facebook, Google, LinkedIn, AWS, etc... do?
Instead of building a custom configuration management solution, I would use one of these existing ones:
Spring Cloud Config
Spring Cloud Config is a config server written in Java offering an HTTP API to retrieve the configuration parameters of applications. Obviously, it ships with a Java client and a nice Spring integration, but as the server is just an HTTP API, you may use it with any language you like. The config server also features symmetric / asymmetric encryption of configuration values.
Configuration Source: The externalized configuration is stored in a Git repository which must be made accessible to the Spring Cloud Config server. The properties in that repository are then accessible through the HTTP API, so you can even consider implementing an update process for configuration properties.
Server location: Ideally, you make your config server accessible through a domain (e.g. config.myapp.io), so you can implement load-balancing and fail-over scenarios as needed. Also, all you need to provide to all your services then is just that exact location (and some authentication / decryption info).
Getting started: You may have a look at this getting started guide for centralized configuration on the Spring docs or read through this Quick Intro to Spring Cloud Config.
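Because the config server is just an HTTP API, a non-Java service can fetch its properties with a plain HTTP call. A hedged Python sketch against the standard /{application}/{profile} endpoint (config.myapp.io is the placeholder domain from above):

import requests

def fetch_config(app: str, profile: str = "default") -> dict:
    # Spring Cloud Config serves the resolved properties as JSON.
    resp = requests.get(f"https://config.myapp.io/{app}/{profile}", timeout=5)
    resp.raise_for_status()
    properties = {}
    # propertySources are ordered most-specific first, so keep the first value seen.
    for source in resp.json().get("propertySources", []):
        for key, value in source["source"].items():
            properties.setdefault(key, value)
    return properties

cfg = fetch_config("auth-service")
db_host = cfg.get("db.host")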
Netflix Archaius
Netflix Archaius is part of the Netflix OSS stack and "is a Java library that provides APIs to access and utilize properties that can change dynamically at runtime".
While limited to Java (which does not quite match the polyglot context you asked about), the library is capable of using a database as the source for the configuration properties.
confd
confd keeps local configuration files up-to-date using data stored in external sources (etcd, consul, dynamodb, redis, vault, ...). After configuration changes, confd restarts the application so that it can pick up the updated configuration file.
In the context of your question, this might be worthwhile to try as confd makes no assumption about the application and requires no special client code. Most languages and frameworks support file-based configuration so confd should be fairly easy to add on top of existing microservices that currently use env variables and did not anticipate decentralized configuration management.
I don't have a good solution for you, but I can point out some issues for you to consider.
First, your applications will presumably need some bootstrap configuration that enables them to locate and connect to the configuration service. For example, you mentioned defining the configuration service API with IDL for a middleware system that supports remote procedure calls. I assume you mean something like CORBA IDL. This means your bootstrap configuration will not be just the endpoint to connect to (specified perhaps as a stringified IOR or a path/in/naming/service), but also a configuration file for the CORBA product you are using. You can't download that CORBA product's configuration file from the configuration service, because that would be a chicken-and-egg situation. So, instead, you end up with having to manually maintain a separate copy of the CORBA product's configuration file for each application instance.
Second, your pseudo-code example suggests that you will use a single RPC invocation to retrieve all the configuration for an application in a single go. This coarse level of granularity is good. If, instead, an application used a separate RPC call to retrieve each name=value pair, then you could suffer major scalability problems. To illustrate, let's assume an application has 100 name=value pairs in its configuration, so it needs to make 100 RPC calls to retrieve its configuration data. I can foresee the following scalability problems:
Each RPC might take, say, 1 millisecond round-trip time if the application and the configuration server are on the same local area network, so your application's start-up time is 1 millisecond for each of 100 RPC calls = 100 milliseconds = 0.1 second. That might seem acceptable. But if you now deploy another application instance on another continent with, say, a 50 millisecond round-trip latency, then the start-up time for that new application instance will be 100 RPC calls at 50 milliseconds latency per call = 5 seconds. Ouch!
The need to make only 100 RPC calls to retrieve configuration data assumes that the application will retrieve each name=value pair once and cache it in, say, an instance variable of an object, and then later access the name=value pair via that local cache. However, sooner or later somebody will call x = cfg.lookup("variable-name") from inside a for-loop, and this means the application will be making an RPC every time around the loop. Obviously, this will slow down that application instance, but if you end up with dozens or hundreds of application instances doing that, then your configuration service will be swamped with hundreds or thousands of requests per second, and it will become a centralised performance bottleneck.
You might start off writing long-lived applications that do 100 RPCs at start-up to retrieve configuration data, and then run for hours or days before terminating. Let's assume those applications are CORBA servers that other applications can communicate with via RPC. Sooner or later you might decide to write some command-line utilities to do things like: "ping" an application instance to see if it is running; "query" an application instance to get some status details; ask an application instance to gracefully terminate; and so on. Each of those command-line utilities is short-lived; when they start up, they use RPCs to obtain their configuration data, then do the "real" work by making a single RPC to a server process to ping/query/kill it, and then they terminate. Now somebody will write a UNIX shell script that calls those ping and query commands once per second for each of your dozens or hundreds of application instances. This seemingly innocuous shell script will be responsible for creating dozens or hundreds of short-lived processes per second, and each of those short-lived processes will make numerous RPC calls to the centralised configuration server to retrieve name=value pairs one at a time. That sort of shell script can put a massive load on your centralised configuration server.
I am not trying to discourage you from designing a centralised configuration server. The above points are just warning about scalability issues you need to consider. Your plan for an application to retrieve all its configuration data via one coarse-granularity RPC call will certainly help you to avoid the kinds of scalability problems I mentioned above.
To provide some food for thought, you might want to consider a different approach. You could store each application's configuration files on a web server. A shell start script "wrapper" for an application can do the following (a sketch follows the list):
Use wget or curl to download "template" configuration files from the web server and store the files on the local file system. A "template" configuration file is a normal configuration file but with some placeholders for values. A placeholder might look like ${host_name}.
Also use wget or curl to download a file containing search-and-replace pairs, such as ${host_name}=host42.pizza.com.
Perform a global search-and-replace of those search-and-replace terms on all the downloaded template configuration files to produce the configuration files that are ready to use. You might use UNIX shell tools like sed or a scripting language to perform this global search-and-replace. Alternatively, you could use a templating engine like Apache Velocity.
Execute the actual application, using a command-line argument to specify the path/to/downloaded/config/files.
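A compact Python version of such a wrapper might look like this (the config server URL, file names, command name and the ${...} placeholder syntax are illustrative assumptions):

import os
import urllib.request

CONFIG_BASE = "https://config.example.com/myapp"

def download(name: str) -> str:
    with urllib.request.urlopen(f"{CONFIG_BASE}/{name}") as resp:
        return resp.read().decode()

# Fetch the template configuration and the search-and-replace pairs.
template = download("app.conf.template")
pairs = dict(line.split("=", 1)
             for line in download("replacements.txt").splitlines() if line)

# Substitute placeholders such as ${host_name}.
for placeholder, value in pairs.items():
    template = template.replace(placeholder, value)

with open("/etc/myapp/app.conf", "w") as f:
    f.write(template)

# Hand over to the real application, pointing it at the rendered config file.
os.execvp("myapp", ["myapp", "--config", "/etc/myapp/app.conf"])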

Are service discovery tools responsible for providing service credentials as well?

I am trying to understand some basic concepts as I am still new to the concept of service discovery and cloud programming (excuse the cliché). A question that has been in my head for some time is: are service discovery solutions like Consul, etcd & Zookeeper responsible for providing service credentials as well?
For example, if we have a web application which queries information about the location of database server(s), who is responsible for providing it with the credentials (username, password) for connecting to it? I do know that this is probably subjective but I would be glad to learn more about best practices related to that.
Indeed, see Consul and Vault. Now, for the reasoning: service registries typically don't come with a full-fledged set of ACLs, etcetera, to protect secrets, plus they gossip said secrets around the network and dump them left and right on disk; it's a security nightmare. You want to make sure that access is as limited as possible, strictly on a need-to-know basis. Therefore, use a tool specifically built for that: Hardware Security Modules, Vault, Chef encrypted data bags, and so on.
The tools Consul and Vault do exactly the split you propose: Consul for service discovery, Vault for sharing secrets (and implementing something like a lease and dynamic secrets).
Please read about these two tools to see if this concept works for you.
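For a feel of how that split looks in code, here is a minimal Python sketch using python-consul for discovery and the hvac client for Vault (the addresses, service name, secret path and field names are all placeholders):

import consul
import hvac

# Discover the database endpoint via Consul (only healthy instances).
c = consul.Consul(host="consul.service.internal")
_, nodes = c.health.service("mysql", passing=True)
db_host = nodes[0]["Service"]["Address"] or nodes[0]["Node"]["Address"]
db_port = nodes[0]["Service"]["Port"]

# Fetch the credentials from Vault (KV v2 secrets engine assumed).
vault = hvac.Client(url="https://vault.service.internal:8200", token="<app token>")
secret = vault.secrets.kv.v2.read_secret_version(path="webapp/db")
username = secret["data"]["data"]["username"]
password = secret["data"]["data"]["password"]

In practice you would give each service a short-lived Vault token (or use dynamic database credentials with a lease) rather than a static one.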