GCE randomly changing disk names for additional mounted disks under /dev/disk/by-id? - google-compute-engine

I see an apparent random problem about once a month that is doing my head in. Google appears to be changing the naming convention for additional disks (to root) and how they are presented under /dev/disk/by-id/ at boot.
All the time the root disk is available as /dev/disk/by-id/google-persistent-disk-0
MOST of the time the single extra disk we mount is presented as /dev/disk/by-id/google-persistent-disk-1
We didn't give this name but we wrote our provisioning scripts to expect this convention.
Every now and then, on rebooting the VM, our startup scripts fail in executing a safe mount:
/usr/share/google/safe_format_and_mount -m "mkfs.ext4 -F" /dev/disk/by-id/google-persistent-disk-1 /mountpoint
They fail because something has changed the name of the disk. It's no longer /dev/disk/by-id/google-persistent-disk-1; it's now /dev/disk/by-id/google-{the name we gave it when we created it}
Last time I updated our startup scripts to use this new naming convention it switched back an hour later. WTF?
Any clues appreciated. Thanks.

A naming convention beyond your control is not a stable API. You should not write your management tooling to assume this convention will never be changed -- as you can see, it's changing for reasons you have nothing to do with, and it's likely that it will change again. If you need access to the list of disks on the system, you should query it through udev, or you can consider using /dev/disk/by-uuid/ which will not change (because the UUID is generated at filesystem creation) instead of /dev/disk/by-id/.
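As a rough sketch of not depending on a single name, a provisioning script could try several candidate paths (for example a by-uuid path first, then the by-id names seen so far) and use the first one that exists. The function name and the candidate paths in the usage note are hypothetical, not something GCE provides.

```python
import os
from typing import Iterable, Optional


def first_existing_device(candidates: Iterable[str]) -> Optional[str]:
    """Return the resolved device path of the first candidate that exists."""
    for path in candidates:
        if os.path.exists(path):
            # Entries under /dev/disk/by-* are symlinks; resolve to the real node.
            return os.path.realpath(path)
    return None
```

A startup script might call it with something like `["/dev/disk/by-uuid/<your-uuid>", "/dev/disk/by-id/google-persistent-disk-1"]` (placeholders) and refuse to mount when it returns `None`.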

Related

Google cloud compute instance metrics taking up disk space

I have a google cloud compute instance set up but it's getting low on disk space. It looks like it is the /mnt/stateful_partition/var/lib/metrics directory taking up a significant amount of space (3+gb). I assume this is the compute metrics but I can't find any way to safely remove these other than just deleting the files. Is this going to cause any issues?
The path you are referring to is one of the file system directories used by the GCE VM instance, and you are correct that the metrics folder is safe to remove. To learn more about these directories, see Disks and file system overview.
I would also suggest creating a snapshot first if you want to make sure the changes you make on your instance won't adversely affect your system. That way you can easily revert to your previous instance state.

Using Consul for dynamic configuration management

I am working on designing a little project where I need to use Consul to manage application configuration in a dynamic way so that all my app machines can get the configuration at the same time without any inconsistency issue. We are using Consul already for service discovery purpose so I was reading more about it and it looks like they have a Key/Value store which I can use to manage my configurations.
All our configurations are json file so we make a zip file with all our json config files in it and store the reference from where you can download this zip file in a particular key in Consul Key/Value store. And all our app machines need to download this zip file from that reference (mentioned in a key in Consul) and store it on disk on each app machine. Now I need all app machines to switch to this new config at the same time approximately to avoid any inconsistency issue.
Let's say I have 10 app machines, and all 10 need to download the zip file containing my configs and then switch to the new configs at the same time, atomically, to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with, but I am confused about how loading the new files into memory and switching to the new configs will work:
All 10 machines are already up and running with default config files as of now which is also there on the disk.
Some outside process will update the key in my consul key/value store with latest zip file reference.
All the 10 machines have a watch on that key so once someone updates the value of the key, watch will be triggered and then all those 10 machines will download the zip file onto the disk and uncompress it to get all the config files.
(..)
(..)
(..)
Now this is where I am confused about how the remaining steps should work.
How should the apps load these config files into memory and then all switch at the same time?
Do I need to use leadership election with consul or anything else to achieve any of these things?
What will be the logic around this since all 10 apps are already running with default configs in memory (which is also stored on disk). Do we need two separate directories one with default and other for new configs and then work with these two directories?
Let's say if this is the node I have in Consul just a random design (could be wrong here) -
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
where path will have new zip file reference and machines could be a key here where I can have list of all machines so now I can put each machine ip address as soon as they have downloaded the file successfully in that key? And once machines key list has size of 10 then I can say we are ready to switch? If yes, then how can I atomically update machines key in that node? Maybe this logic is wrong here but I just wanted to throw out something. And also need to clean up all those machines list after switch since for the next config update I need to do similar exercise.
Can someone outline the logic on how can I efficiently manage configuration on all my app machines dynamically and also avoid inconsistency issue at the same time? Maybe I need one more node as status which can have details about each machine config, when it downloaded, when it switched and other details?
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all, just store the config directly in the consul kv store. And I'm not talking about a single key that maps to the entire json (I'm assuming your json is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the json (this way you won't need to pull the whole thing every time you make a query to consul).
If you get the config directly from consul, your consistency guarantees match consul consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config, that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistency you are willing to tolerate.
If you can stand a few seconds of inconsistency, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'll still have the watch on consul, but you combine it with evicting your in-memory cache every few seconds, as a fallback in case the watch fails (or stalls) for some reason. This should give you a worst case of a few seconds of inconsistency (depending on the value you set for your TTL), but the normal case (I think) should be fast.
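A minimal sketch of the TTL idea, assuming the config is produced by some `loader` callable (hypothetical; in practice it would fetch from consul or read the unzipped files). `TTLConfig` and its method names are made up for illustration.

```python
import time
from typing import Any, Callable, Optional


class TTLConfig:
    """Caches a loaded config and reloads it once it is older than the TTL."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded_at: Optional[float] = None

    def invalidate(self) -> None:
        """Call from the watch handler to force a reload on the next get()."""
        self._loaded_at = None

    def get(self) -> Any:
        now = self._clock()
        # Reload when nothing is cached yet or the cached copy has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

The watch calls `invalidate()` in the normal case; the TTL only matters when the watch misses an update, which bounds the staleness to roughly `ttl_seconds`.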
If that's not acceptable (does downloading the zip take a lot of time, maybe?), you can go down the route you mentioned. To update a value atomically you can use their cas (check-and-set) operation. The write is rejected if the key was modified after you last read it. When that happens, you need to pull the list of machines again, apply your change to it, and retry (until it succeeds).
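The read, modify, check-and-set, retry loop can be sketched as follows. `FakeKV` is an in-memory stand-in written for this example, not the consul client; with real consul the same loop would go through its KV API, passing the index you read as the `cas` value. The colon-separated list follows the format in the question.

```python
import threading
from typing import Dict, Set, Tuple


class FakeKV:
    """In-memory stand-in for a KV store with check-and-set semantics."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, str]] = {}
        self._index = 0

    def get(self, key: str) -> Tuple[int, str]:
        """Return (modify_index, value); index 0 means the key is absent."""
        with self._lock:
            return self._data.get(key, (0, ""))

    def cas(self, key: str, value: str, index: int) -> bool:
        """Write only if the key's current index still equals `index`."""
        with self._lock:
            current, _ = self._data.get(key, (0, ""))
            if current != index:
                return False
            self._index += 1
            self._data[key] = (self._index, value)
            return True


def register_machine(kv: FakeKV, key: str, ip: str,
                     max_retries: int = 100) -> Set[str]:
    """Add `ip` to the colon-separated list at `key`, retrying on conflict."""
    for _ in range(max_retries):
        index, raw = kv.get(key)
        machines = set(raw.split(":")) if raw else set()
        machines.add(ip)
        if kv.cas(key, ":".join(sorted(machines)), index):
            return machines
    raise RuntimeError("gave up after too many conflicting updates")
```

Each machine calls `register_machine` after a successful download; whoever sees the returned set reach the expected size knows every machine is ready.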
I don't see why you would need 2 directories, but maybe I'm misunderstanding the question: when your app starts, before you do anything else, you check if there's a new config and if there is you download it and load it to memory. So you shouldn't have a "default config" if you want to be consistent. After you downloaded the config on startup, you're up and alive. When your watch signals a key change you can download the config to directly override your old config. This is assuming you're running the watch triggered code on a single thread, so you're not going to be downloading the file multiple times in parallel. If the download failed, it's not like you're going to load the corrupt file to your memory. And if you crashed mid-download, then you'll download again on startup, so should be fine.
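One variation on overwriting the old config in place, sketched under the assumption that the config ends up as a single JSON file: write to a temporary file in the same directory and rename it over the old one, so a crash mid-write never leaves a half-written config on disk. The function name is hypothetical.

```python
import json
import os
import tempfile
from typing import Any


def write_config_atomically(path: str, config: Any) -> None:
    """Replace the file at `path` with `config` serialized as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old config untouched and remove the partial temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers of `path` then see either the complete old file or the complete new one.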

How do I make a snapshot of my boot disk?

I've read multiple times that I can cause read/write errors if I create a snapshot. Is it possible to create a snapshot of the disk my machine is booted off of?
It depends on what you mean by "snapshot".
A snapshot is not a backup; it is a way of temporarily capturing the state of a system so you can make changes, test the results, and revert to the previously known good state if the changes cause issues.
How to take a snapshot varies depending on the OS you're using, whether you're talking about a physical or a virtual system, which virtualization platform you're using, which image types you're using for disks within that platform, and so on.
Once you have a snapshot, then you can make a real backup from the snapshot. You'll want to make sure that if it's a database server that you've flushed everything to disk and then write lock it for the time it takes to make the snapshot (typically seconds). For other systems you'll similarly need to address things in a way that ensures that you have a consistent state.
If you want to make a complete backup of your system drive, directly rather than via a snapshot then you want to shut down and boot off an alternate boot device like a CD or an external drive.
If you don't do that, and try to directly back up a running system then you will be leaving yourself open to all manner of potential issues. It might work some of the time, but you won't know until you try and restore it.
If you can provide more details about the system in question, then you'll get more detailed answers.
As far as moving apps and data to different drives, data is easy provided you can shut down whatever is accessing the data. If it's a database, stop the database, move the data files, tell the database server where to find its files and start it up.
For applications, it depends. Often it doesn't matter and it's fine to leave it on the system disk. It comes down to how it's being installed.
It looks like that works a little differently. The first snapshot will create an entire copy of the disk and subsequent snapshots will act like ordinary snapshots. This means it might take a bit longer to do the first snapshot.
According to this, you ideally want to shut down the system before taking a snapshot of your boot disk. If you can't do that for whatever reason, then you want to minimize the amount of writes hitting the disk and then take the snapshot. Assuming you're using a journaling filesystem (ext3, ext4, xfs etc.) it should be able to recover without issue.
You can use the GCE APIs. Use the Disks:insert API to create the persistent disk. There are code examples on how to start an instance using Python, and Google has libraries for other programming languages such as Java and PHP.

Using Chef/Puppet and managing hand-made changes

I'm running a complex server setup for a de facto high-availability service. So far it takes me about two days to set everything up, so I would like to automate the provisioning.
However, I make quite a lot of manual changes to running server(s). A typical example is changing a firewall configuration to cope with various hacking attempts, packet floods etc. Being able to work on active nodes quickly is important. Also, the server maintains a lot of active TCP connections, and losing those for a simple config change is out of the question.
I don't understand if either Chef or Puppet is designed to deal with this. Once I change some system config, I would like to store it somewhere and use it while the next instance is being provisioned. Should I stick with one of those tools or choose a different one?
Hand made changes and provisioning don't take hands. They don't even drink tea together.
At work we use puppet to manage our entire architecture, and like you we need to make hand-made changes in a hurry due to performance bottlenecks, attacks, etc.
What we do is first make sure puppet is able to set up every single part of the architecture, ready to be delivered without any specific tuning.
Then, when we need to make hand-made changes in a hurry: as long as you don't touch files managed by puppet, there's no risk. If the file we need to change is managed by puppet, we just stop the puppet agent and do whatever we need.
Once the emergency is over, we ask ourselves the following questions:
Should these changes be applied to all servers with the same symptoms?
If so, you can develop what puppet calls 'facts': code that runs on the agent on each run and saves its results in variables available in all your puppet modules. For example, if you changed the ip conntrack max value because a firewall was not able to deal with all its connections, you could easily (ten lines of code) give puppet a variable holding the current conntrack count on each run, and tell puppet to set a max value relative to current usage. All other servers then benefit from this tuning, and you likely won't have to deal with conntrack issues anymore (as long as you keep running puppet at a short interval, which is the default).
Should these changes always be applied by hand in particular emergencies?
If the configuration is managed by puppet, find a way to make the configuration include another file and tell puppet to ignore that file. This is the easiest way, but it's not always possible (e.g. /etc/network/interfaces does not support includes). If it's not possible, you will have to stop the puppet agent during emergencies so that you can change puppet-managed files without the risk of your changes being removed on the next puppet run.
Are these changes only for this host, and will no other host ever need them?
Add it to puppet anyway! Place a sweet if $fqdn == my.very.specific.host and put whatever you need inside it. Even for a single case it's always beneficial (though time-consuming) to migrate every change you make on a server into puppet, as it will allow you to do a full restore of the server setup if for some reason your server crashes into an unrecoverable state (e.g. hardware issues).
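As an illustration of the conntrack example, here is a sketch of an external fact: an executable script that prints `key=value` lines for Facter to pick up. This swaps in an external fact rather than the Ruby custom fact the answer most likely had in mind, and the fact name `conntrack_count` is made up. It assumes a Linux host where the conntrack module exposes its counter under /proc.

```python
from typing import Optional

# Kernel counter for currently tracked connections (present when conntrack is loaded).
CONNTRACK_COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path: str = CONNTRACK_COUNT_PATH) -> Optional[int]:
    """Return the current conntrack entry count, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def fact_line(count: Optional[int]) -> str:
    """Format the fact as a key=value line; empty string when unknown."""
    return "" if count is None else f"conntrack_count={count}"


if __name__ == "__main__":
    line = fact_line(read_conntrack_count())
    if line:
        print(line)
```

A manifest could then derive the max value from that fact instead of hard-coding it.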
In summary:
For me the trick in dealing with hand-made changes is putting a lot of effort into reasoning about how you decided to make the change and, after the emergency is over, moving that logic into puppet. If you felt something was wrong because, for a given piece of software, all slots were in use while free memory was still available on the server, so it was reasonable to allow more slots to deal with the traffic peak, then spend some time moving that logic into puppet. Very carefully of course, and it is as time-consuming as the number of different scenarios in your architecture you want to test it against, but in the end it's very, VERY rewarding.
I would like to complete Valor's excellent answer.
puppet is a tool to enforce a configuration. So you must think of it this way :
on the machine I run puppet onto...
I ask puppet client...
to ensure that the config of the current machine...
is as specified in the puppet config...
which is taken from a puppet server, or directly from a bunch of puppet files (easier)
So to answer one of your questions, puppet doesn't require a machine or a service reboot. But if a change in a config file you set with puppet requires a restart of the corresponding service/daemon/app, then there is no way to avoid it. There are ways in puppet to declare that a service needs to be restarted when its config changes. Of course, puppet will not restart the service if it sees that nothing changed.
Valor is assuming you use puppet in the client/server way, with (for example) puppet clients polling a puppet server for config every hour. But it is also possible to move your puppet files from machine to machine, for example with git, and launch puppet manually. This way:
is far simpler than the client/server technique (authentication is a headache)
only enforces config changes when you explicitly ask for it, thus avoiding any overwrite of your hand-made changes
This is obviously not the best way to use puppet if you manage a lot of machines, but it may be a good start or a good transition.
And also, puppet is very hard to learn at an interesting level. It took me 2 weeks to be able to automatically install an AWS server from scratch. I don't regret it, but you may want to know that fact if you must convince a boss to allocate you time.

Git environment setup. Advice needed

Background info:
We are currently 3 web programmers (good, real-life friends, no distrust issues).
Each programmer SSHes into the single Linux server, where the code resides, under their own username with sudo powers.
We all work on different files at any one time. We sometimes ask the question "Are you in the file __?". We use Vim, so we know whether a file is open or not.
Our development code (no production yet) resides in /var/www/
Our remote repo is hosted on bitbucket.
I am *very* new to Git. I used subversion before but I was basically spoon-fed instructions and was told exactly what to type to sync up codes and commit.
I read about half of Scott Chacon's Pro Git, and that's the extent of my Git knowledge.
In case it matters, we run Ubuntu 11.04, Apache 2.2.17, and Git 1.7.4.1.
So Jan Hudec gave me some advice in the previous question. He told me that it is good practice to do the following:
Each developer has their own repo on their local computer.
Let /var/www/ be the repo on the server. Set the .git folder to permission 770.
That would mean that each developer's computer needs to have its own LAMP stack (or at least Apache, PHP, MySQL, and Python installed).
The codes are mostly JavaScript and PHP files so it's not a big deal to clone it over. However how do we locally manage the database?
In this case, we only have two tables and it'll be simple to recreate the entire database locally (at least for testing). But in the future when the database gets too big, then should we just remotely log on the MySQL database on the server or should we just have a "sample" data for developing and testing purposes?
What you're doing is transitioning from "everybody works together in one environment" to "everybody has their own development environment". The major benefit is everybody won't be stepping on each other's feet.
Other benefits include a heterogeneous development environment. If everyone develops on the same machine, the software will become dependent on that one setup, because developers are lazy. If everyone develops in different environments, even just with slightly different versions of the same stuff, they'll be forced to write more robust code to deal with that.
The main drawback, as you've noticed, is that setting up the environment is harder. In particular, making sure the database works.
First, each developer should have their own database. This doesn't mean they all have to have their own database server (though it's good for heterogeneity), but they should have their own database instance which they control.
Second, you should have a schema and not just whatever's in the database. It should be in a version controlled file.
Third, setting up a fresh database should be automatic. This lets developers set up a clean database with no hassle.
Fourth, you'll need to get interesting test data into that database. Here's where things get interesting...
You have several routes to do that.
First is to make a dump of an existing database which contains realistic data, sanitized of course. This is easy, and provides realistic data, but it is very brittle. Developers will have to hunt around to find interesting data to do their testing. That data may change in the next dump, breaking their tests. Or it just might not exist at all.
Second is to write "test fixtures". Basically each test populates the database with the test data it needs. This has the benefit of allowing the developer to get precisely the data they want, and know precisely the state the database is in. The drawbacks are that it can be very time consuming, and often the data is too clean. The data will not contain all the gritty real data that can cause real bugs.
Third is to not access the database at all and instead "mock" all the database calls. You trick all the methods which normally query a database into instead returning testing data. This is much like writing test fixtures, and has most of the same drawbacks and benefits, but it's FAR more invasive. It will be difficult to do unless your system has been designed to do it. It also never actually tests if your database calls work.
Finally, you can build up a set of libraries which generate semi-random data for you. I call this "The Sims Technique" after the video game where you create fake families, torture them and then throw them away. For example, let's say you have a User object which needs a name, an age, a Payment object and a Session object. To test a User you might want users with different names, ages, ability to pay and login status. To control all that you need to generate test data for names, ages, Payments and Sessions. So you write a function to generate names and one to generate ages. These can be as simple as picking randomly from a list. Then you write one to make you a Payment object and one to make a Session object. By default, all the attributes will be random, but valid... unless you specify otherwise. For example...
# Generate a random login session, but guarantee that it's logged in.
session = Session.sim( logged_in = true )
Then you can use this to put together an interesting User.
# A user who is logged in but has an invalid Visa card
# Their name and age will be random but valid
user = User.sim(
    session = Session.sim( logged_in = true ),
    payment = Payment.sim( invalid = true, type = "Visa" ),
);
This has all the advantages of test fixtures, but since some of the data is unpredictable it has some of the advantages of real data. Adding "interesting" data to your default sim and rand functions will have wide ranging repercussions. For example, adding a Unicode name to random_name will likely discover all sorts of interesting bugs! It unfortunately is expensive and time consuming to build up.
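To make the technique concrete, here is one way the `sim` constructors might look in Python. The class layout, the name list, and the age range are all assumptions made up for this sketch; only the `sim` call shape comes from the examples above.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Deliberately includes non-ASCII names to shake out encoding bugs.
NAMES = ["Alice", "Bob", "Zoë", "José", "李雷"]
CARD_TYPES = ["Visa", "MasterCard", "Amex"]


@dataclass
class Session:
    logged_in: bool

    @classmethod
    def sim(cls, logged_in: Optional[bool] = None) -> "Session":
        if logged_in is None:
            logged_in = random.choice([True, False])
        return cls(logged_in=logged_in)


@dataclass
class Payment:
    type: str
    invalid: bool

    @classmethod
    def sim(cls, type: Optional[str] = None,
            invalid: Optional[bool] = None) -> "Payment":
        return cls(
            type=type if type is not None else random.choice(CARD_TYPES),
            invalid=invalid if invalid is not None else random.choice([True, False]),
        )


@dataclass
class User:
    name: str
    age: int
    session: Session
    payment: Payment

    @classmethod
    def sim(cls, name: Optional[str] = None, age: Optional[int] = None,
            session: Optional[Session] = None,
            payment: Optional[Payment] = None) -> "User":
        # Anything the caller does not pin down is random but valid.
        return cls(
            name=name if name is not None else random.choice(NAMES),
            age=age if age is not None else random.randint(18, 99),
            session=session if session is not None else Session.sim(),
            payment=payment if payment is not None else Payment.sim(),
        )
```

Calling `User.sim(session=Session.sim(logged_in=True), payment=Payment.sim(invalid=True, type="Visa"))` gives the logged-in user with an invalid Visa card from the example, with a random name and age.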
There you have it. Unfortunately there's no easy answer to the database problem, but I implore you to not simply copy the production database as it's a losing proposition in the long run. You'll likely do a hybrid of all the choices: copying, fixtures, mocking, semi-random data.
A few options, in order of increasing complexity:
You all connect to the live master DB, read/write permissions. This is risky, but I guess you're already doing it. Make sure you have backups!
Use test fixtures to populate a local test DB and just use it. Not sure what tools there are for this in the PHP world.
Copy (mysqldump) the master database and import it into your local machines' MySQL instances, then set up your dev environments to connect to your local MySQL. Repeat the dump/import as necessary
Set up one-way replication from the master to your local instances.
Optionally, set up a read-only user on the main DB, and configure your app to let you switch to a read-only connection to the real master DB in case you can't wait for that next copy of the master data.
Having your own repo does not mean having your own staging server (that setup is hard to maintain and scales extremely badly to 10, 20, or 100 developers).
It's always better to have a (semi-)automated build system as early as possible, one that converts repository-stored source data into a live system (less manual work means fewer chances to make non-code errors), and (maybe) some type of Continuous Integration (test often, find bugs fast). For the DB part of the build system you only have to prepare the initial data (table structures, data dumps) as (versioned) text files, which are
easily mergeable
handled, processed, and converted to the final usable object by code, not by hand: no human errors, no operator interference