After spending some time with CoreOS and etcd on EC2, I’ve come away with a few lessons learned:
- Provide stable IP addressing for the etcd nodes. In AWS, this means using a VPC or clustering over public (Elastic) IPs. After stopping and restarting all 3 instances, I was unable to recover the original cluster built from the CloudFormation stack, since all 3 nodes had changed IPs. I feel that there should be a way to recover a cluster with the data dirs (conf + log) intact, but I couldn’t make it work. More research is needed.
- You can (and perhaps should) bootstrap your cluster without a discovery service. The etcd discovery API went down while I was rebuilding my cluster, which gave me a great opportunity to learn how etcd clustering works, and further increased my suspicion of relying on external services for managing infrastructure.
- If you haven’t already, take some time to get familiar with systemd before diving into CoreOS. I hadn’t spent much time with systemd, and ended up detouring to learn the basics so I could troubleshoot simple things like reading logs and managing services.
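On the first point: with CoreOS of this vintage, etcd’s identity and addresses end up in a systemd drop-in generated by cloud-init. A minimal sketch of what that drop-in looks like when the node has a stable VPC address — the node name and 10.0.0.x address here are hypothetical:

```ini
# /run/systemd/system/etcd.service.d/20-cloudinit.conf (written by cloud-init)
[Service]
Environment="ETCD_NAME=node1"                  # hypothetical node name
Environment="ETCD_ADDR=10.0.0.10:4001"         # client API address
Environment="ETCD_PEER_ADDR=10.0.0.10:7001"    # peer address; must survive restarts
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/<token>"
```

If ETCD_PEER_ADDR changes across a stop/start (as it does with ephemeral public IPs), the other nodes keep trying the old address and the cluster can’t re-form.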
For a better understanding of etcd clustering, I recommend reading the etcd clustering and discovery docs a few times.
Also, I’ve created an etcd reference page based on my explorations.
This blog post records the process I went through while trying to get etcd starting (and restarting) in AWS. It’s unlikely you’ll need to perform these steps yourself, since there is little benefit to manually wiping a cluster and starting over as opposed to deploying new instances and building a new cluster from scratch. And if you use static IPs on your CoreOS nodes, you will likely never find yourself in a similar position.
I wanted to start playing around with etcd, CoreOS, and fleet while on vacation, but my MacBook Air lacked the RAM to do this effectively, so I turned to EC2. Not wanting to commit to running 3 m3.medium instances continuously, however, I wanted to make sure I could stop and start all instances without any issues or loss of data. With this goal in mind, I turned to the CoreOS docs to get started.
Getting started with CloudFormation
The CoreOS docs provide a guide and a CloudFormation stack to get started, and getting up and running was easy. After generating a discovery token via etcd’s discovery service, I added it to the cloud-init section and booted the cluster. Feeling proud of my accomplishment, I shut my instances down and went back to my vacation.
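The cloud-config portion of the user data looked roughly like this — a sketch based on the CoreOS guide, with `<token>` left as a placeholder and `$private_ipv4` substituted by CoreOS at boot:

```yaml
#cloud-config
coreos:
  etcd:
    # one fresh token per cluster, from https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
```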
Upon restarting my instances, I noticed that the cluster was unable to start.
Each of the nodes was attempting to connect to the peers found during initial discovery, but all 3 instances had changed IPs in the meantime. And thus began my journey of (re-)discovery.
My first thought was to update the discovery URL and start a new cluster, but it took some digging to find where the discovery URL was being set. Here’s the general path I took.
After peeking through the unit’s drop-in directories, I found that the etcd parameters were being set by cloud-init in /run/systemd/system/etcd.service.d/20-cloudinit.conf.
Completely forgetting how cloud-init works, I started by generating a new discovery token and dropping it straight into 20-cloudinit.conf. After the change, systemd prompted me to reload, and I complied before restarting etcd.
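The edit-and-reload cycle looks roughly like this (systemd warns when a unit has changed on disk, and `daemon-reload` must run before the restart picks up the edit):

```shell
# view the parameters cloud-init wrote for etcd
cat /run/systemd/system/etcd.service.d/20-cloudinit.conf

# after editing the drop-in, re-read unit files, then restart
sudo systemctl daemon-reload
sudo systemctl restart etcd.service
```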
This appeared to work once I’d done it on all 3 nodes: checking the output of the discovery service, I saw that all 3 of my hosts had registered. Once again satisfied with my progress, I shut down my instances and returned to my vacation. Upon restarting them, however, I was once again greeted with a broken cluster.
After some digging, I determined that the discovery URL had reverted to the original token. Seeing the reverted 20-cloudinit.conf, I realized that the cloud-config I had provided with the CloudFormation stack was being reapplied under /run/ on each reboot. To properly update the discovery URL, the user data would need to be updated within CloudFormation.
Reverting to manual discovery
Around this time, I noticed that https://discovery.etcd.io was no longer responding or handing out new tokens. The docs outline a procedure for generating your own token and running your own discovery cluster, but that’s a chicken-and-egg problem if you’re trying to launch your first etcd cluster. No problem; let’s do it the old-fashioned way.
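For reference, generating a token yourself is just minting a UUID and putting it under a discovery prefix you control — a sketch, where myetcd.example.com stands in for your own discovery cluster:

```shell
# mint a token locally; either source yields a standard 36-char UUID
TOKEN=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
echo "discovery url: https://myetcd.example.com/v2/keys/discovery/${TOKEN}"

# per the docs, you would then seed the expected cluster size, e.g.:
# curl -X PUT https://myetcd.example.com/v2/keys/discovery/${TOKEN}/_config/size -d value=3
```

Of course, this still requires an etcd cluster to serve the discovery keyspace, which is exactly what we don’t have yet.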
Since the CloudFormation stack is designed to bootstrap via discovery URL, I decided to delete the stack and build the cluster manually. After launching the 3 instances and logging in, I found that all 3 nodes were leaders of their own one-node clusters. This made sense, since no existing cluster information had been supplied via logs, the discovery service, or explicit peers.
Reading through the etcd docs on clustering and cluster discovery, manually specifying peers looked like what I wanted. My plan was to stop etcd.service via systemd, then manually launch the etcd binary with the correct parameters to update the configuration and join the new cluster. Once everything was reconfigured, I would restart the etcd service and, hopefully, have a working cluster.
In my first attempts, I left the logs and config intact, hoping that etcd would fall through to the -peers option and rebuild. This did not appear to work; in the end, I couldn’t find a way to rebuild the cluster in place and had to wipe the conf/log data before I could get things to work.
Here are the steps and log output I captured from the last successful iteration.
On the first node, stop etcd, wipe the existing config, and restart etcd. By default, etcd starts a new cluster if it is not able to join an existing one.
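A sketch of that reset on node 1 — destructive, and only sensible when the old cluster is already unrecoverable:

```shell
sudo systemctl stop etcd.service
# wipe the persisted cluster state (conf + log) -- this is irreversible
sudo rm -rf /var/lib/etcd/*
sudo systemctl start etcd.service
# with no prior state and no reachable peers, this node starts a new
# cluster and elects itself leader
```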
To test the clustering, I set a key on the first node and watched for it to show up on the others.
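A quick replication check, assuming the stock etcdctl on each node (the key name here is arbitrary):

```shell
# on node 1
etcdctl set /test-key hello
# on nodes 2 and 3, once they have joined, the key should replicate
etcdctl get /test-key
```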
On to the second node. You can cat the cloudinit.conf to quickly spit out the configured etcd name and addresses; you’ll want these to match so that the etcd service later starts with the same parameters. I later found that you can start etcd with -f (force) instead of manually wiping the conf and log files in /var/lib/etcd. Note that starting etcd via sudo leaves the conf and log files owned by root; these need to be set back to etcd:etcd before restarting the service.
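A sketch of joining node 2, using the 0.4-era flag style; the node name and the NODE1_IP/NODE2_IP variables are hypothetical stand-ins for the values pulled from cloudinit.conf. Note the chown at the end, needed because the manual run under sudo leaves root-owned files behind:

```shell
sudo systemctl stop etcd.service
# join the new cluster by pointing at node 1's peer port; -f forces a
# fresh node configuration, discarding the stale conf/log state
sudo /usr/bin/etcd -name node2 \
  -addr ${NODE2_IP}:4001 -peer-addr ${NODE2_IP}:7001 \
  -peers ${NODE1_IP}:7001 -data-dir /var/lib/etcd -f
# after stopping the manual run (ctrl-c), fix ownership before handing
# control back to the systemd service
sudo chown -R etcd:etcd /var/lib/etcd
sudo systemctl start etcd.service
```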
Repeat once more for node 3. On all three nodes, you should then be able to verify that the machines list and the leader agree.
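One way to check, assuming the v2 HTTP API on the default client port:

```shell
# each node should return the same answers from both endpoints
curl -s http://127.0.0.1:4001/v2/machines
curl -s http://127.0.0.1:4001/v2/leader
```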