CoreOS — How to Set Up a Cluster

Last week we learned about CoreOS’ components and the technologies within its ecosystem. This week, we dive straight into the practical part and get our hands dirty. If you read the first post, you already know that CoreOS is an operating system geared towards high availability. Setting up a cluster of nodes requires some configuration, and we’ll guide you through the necessary steps to set up your own CoreOS cluster.

Introduction

Within the CoreOS documentation, you can find various guides to run the operating system in different environments. If you want to run CoreOS on DigitalOcean, Amazon EC2, OpenStack, Rackspace, Google Compute Engine, or bare metal, go ahead and check out the available docs. Of course, you can also run CoreOS in virtualization environments. Since we’re just getting started with CoreOS, we might break things, and therefore we use Vagrant as our tool of choice.

Just to make sure: please break things! It’s the best way to understand how components interact within a CoreOS system. Learning by trial and error is highly appreciated :)

Preparation

CoreOS maintains the coreos-vagrant repository on GitHub, which provides a solid basis to get a cluster up and running within minutes. We’ll use this repository to create a local CoreOS cluster. Of course, you can use this tutorial to configure and run your CoreOS cluster on any cloud platform. We decided to use Vagrant because it’s much easier to work locally on your machine when getting started with a new system.

First, make sure you have the requirements installed:

  1. VirtualBox
  2. Vagrant

If you don’t have git installed, download the code from the coreos-vagrant repository on GitHub as an archive (such as a zip file). Unpack the archive and cd from the command line into the unpacked folder.

If you have git installed, clone the coreos-vagrant repository and cd into it.

With VirtualBox and Vagrant installed, we’re ready to go.

Configuration

Running CoreOS only pays off with at least three machines. Only then do you benefit from one of CoreOS’s main goals: high availability. A cluster of 2 machines gains nothing, because electing a leader requires a majority of votes, and the majority of 2 is 2. As soon as one machine fails, the remaining one holds only half of the votes (50-50) and can’t be elected leader.
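To make the majority argument concrete, here is a minimal shell sketch of the arithmetic. The function names quorum and fault_tolerance are our own and not part of any CoreOS tool. The sketch assumes the usual majority rule of floor(n/2) + 1 votes.

```shell
#!/bin/sh
# Majority (quorum) needed in a cluster of N voting members: floor(N/2) + 1.
# Hypothetical helper, not part of etcd or CoreOS.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that may fail while a majority is still reachable.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "size=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A cluster of 2 tolerates no failure at all, while 3 machines tolerate one. A fourth machine doesn’t raise the number of tolerated failures; only the fifth does.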

CoreOS uses etcd to connect machines within the cluster. Additionally, etcd selects a cluster leader automatically. Every machine that is not a leader is a follower and can accept the leader role if the cluster leader fails due to hardware issues or whatever reason. We’ll explain etcd in more detail within the upcoming article.

Obtain an Etcd Discovery Token

To spin up a cluster easily, etcd uses a discovery token. etcd uses an existing cluster to bootstrap a new one: it obtains a cluster token from the existing cluster and uses it to connect the machines within the new cluster.

You can use this functionality through CoreOS’s public etcd cluster. It exposes a url to obtain a new discovery token. You need to predefine the size of your new cluster. The discovery service uses a default value of 3 if you don’t pass a cluster size as a query parameter.

You obtain a discovery token for a cluster size of 3 by requesting the following url without parameters. The token value is the alphanumeric string at the end of the returned url.

https://discovery.etcd.io/new  

Pass the size of your cluster as a query parameter to the endpoint. Use ?size=n and replace n with your desired cluster size. In this guide, we’ll use a cluster size of 4.

https://discovery.etcd.io/new?size=4  

The returned url including the discovery token:

https://discovery.etcd.io/638fa2b0a1ff50075e170080046c8649  
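If you prefer the command line over the browser, the following sketch shows one way to pull the token out of the returned url. It assumes curl is installed on your machine. The request itself is left as a comment so the snippet runs without network access, and we reuse the example url from above.

```shell
#!/bin/sh
# With network access you could request a fresh url like this (assumes curl):
#   discovery_url=$(curl -s "https://discovery.etcd.io/new?size=4")
# For illustration we use the example url from this guide instead.
discovery_url="https://discovery.etcd.io/638fa2b0a1ff50075e170080046c8649"

# Strip everything up to and including the last slash to get the token.
token="${discovery_url##*/}"
echo "$token"
```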

We’re going to use the discovery url within our #cloud-config. The following section explains the #cloud-config in more detail.

Cloud-Config

CoreOS uses the #cloud-config to configure parameters for services and machines and to launch systemd units on system boot. The coreos-vagrant repository ships a user-data.sample file with predefined #cloud-config content. The project recognizes a user-data file within the root directory. That means you need to either copy user-data.sample to user-data or create a new user-data file.

The content of the user-data file for this guide:

#cloud-config

coreos:  
  etcd2:
    # generate a new token for each unique cluster 
    # from https://discovery.etcd.io/new?size=n where n = cluster size
    # discovery url to bootstrap the cluster
    discovery: https://discovery.etcd.io/638fa2b0a1ff50075e170080046c8649
    # multi-region and multi-cloud deployments need to use $public_ipv4
    # list of member’s client urls to advertise information to the rest of the cluster
    advertise-client-urls: http://$public_ipv4:2379
    # this address is used to communicate etcd data around the cluster
    initial-advertise-peer-urls: http://$private_ipv4:2380
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    # url to listen for client traffic
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    # url to listen for peer traffic
    listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  fleet:
    public-ip: $public_ipv4
  flannel:
    interface: $public_ipv4
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start

If you copy the config above, make sure you replace the token at the end of the discovery url with your own discovery token value. $private_ipv4 and $public_ipv4 are substitution variables which will be replaced by Vagrant with the actual machine-specific values.
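Swapping the discovery url by hand is easy to forget when you recreate the cluster, so here is a small sketch that does it with sed. The helper name set_discovery_url is hypothetical. To stay self-contained, the demo writes a tiny stand-in file to a temporary location instead of touching your real user-data.

```shell
#!/bin/sh
# Hypothetical helper: replace the value of the "discovery:" key in a
# cloud-config file. Comment lines are left untouched because they start
# with "#" rather than with the key itself.
set_discovery_url() {
  file=$1
  url=$2
  tmp=$(mktemp)
  sed "s|^\( *discovery: \).*|\1$url|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a stand-in file (not your real user-data).
demo=$(mktemp)
printf '%s\n' '#cloud-config' 'coreos:' '  etcd2:' \
  '    # discovery url to bootstrap the cluster' \
  '    discovery: https://discovery.etcd.io/old-token' > "$demo"

set_discovery_url "$demo" "https://discovery.etcd.io/638fa2b0a1ff50075e170080046c8649"
cat "$demo"
```

To use it on the real file, you would call set_discovery_url user-data with your own discovery url from within the coreos-vagrant folder.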

config.rb

The coreos-vagrant repository has an existing config.rb.sample file for further cluster configuration. For this guide, we don’t need to copy config.rb.sample to config.rb. The only property we’re going to change is the number of machine instances within the cluster, and we define that value within the Vagrantfile.

Vagrantfile

If you previously worked with Vagrant, you know the syntax and options within a Vagrantfile and the ways to configure machines to your needs. If you’re new to Vagrant, take a look at the Vagrantfile docs to get a basic understanding.

The Vagrantfile within the coreos-vagrant repository is quite complex, so you need at least some fundamentals to understand the details of what’s going on with your machines.

However, if you don’t want to mess with Vagrantfile options, just go ahead and open the file. We only change two values and afterwards kick off the cluster.

Find and change the following variables within your Vagrantfile:

$num_instances = 5
$update_channel = "stable"

The $num_instances variable defines how many machines Vagrant starts. We’re starting 5 instances, even though we defined a cluster size of 4 previously when obtaining the etcd discovery token. The extra CoreOS instance will fall back to being a proxy node by default.
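The split between cluster members and proxy nodes is simple arithmetic. The following sketch only illustrates the counting described above; the variable names are our own and are not read by Vagrant or etcd.

```shell
#!/bin/sh
# Illustrative values from this guide.
num_instances=5   # machines started by Vagrant
cluster_size=4    # size passed to the discovery service

# At most cluster_size machines become etcd members.
if [ "$num_instances" -gt "$cluster_size" ]; then
  members=$cluster_size
else
  members=$num_instances
fi

# Every machine beyond that falls back to being a proxy.
proxies=$(( num_instances - members ))
echo "members=$members proxies=$proxies"
```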

CoreOS offers three update channels: stable, beta, and alpha. To be honest, it doesn’t really matter which channel you choose when just spinning up your first cluster. Nevertheless, we stay on the safe side and go with the stable version of CoreOS.

Start Your Cluster

We’ve finished the required configuration to get our CoreOS cluster up and running. Using Vagrant’s default provider, VirtualBox, we start the cluster with the vagrant up command.

The command line output will look like this:

$ vagrant up
Bringing machine 'core-01' up with 'virtualbox' provider...  
Bringing machine 'core-02' up with 'virtualbox' provider...  
Bringing machine 'core-03' up with 'virtualbox' provider...  
Bringing machine 'core-04' up with 'virtualbox' provider...  
Bringing machine 'core-05' up with 'virtualbox' provider...  
==> core-01: Importing base box 'coreos-stable'...
==> core-01: Matching MAC address for NAT networking...
==> core-01: Checking if box 'coreos-stable' is up to date...
==> core-01: A newer version of the box 'coreos-stable' is available! You currently
==> core-01: have version '717.3.0'. The latest is version '723.3.0'. Run
==> core-01: `vagrant box update` to update.
==> core-01: Setting the name of the VM: coreos-vagrant_core-01_1438938703047_7521
==> core-01: Clearing any previously set network interfaces...
==> core-01: Preparing network interfaces based on configuration...
    core-01: Adapter 1: nat
    core-01: Adapter 2: hostonly
==> core-01: Forwarding ports...
    core-01: 22 => 2222 (adapter 1)
==> core-01: Running 'pre-boot' VM customizations...
==> core-01: Booting VM...
…

Once all 5 machines within the cluster are created and booted by vagrant, you can check their status:

$ vagrant status
Current machine states:

core-01                   running (virtualbox)  
core-02                   running (virtualbox)  
core-03                   running (virtualbox)  
core-04                   running (virtualbox)  
core-05                   running (virtualbox)

This environment represents multiple VMs. The VMs are all listed  
above with their current state. For more information about a specific  
VM, run `vagrant status NAME`.  

Every machine is running. Great :)

Let’s check the cluster and machine status within etcd and fleet. You can use vagrant ssh <machine-name> to ssh into any of the created and booted machines.

Cluster Members

etcd is responsible for connecting all machines within the cluster. It stores information about the cluster members and automatically selects a leader.

The following command is executed from within a CoreOS system. SSH into one of the machines, show the list of cluster members with the command etcdctl member list, and check whether every machine joined correctly during boot.

$ etcdctl member list
bc265403c17a8873: name=01a2ce2426014b6285fc87dc9c2ff8b0 peerURLs=http://172.17.8.101:2380 clientURLs=http://172.17.8.101:2379  
bd41d18de4cae191: name=ce70e12e334045469a392d1900a6f0dd peerURLs=http://172.17.8.103:2380 clientURLs=http://172.17.8.103:2379  
e14b97603ae78a2d: name=ca89e84c086e4b459fec4d9b458b1e6b peerURLs=http://172.17.8.104:2380 clientURLs=http://172.17.8.104:2379  
efd857606dbfcd01: name=c919f394360c4fa78f518f28562af511 peerURLs=http://172.17.8.102:2380 clientURLs=http://172.17.8.102:2379  

The list prints 4 cluster members. Remember that we defined a cluster size of 4 machines while obtaining the discovery token. etcd automatically lets 4 nodes join the cluster, and every additional machine falls back to a proxy node.

Machines Within the Cluster

Since we created 5 CoreOS machines, let’s check whether all nodes booted correctly and are available. Even though our etcd cluster consists of 4 machines, there is 1 proxy node that stays in the information loop of etcd. The proxy node doesn’t store cluster data itself; it forwards client requests to the etcd members.

Use the fleetctl command line utility to show the list of machines available.

$ fleetctl list-machines --full=true
01a2ce2426014b6285fc87dc9c2ff8b0    172.17.8.101    -  
10fb48847b94440dae94054d3b88f44a    172.17.8.105    -  
c919f394360c4fa78f518f28562af511    172.17.8.102    -  
ca89e84c086e4b459fec4d9b458b1e6b    172.17.8.104    -  
ce70e12e334045469a392d1900a6f0dd    172.17.8.103    -  

We use the --full=true option to show the full id of each machine. This way, we can compare the machines within the etcd cluster and machines generally available (including proxy nodes).
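The comparison can also be scripted. The sketch below is one possible way, using the sample outputs from this guide as fixed input. On a real node you would capture the output of etcdctl member list and fleetctl list-machines --full=true instead. It relies on sed, awk, sort, and comm being available.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL

# Sample outputs copied from this guide.
members='bc265403c17a8873: name=01a2ce2426014b6285fc87dc9c2ff8b0 peerURLs=http://172.17.8.101:2380 clientURLs=http://172.17.8.101:2379
bd41d18de4cae191: name=ce70e12e334045469a392d1900a6f0dd peerURLs=http://172.17.8.103:2380 clientURLs=http://172.17.8.103:2379
e14b97603ae78a2d: name=ca89e84c086e4b459fec4d9b458b1e6b peerURLs=http://172.17.8.104:2380 clientURLs=http://172.17.8.104:2379
efd857606dbfcd01: name=c919f394360c4fa78f518f28562af511 peerURLs=http://172.17.8.102:2380 clientURLs=http://172.17.8.102:2379'

machines='01a2ce2426014b6285fc87dc9c2ff8b0    172.17.8.101    -
10fb48847b94440dae94054d3b88f44a    172.17.8.105    -
c919f394360c4fa78f518f28562af511    172.17.8.102    -
ca89e84c086e4b459fec4d9b458b1e6b    172.17.8.104    -
ce70e12e334045469a392d1900a6f0dd    172.17.8.103    -'

member_file=$(mktemp)
machine_file=$(mktemp)

# The value of each name= field is the machine id of an etcd member.
printf '%s\n' "$members" | sed -n 's/.*name=\([^ ]*\).*/\1/p' | sort > "$member_file"
# The first column of the fleetctl output is the machine id.
printf '%s\n' "$machines" | awk '{print $1}' | sort > "$machine_file"

# Ids that appear only in the fleet list are not etcd members (proxy nodes).
proxies=$(comm -13 "$member_file" "$machine_file")
rm -f "$member_file" "$machine_file"

echo "$proxies"
```

With the sample data, the only id left over belongs to the machine at 172.17.8.105, which is the fifth instance that fell back to proxy mode.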


Problem: etcd2 Not Running or Machines Missing Within the Cluster

When starting out with CoreOS, etcd, and fleet, we directly ran into the issue that etcd couldn’t connect to the cluster or other machines. At first, we didn’t know what to do, because we didn’t understand why this error occurred.

$ fleetctl list-machines
Error retrieving list of active machines: googleapi: Error 503: fleet server unable to communicate with etcd  

We couldn’t get the machines connected to each other. The first thing we overlooked: the user-data.sample file that we copied over to user-data contains config definitions for both etcd and etcd2.

Every new CoreOS release ships with etcd2. Verify that you start etcd2 within the #cloud-config and delete the etcd lines.

This issue can also have another cause: there are not enough machines within your cluster. You need at least as many machines within the cluster as defined when obtaining the discovery token. Defining a cluster size of 5 machines requires you to start and connect at least 5 etcd instances to the cluster. Only then is your cluster in a healthy state.

Outlook

This guide showed you how to set up your local CoreOS cluster with the help of Vagrant. Don’t hesitate to crash any CoreOS instance or misconfigure the cluster: with Vagrant, you can destroy and recreate the machines at any time.

Next week, we’ll dive more into etcd. We’ll have a look at its internal architecture, configuration options, and the role etcd plays within the CoreOS ecosystem.

