Setting up a Nomad cluster on Linux

We talked earlier about our decision to build a platform using Nomad and the importance of having a dedicated team to provide support for the development process. After some weeks of work, we now have a firm understanding of the moving parts that make Nomad function, and where its main strengths lie.

Nomad is a simple but flexible orchestrator that stands as an alternative to Kubernetes. Its simplicity is a consequence of its architecture and scope: Nomad schedules jobs, and it provides little beyond the management tasks necessary to do that. In this sense Nomad might appear lackluster, but the trick is that it integrates very well with other HashiCorp products, namely Consul and Vault, to provide a powerful, cohesive set of features.

Here is a brief summary of each program:

  • Nomad: Cluster manager that works as a job scheduler/orchestrator. It exposes its management operations through several interfaces (CLI, REST API, Web UI).
  • Consul: Tool that works as the control plane of a service mesh. It provides service discovery, service mesh capabilities, and health checks. Some Nomad features (clustering, service registration, etc.) depend on the nodes being able to reach the Consul agents.
  • Vault: Secrets manager that handles the creation, rotation, and distribution of sensitive data such as tokens, API keys, and certificates.

All three systems are cluster-aware, and Nomad agents can run as servers and/or clients. The main difference is that only servers decide on allocations (based on the information available about the clients at a specific point in time), whereas clients simply run the jobs scheduled to them. Servers participate in consensus and have strict networking requirements (low latency between them) to ensure availability and fast responses. Because of this, production clusters are expected to have only 3-5 server nodes, plus as many clients as the workload requires. For the purposes of this tutorial, we will set up a simple 3-node cluster, where each node runs as both a Nomad server and client, alongside Consul. Vault integration and topology considerations for a production-ready cluster are left for future posts.

We will set up our servers using Arch Linux. The only requirements will be that the nodes are all within the same LAN and that we have SSH access to them.

Prepare certificates for Consul

In order to communicate securely, every Consul agent needs a shared symmetric key for gossip encryption, which we can generate before setting up Consul on the cluster. We are also going to generate a CA certificate and key and distribute them to the nodes.

The following can be executed on any computer with Consul installed; it does not have to be within the cluster LAN. For this guide, we install Consul on a local computer (here via Homebrew) and use it to generate these materials before jumping onto the cluster nodes:

brew install consul
consul keygen

Take note of the resulting key, then run

consul tls ca create

to generate the CA (Certificate Authority) files. You should now have two files: consul-agent-ca.pem and consul-agent-ca-key.pem. Quoting the Consul documentation:

The CA certificate, consul-agent-ca.pem, contains the public key necessary to validate Consul certificates and therefore must be distributed to every node that runs a consul agent.
The CA key, consul-agent-ca-key.pem, will be used to sign certificates for Consul nodes and must be kept private. Possession of this key allows anyone to run Consul as a trusted server or generate new valid certificates for the datacenter and obtain access to all Consul data, including ACL tokens.

We can copy these files to the nodes using scp or any similar tool:

scp consul-agent-ca* root@<server_ip>:~/
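If you are provisioning several nodes, this copy step can be scripted. A minimal sketch, assuming three nodes at hypothetical addresses (replace them with your own); by default it only prints the commands, and setting DRY_RUN= runs them for real:

```shell
set -euo pipefail

# Hypothetical node addresses -- replace with the IPs of your cluster nodes.
NODES=("192.168.100.2" "192.168.100.3" "192.168.100.4")

# Prints the commands by default; run with DRY_RUN= to actually copy.
DRY_RUN="${DRY_RUN:-echo}"

# Copy the CA certificate and key to every node's root home directory.
for ip in "${NODES[@]}"; do
  ${DRY_RUN} scp consul-agent-ca.pem consul-agent-ca-key.pem "root@${ip}:~/"
done
```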

Installing Consul

We are now ready to jump on the cluster nodes running Linux. Let's start by installing the Consul binaries:

export CONSUL_VERSION="1.12.3"
export CONSUL_URL="https://releases.hashicorp.com/consul"

curl --silent --remote-name ${CONSUL_URL}/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip
unzip consul_${CONSUL_VERSION}_linux_amd64.zip
chown root:root consul
mv consul /usr/local/bin/

Next, we create a dedicated consul user along with the required directories and configuration file:

useradd --system --home /etc/consul.d --shell /bin/false consul
mkdir --parents /opt/consul
chown --recursive consul:consul /opt/consul

mkdir --parents /etc/consul.d

# Create config file, which we will write later
touch /etc/consul.d/consul.hcl
chown --recursive consul:consul /etc/consul.d
chmod 640 /etc/consul.d/consul.hcl

Notice that as part of the script, we created an empty file called consul.hcl.

Consul merges every configuration file found in its config directory, and splitting the configuration across multiple files is in fact the recommended practice. For simplicity, however, we will use a single file in this guide. Recall that we want the nodes to run as both clients and servers, so the file must contain appropriate settings for both roles.
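To illustrate the multi-file approach, here is a sketch of how some of the same settings could be split into separate files; Consul loads and merges every .hcl file in its config directory (the file names are our own choice, and we write to a scratch directory here rather than /etc/consul.d):

```shell
set -euo pipefail

# Consul merges all *.hcl files found in the directory passed to -config-dir.
mkdir --parents consul.d

cat > consul.d/server.hcl <<'EOF'
server = true
bootstrap_expect = 3
EOF

cat > consul.d/tls.hcl <<'EOF'
verify_incoming = true
verify_outgoing = true
EOF

ls consul.d
```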

consul.hcl should contain the following minimal configuration:

server = true
bootstrap_expect = 3

datacenter = "dc1"
data_dir = "/opt/consul"
encrypt = "<your-symmetric-encryption-key>"
ca_file = "/etc/consul.d/consul-agent-ca.pem"
cert_file = "/etc/consul.d/dc1-server-consul-0.pem"
key_file = "/etc/consul.d/dc1-server-consul-0-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
retry_join = ["192.168.100.2"]
bind_addr = "{{ GetPrivateInterfaces | include \"network\" \"192.168.100.0/24\" | attr \"address\" }}"

acl = {
  enabled = true
  default_policy = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

Replace <your-symmetric-encryption-key> with the key we generated earlier. Both bind_addr and retry_join should be tailored to your LAN configuration; retry_join ensures that if a server loses connection with the datacenter for any reason (including a node restart), it rejoins when it comes back. Some of the remaining fields refer to certificates we have not created yet.
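To avoid pasting the key into each node's file by hand, you can keep a placeholder in a template and substitute it at deploy time. A minimal sketch using sed (the placeholder name and example key are our own; use the real output of consul keygen):

```shell
set -euo pipefail

# Example key only -- substitute the real output of `consul keygen`.
GOSSIP_KEY="pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="

# The template keeps a placeholder where the key belongs.
printf 'encrypt = "__GOSSIP_KEY__"\n' > consul.hcl.tmpl

# Using `|` as the sed delimiter avoids clashing with `/` in base64 keys.
sed "s|__GOSSIP_KEY__|${GOSSIP_KEY}|" consul.hcl.tmpl > consul.hcl

cat consul.hcl
```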

You might have noticed the {{ }} syntax. Consul and Nomad use it to delimit go-sockaddr templates, a small template language for resolving network addresses at startup. You can learn more about it here. In this case, we use it to select the IP address we want to bind for cluster communication.

Finally, let's create a systemd service file for Consul at /etc/systemd/system/consul.service:

[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/consul.d/consul.hcl

[Service]
Type=exec
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Next, let's create the certificates for the Consul server. Run this in the directory where you previously copied consul-agent-ca.pem and consul-agent-ca-key.pem, since it uses the CA key to sign a valid certificate.

consul tls cert create -server -dc dc1

cp consul-agent-ca.pem /etc/consul.d/
cp dc1-server-consul-0.pem /etc/consul.d/
cp dc1-server-consul-0-key.pem /etc/consul.d/

# Make sure the consul user can read the certificates
chown consul:consul /etc/consul.d/*.pem

Let's go ahead and enable and start the Consul service:

systemctl enable consul
systemctl start consul

If everything goes well, after doing this on every server and waiting a few seconds, running consul members should return a list of all nodes. Congrats, we are past the toughest part of the process!
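If you script the rollout, it also helps to wait for the cluster to converge before moving on. Here is a sketch of a generic retry helper (the consul members line shows how we would use it on a node):

```shell
# wait_for EXPECTED CMD...: poll CMD until it prints a number >= EXPECTED,
# retrying up to 30 times with a short pause in between.
wait_for() {
  local expected="$1"; shift
  local i
  for i in $(seq 1 30); do
    if [ "$("$@")" -ge "${expected}" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# On a cluster node, we would count the data rows of `consul members`:
count_members() { consul members | tail --lines=+2 | wc --lines; }
# wait_for 3 count_members
```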

Installing Nomad

We are only missing one service: Nomad! Let's install it and configure it. The process will be very similar to Consul's:

export NOMAD_VERSION="1.3.2"
curl --silent --remote-name https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip
unzip nomad_${NOMAD_VERSION}_linux_amd64.zip
chown root:root nomad
mv nomad /usr/local/bin/

mkdir --parents /opt/nomad
mkdir --parents /etc/nomad.d
chmod 700 /etc/nomad.d

# Create config files, which we will write later
touch /etc/nomad.d/nomad.hcl
touch /etc/systemd/system/nomad.service

# The service file below expects this environment file to exist, even if empty
touch /etc/nomad.d/nomad.env

For the nomad.service file, we have:

[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs
Wants=network-online.target
After=network-online.target

[Service]
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
EnvironmentFile=-/etc/nomad.d/nomad.env
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
StartLimitBurst=3
StartLimitIntervalSec=10
TasksMax=infinity

[Install]
WantedBy=multi-user.target

The config directory is set to /etc/nomad.d, where we created nomad.hcl. Let's look at its configuration:

data_dir = "/opt/nomad"

client {
  enabled = true

  network_interface = "{{ GetPrivateInterfaces | include \"network\" \"192.168.100.0/24\" | attr \"name\" }}"

  host_network "public" {
    interface = "{{ GetPublicInterfaces | attr \"name\" }}"
  }

  host_network "private" {
    interface = "{{ GetPrivateInterfaces | include \"network\" \"192.168.100.0/24\" | attr \"name\" }}"
  }
}

acl {
  enabled = true
}

vault {
  enabled = false
  address = "http://vault.service.consul:8200"
}

server {
  enabled = true
  bootstrap_expect = 3
}

Again, the file contains both client- and server-related configuration, since our nodes act as both. As with Consul, you can split these settings across several files within the config directory we specified.

We included the vault stanza, but we are leaving it disabled for now. In future posts, we will install Vault and enable the integration with Nomad, so that our jobs can consume secrets securely.

We are now able to start the Nomad service:

systemctl enable nomad
systemctl start nomad
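One extra step: because we enabled ACLs in nomad.hcl, the cluster starts without any tokens, and most commands will be rejected until the ACL system is bootstrapped once. A sketch (the address and token below are placeholders; use one of your server IPs and the token the bootstrap command prints):

```shell
# Run once, on any server, after the cluster is up; it prints the initial
# management token (Secret ID). Store it somewhere safe:
#   nomad acl bootstrap

# Then point your CLI at the cluster, locally or from another machine.
# Placeholder values -- substitute your own.
export NOMAD_ADDR="http://192.168.100.2:4646"
export NOMAD_TOKEN="<management-token-from-bootstrap>"
```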

Congratulations! You now have a highly available Nomad cluster with service discovery, a service mesh, and job orchestration. As with Consul, you should now be able to run Nomad commands, either on the cluster nodes or from your local computer by exporting NOMAD_ADDR and NOMAD_TOKEN accordingly:

$ nomad status

ID        DC   Name                 Class   Drain  Eligibility  Status
3ff57184  dc1  entropy-cluster-A-3  <none>  false  eligible     ready
e61d47d4  dc1  entropy-cluster-A-1  <none>  false  eligible     ready
18e5af96  dc1  entropy-cluster-A-2  <none>  false  eligible     ready

You should also be able to reach Nomad's web UI by visiting any of your hosts on port 4646.

Wrapping up

We now have a cluster running Nomad and Consul. In the next post of this series, we'll configure Vault, and use the cluster to run a web application and its database as our first Nomad jobs.
