Seamless Docker Multihost Overlay Networking on DigitalOcean With Machine, Swarm, and Compose ft. RethinkDB
There have been a lot of good articles popping up lately on the new Docker networking features and how to use them with existing Docker tools. So far, most guides will get you through setting up VirtualBox, which is great for getting started, but nothing beats the feeling of getting your hands on an enormous supply of seamlessly networked computing power. So, this article uses Docker Machine, Swarm, and Compose to take it to the cloud and put that power in your hands. I hope to stimulate your imagination as well as set the gears turning for you on some complications and potential solutions for actually putting this stuff out there in the real world.
Today we’re going to:
- Spin up a Swarm cluster on DigitalOcean using Docker Machine
- Provision the created nodes using Ansible containers
- Run a 4-node RethinkDB cluster across those nodes
- Use Docker Machine SSH port forwarding to access the RethinkDB admin panel without exposing it publicly
If you'd like to follow along at home, you can clone this repo, which contains the relevant files and a convenient script with all of the outlined commands.
Initial Setup
First install:
- The latest versions of Docker Machine and Docker Compose
- The Docker 1.9.1 client binary
And create an account with DigitalOcean if you don’t have one already. Next,
ensure that the DIGITALOCEAN_ACCESS_TOKEN environment variable is set with
your DigitalOcean API token.
$ export DIGITALOCEAN_ACCESS_TOKEN=asdfasdfasdfasdfasdfasdfasdfasdf
We’re going to use Debian 8 for our host operating system today, so let’s
configure Machine to expect that for this terminal session as well (Docker
Machine recently added support for Ubuntu >=15.04, which is needed for the
overlay driver of libnetwork, but that won’t be released until about a week
after the time of writing). We’ll also make sure to activate the private
networking feature on the created servers. This will come in handy later as
we’re setting up our overlay network. I also set the DIGITALOCEAN_REGION to
sfo1 (San Francisco), but you could set it to a region of your choice.
$ export DIGITALOCEAN_IMAGE=debian-8-x64
$ export DIGITALOCEAN_PRIVATE_NETWORKING=true
$ export DIGITALOCEAN_REGION=sfo1
The Key-Value Store
First, let’s create a host to contain the key-value store. This will be used by both Swarm and libnetwork to bootstrap and communicate the shared state of the cluster across nodes.
$ docker-machine create -d digitalocean kvstore
...
When that’s finished, take a look at the output of the ifconfig command on
the created instance. It should look something like this.
$ docker-machine ssh kvstore ifconfig
docker0 Link encap:Ethernet HWaddr 02:42:88:cb:4f:b9
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
eth0 Link encap:Ethernet HWaddr 04:01:87:b3:66:01
inet addr:159.203.108.236 Bcast:159.203.111.255 Mask:255.255.240.0
inet6 addr: fe80::601:87ff:feb3:6601/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:12869 errors:0 dropped:0 overruns:0 frame:0
TX packets:6125 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:16411203 (15.6 MiB) TX bytes:566452 (553.1 KiB)
eth1 Link encap:Ethernet HWaddr 04:01:87:b3:66:02
inet addr:10.132.231.52 Bcast:10.132.255.255 Mask:255.255.0.0
inet6 addr: fe80::601:87ff:feb3:6602/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:648 (648.0 B) TX bytes:728 (728.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
You can see that Docker has created its usual bridge, docker0, and there is
an interface eth0 which allows inbound and outbound access to the Internet.
There is also an interface eth1 which allows private networking between nodes
in the same datacenter. We’ll use it in this walkthrough to ensure that we at
least don’t expose our key-value store to the entire Internet.
You can most likely verify this private-vs-public address assertion by using
ping on your local computer.
The public address:
$ ping -c 1 $(docker-machine ssh kvstore 'ifconfig eth0 | grep "inet addr:" | cut -d: -f2 | cut -d" " -f1')
PING 159.203.108.236 (159.203.108.236): 56 data bytes
64 bytes from 159.203.108.236: icmp_seq=0 ttl=48 time=79.571 ms
--- 159.203.108.236 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 79.571/79.571/79.571/0.000 ms
The private address (note that we use eth1 instead of eth0 here):
$ ping -c 1 $(docker-machine ssh kvstore 'ifconfig eth1 | grep "inet addr:" | cut -d: -f2 | cut -d" " -f1')
PING 10.132.231.52 (10.132.231.52): 56 data bytes
--- 10.132.231.52 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
DIGRESSION: What to expect when you’re expecting (libnetwork & swarm)
The overlay driver for libnetwork has a few expectations of you before it
will work its sufficiently-advanced-technology magic. Likewise,
Swarm has at least one thing it needs configured to be able to properly
schedule based on resources.
I’ve set everything up properly in this article, but if you’re following along at home and deviating from the specifically prescribed commands, you ABSOLUTELY MUST ensure that (a couple of quick spot checks follow this list):
- Your Linux kernel version is greater than or equal to 3.16.
- The ports for Serf and VXLAN, 7946 (TCP and UDP) and 4789 (UDP) respectively, are open to inbound traffic.
- Memory accounting is enabled on the created instances (this is needed for Swarm to schedule properly with -m).
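If you want to spot-check the kernel and memory-accounting requirements by hand once the hosts below exist (the Ansible provisioning later in this article takes care of the firewall rules and the GRUB change), something like this will do; queenbee is just one of the host names used below:
$ docker-machine ssh queenbee uname -r          # should print 3.16 or newer
$ docker-machine ssh queenbee cat /proc/cmdline # memory accounting needs cgroup_enable=memory swapaccount=1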
Swarm
So, we have an IP that we can use to talk to other servers in the same data center, and we will use this to bootstrap our cluster. Let’s go ahead and save that into a shell environment variable so we don’t have to execute that lengthy command each time we want to use it.
$ export KV_IP=$(docker-machine ssh kvstore 'ifconfig eth1 | grep "inet addr:" | cut -d: -f2 | cut -d" " -f1')
In the future, hopefully this might be available from some type of
docker-machine ip --private command or similar.
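Until then, a tiny shell helper saves retyping that pipeline (a sketch; the machine_private_ip name is made up, and it assumes eth1 is the private interface, as it is on these DigitalOcean droplets):
$ machine_private_ip() { docker-machine ssh "$1" 'ifconfig eth1 | grep "inet addr:" | cut -d: -f2 | cut -d" " -f1'; }
$ export KV_IP=$(machine_private_ip kvstore)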
Now let’s run Consul, a key-value store which enables
discovery of nodes for Docker. docker run’s -p flag accepts an optional
host address to which the published container port should be bound. E.g., you
can expose port 8080 from the container only on localhost, instead of on
0.0.0.0 (the default), using docker run -p 127.0.0.1:8080:8080.
So, naturally, in our Consul container, we will forward the port to our private networking interface mentioned above so that only machines in the same datacenter can access it.
$ eval $(docker-machine env kvstore)
$ docker run -d \
-p ${KV_IP}:8500:8500 \
-h consul \
--restart always \
progrium/consul -server -bootstrap
...
Now we’ll set up the Swarm master box (I like to think of it as a “queen bee”,
hence the name). The --swarm and --swarm-master flags are hopefully
self-explanatory. But take a look at those other flags. They’re where the fun
bits happen.
$ docker-machine create \
-d digitalocean \
--swarm \
--swarm-master \
--swarm-discovery="consul://${KV_IP}:8500" \
--engine-opt="cluster-store=consul://${KV_IP}:8500" \
--engine-opt="cluster-advertise=eth1:2376" \
queenbee
--swarm-discovery instructs the created Swarm worker container to look for
the created key-value store using the specified address and protocol
(consul:// here, but it also works for Docker Hub discovery using token://,
ZooKeeper using zk://, and so on). This
allows the instances of the Swarm to find and communicate with each other.
--engine-opt allows us to set Docker daemon flags without needing to edit the
configuration files manually. Here we have two flags that we’re setting:
--cluster-store and --cluster-advertise.
- --cluster-store tells the Docker daemon which KV store to use for libnetwork’s needed coordination, similar to the --swarm-discovery option outlined above.
- --cluster-advertise allows us to specify an address that the created Docker daemon should “advertise” as connectable to the cluster using the KV store.
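If you’re curious whether those flags actually made it onto the daemon, you can peek at the running process on the created host (just a sanity check):
$ docker-machine ssh queenbee "ps -ef | grep [d]ocker"
You should see --cluster-store and --cluster-advertise among the daemon’s arguments.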
After the queen bee, we create at least one worker bee node. E.g.:
$ export NUM_WORKERS=3; for i in $(seq 1 $NUM_WORKERS); do
docker-machine create \
-d digitalocean \
--swarm \
--swarm-discovery="consul://${KV_IP}:8500" \
--engine-opt="cluster-store=consul://${KV_IP}:8500" \
--engine-opt="cluster-advertise=eth1:2376" \
workerbee-${i} &
done;
wait
You should now be able to verify that the swarm has been created.
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM
default - virtualbox Saved
kvstore - digitalocean Running tcp://159.203.108.236:2376
queenbee * digitalocean Running tcp://159.203.105.26:2376 queenbee (master)
workerbee-1 - digitalocean Running tcp://159.203.116.251:2376 queenbee
workerbee-2 - digitalocean Running tcp://159.203.77.141:2376 queenbee
workerbee-3 - digitalocean Running tcp://159.203.71.235:2376 queenbee
Set the environment variables for connection to the swarm master:
$ eval $(docker-machine env --swarm queenbee)
And verify connectivity of the swarm by running docker info. You should see
something like this:
$ docker info
Containers: 5
Images: 4
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 4
queenbee: 159.203.105.26:2376
└ Containers: 2
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 519.2 MiB
└ Labels: executiondriver=native-0.2, kernelversion=3.16.0-4-amd64, operatingsystem=Debian GNU/Linux 8 (jessie), provider=digitalocean, storagedriver=aufs
workerbee-1: 159.203.116.251:2376
└ Containers: 1
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 519.2 MiB
└ Labels: executiondriver=native-0.2, kernelversion=3.16.0-4-amd64, operatingsystem=Debian GNU/Linux 8 (jessie), provider=digitalocean, storagedriver=aufs
workerbee-2: 159.203.77.141:2376
└ Containers: 1
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 519.2 MiB
└ Labels: executiondriver=native-0.2, kernelversion=3.16.0-4-amd64, operatingsystem=Debian GNU/Linux 8 (jessie), provider=digitalocean, storagedriver=aufs
workerbee-3: 159.203.71.235:2376
└ Containers: 1
└ Reserved CPUs: 0 / 1
└ Reserved Memory: 0 B / 519.2 MiB
└ Labels: executiondriver=native-0.2, kernelversion=3.16.0-4-amd64, operatingsystem=Debian GNU/Linux 8 (jessie), provider=digitalocean, storagedriver=aufs
CPUs: 4
Total Memory: 2.028 GiB
Name: 37a57749a3b9
Note that you can see all of the Swarm containers:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2091ccf25804 swarm:latest "/swarm join --advert" 12 minutes ago Up 12 minutes 2375/tcp workerbee-1/swarm-agent
097f32b1d435 swarm:latest "/swarm join --advert" 12 minutes ago Up 12 minutes 2375/tcp workerbee-2/swarm-agent
4eb1fc84d399 swarm:latest "/swarm join --advert" 12 minutes ago Up 12 minutes 2375/tcp workerbee-3/swarm-agent
d6adead23e97 swarm:latest "/swarm join --advert" 20 minutes ago Up 20 minutes 2375/tcp queenbee/swarm-agent
37a57749a3b9 swarm:latest "/swarm manage --tlsv" 20 minutes ago Up 20 minutes 2375/tcp, 159.203.105.26:3376->3376/tcp queenbee/swarm-agent-master
Bootstrapping node configuration
We can use the Ansible trick from this article to bootstrap some basic configuration on the nodes now that they have been created. This will set up some firewalls, install some sysadmin-friendly software on the created instances, and configure the GRUB profile to activate memory accounting.
To make the previous article’s trick work with Swarm, we can update the
definition of the provision service to have anti-affinity with other
containers of the same type using something like this:
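(A sketch: the image name comes from the Compose pull output below, the label/affinity pair is the important addition, and the full service definition lives in the repo linked above.)
provision:
  image: nathanleclaire/ansibleprovision
  # Provisioning needs broad access to the host, hence the cleanup warning below
  privileged: true
  labels:
    - "provision=true"
  environment:
    # Swarm anti-affinity: never schedule next to another container carrying this label
    - "affinity:provision!=true"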
The label and Swarm scheduling constraint set through the environment variable
will ensure that no two provision service containers are scheduled on the
same host.
To provision all of the nodes (the workers plus the master), this will do:
$ for i in $(seq 0 ${NUM_WORKERS}); do docker-compose run -d provision; done
Pulling provision (nathanleclaire/ansibleprovision:latest)...
workerbee-2-nathanleclaire-11-07-2015: Pulling nathanleclaire/ansibleprovision:latest... : downloaded
queenbee-nathanleclaire-11-07-2015: Pulling nathanleclaire/ansibleprovision:latest... : downloaded
workerbee-1-nathanleclaire-11-07-2015: Pulling nathanleclaire/ansibleprovision:latest... : downloaded
workerbee-3-nathanleclaire-11-07-2015: Pulling nathanleclaire/ansibleprovision:latest... : downloaded
ansible_provision_run_1
ansible_provision_run_2
ansible_provision_run_3
ansible_provision_run_4
(Note that the loop starts at 0 so the master is accounted for here as well.)
While the Ansible containers are running, you can check up on them using
docker ps and docker logs, e.g. (the label filter matches the label Compose
adds to each service’s containers, and the run name comes from the output above):
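$ docker ps --filter label=com.docker.compose.service=provision
$ docker logs -f ansible_provision_run_1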
Provisioning the kvstore node in a similar fashion is left as an exercise for
the reader.
Since Ansible installed it for us, you can invoke htop over SSH using Docker
Machine on any given host like so:
$ docker-machine ssh queenbee -t htop
Don’t forget to clean up the provisioning containers. They shouldn’t be left around due to their highly privileged nature.
$ docker rm $(docker ps -aq --filter label=com.docker.compose.service=provision)
You have to restart the machines if you enabled memory accounting as well:
$ docker-machine restart queenbee workerbee-{1..3}
Fun With Cross-Host Networking
Now that we have the Swarm / libnetwork cluster up and running and lightly provisioned, let’s do something fun. We’ll run and scale a RethinkDB cluster which communicates seamlessly across hosts using the new libnetwork changes.
Our service definition in docker-compose.yml for this RethinkDB cluster looks
something like this (the exact file is in the repo linked above):
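# docker-compose.yml (a sketch in Compose v1 format, matching the options discussed below)
leader:
  image: rethinkdb
  # Fixed container name so followers can reach the leader by hostname on the overlay network
  container_name: rethinkleader
  command: rethinkdb --bind all
  # Memory reservation so Swarm spreads one instance per ~500 MB host
  mem_limit: 450m
  ports:
    # RethinkDB admin panel, bound only to the host's localhost
    - "127.0.0.1:8080:8080"
follower:
  image: rethinkdb
  command: rethinkdb --join rethinkleader
  mem_limit: 450m
  # Keep retrying until the leader is up
  restart: always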
There are a few noteworthy things going on here, so let’s take a second to discuss
why it’s set up this way. We have two services, leader and follower. The
leader service starts a RethinkDB instance which is listening on all ports
(admin interface, client connection, and intracluster connection) and available
to accept connections from other RethinkDB instances. The container name will
be used as a hostname when the follower instances connect so I’ve set
container_name explicitly to rethinkleader in order to avoid having to rely
on Compose’s automatic container naming.
For each running instance of the service, we reserve memory using a mem_limit
option of 450m (megabytes). This ensures that the RethinkDB instances are
spread evenly across the cluster (the DigitalOcean servers in this walkthrough
have ~500 MB of RAM; YMMV if you’re using a different instance type). On the leader
node, we expose 8080 (the RethinkDB admin interface panel) to localhost of
the instance where it will end up.
Now, note the two remaining properties of the follower service. The first is
that the command has been set to rethinkdb --join rethinkleader (RethinkDB
will default to attempting to connect to 29015, the default port for
intracluster communication, when invoked with the --join <host> flag). Because
Compose will automatically create an overlay network if it’s pointed at a
Swarm cluster and --x-networking is set, the rethinkleader container will
be available at that same hostname. Therefore, even though the follower
container(s) are scheduled on different hosts, they will be able to
transparently access the leader over the Docker overlay network!
The second important property is that restart: always has been specified as a
restart policy for the container. It’s kind of a hack, but this is used to
ensure that if Compose starts a follower before the leader, the follower retries
the connection until the leader is up as well (usually just one or two times). The
Compose maintainers insist that they do not want to add custom ordering of
service start (arguing that services should be resilient to this type of
failure on their own), and I think it’s a pretty reasonable position. Restart
policies with a maximum number of failures, and/or more robust entrypoint
scripts, may be a better option for real-world use cases.
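For instance, the on-failure restart policy takes an optional retry cap; something like this in the follower definition (or --restart=on-failure:10 on a plain docker run) would stop retrying after ten failed attempts instead of looping forever:
follower:
  # Give up after ten failed starts rather than restarting indefinitely
  restart: "on-failure:10"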
Once you have this Compose file set up, ensure that your Docker environment variables are set to talk to the Swarm master, then:
$ docker-compose --x-networking up -d
Once the docker-compose up finishes running, we can view the created services
like so:
$ docker-compose ps
Name Command State Ports
------------------------------------------------------------------------------------------------------------
mhswarm_follower_1 rethinkdb --join rethinkleader Up 28015/tcp, 29015/tcp, 8080/tcp
rethinkleader rethinkdb --bind all Up 28015/tcp, 29015/tcp, 127.0.0.1:8080->8080/tcp
Note that you can also see in docker ps which nodes they were scheduled on:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
980a0eb54b86 rethinkdb "rethinkdb --bind all" 20 seconds ago Up 18 seconds 28015/tcp, 127.0.0.1:8080->8080/tcp, 29015/tcp workerbee-1/rethinkleader
f118f5d53f2e rethinkdb "rethinkdb --join ret" 22 seconds ago Restarting (1) 20 seconds ago 8080/tcp, 28015/tcp, 29015/tcp workerbee-3/rethinkdb_follower_1
You can see as well that docker-compose created an overlay network named
rethinkdb automatically:
$ docker network ls | grep overlay
0456b3c548eb rethinkdb overlay
You can then scale out to 3 total follower nodes (so, one instance of RethinkDB per host):
$ docker-compose --x-networking scale follower=3
Creating and starting 2 ... done
Creating and starting 3 ... done
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
794b4a2549ec rethinkdb "rethinkdb --join ret" 57 seconds ago Up 55 seconds 8080/tcp, 28015/tcp, 29015/tcp workerbee-1/rethinkdb_follower_3
113a02566687 rethinkdb "rethinkdb --join ret" 58 seconds ago Up 56 seconds 8080/tcp, 28015/tcp, 29015/tcp queenbee/rethinkdb_follower_2
980a0eb54b86 rethinkdb "rethinkdb --bind all" 3 minutes ago Up 3 minutes 28015/tcp, 127.0.0.1:8080->8080/tcp, 29015/tcp workerbee-2/rethinkleader
f118f5d53f2e rethinkdb "rethinkdb --join ret" 3 minutes ago Restarting (1) 3 minutes ago 8080/tcp, 28015/tcp, 29015/tcp workerbee-3/rethinkdb_follower_1
RethinkDB comes with that very nice admin interface available at port 8080 of
the leader, so let’s fork off an SSH tunnel to forward it to our client
computer’s localhost. Find which machine the leader is on (it shows up in the NAMES column above; here it’s workerbee-2), then:
$ docker-machine ssh workerbee-2 -fN -L 8080:localhost:8080
This way, we can open an SSH tunnel to the instance running in the cloud,
without needing to expose the port publicly on the Internet. You should now be
able to access the RethinkDB admin console at localhost:8080 on your local
workstation.
See how the admin console says “Servers / 4 Connected”? They’re all running on different host nodes! Time to do some load testing.
You could expand to an arbitrary number of worker nodes as desired. Just
docker-machine create like we did above and now you have access to as much
computing power as you’re willing to go in for (or that the cloud providers can
handle, which is generally a lot).
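For example, adding a fourth worker is just a matter of repeating the earlier create command with a new (arbitrary) name:
$ docker-machine create \
    -d digitalocean \
    --swarm \
    --swarm-discovery="consul://${KV_IP}:8500" \
    --engine-opt="cluster-store=consul://${KV_IP}:8500" \
    --engine-opt="cluster-advertise=eth1:2376" \
    workerbee-4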
Exercises For The Reader
Some things to chew on:
- We used private networking to ensure that our key-value store wasn’t accessible by the Internet at large, but it may be possible for naughty neighbor nodes we don’t own in the same data center to access it. What steps can we take to ensure that our key-value store, and its dependents, are properly secured and protected? (hint: one possible answer begins with T and ends with S)
- Design an application architecture based on this system which will automatically load balance new instances of an application as new containers are added. Consider that reloading configuration for load balancers such as HAproxy can be a resource-intensive operation. (hint)
- Applications often rely heavily on knowing “secrets” such as API tokens. Describe a simple architecture for sharing and handling secrets with this model. Your answer may not include the words “Vault”, “Keywhiz”, or “Sneaker”.
- What can be done about stateful applications (e.g. databases) in this model? Sketch out a docker volume plugin which might help. What kind of data structure might help with sharing business-logic-related information (i.e. information not meant to be kept in a key-value store) across nodes? (hint and hint)
fin
Hope you had fun and learned something new.
Until next time, stay sassy Internet.
- Nathan