
Peter Goodman

A software engineer and leader living in Auckland building products and teams. Originally from Derry, Ireland.


Lately we’ve been running Concourse CI for a bunch of our builds. We really love Concourse for its pipeline features, ease of configuration, and Docker primitives. However, operating and feeding Concourse can be a voyage of discovery and, sometimes, sadness.

One of the issues with Concourse is that it doesn’t cope well when workers disappear without warning. The missing workers show up as stalled when you run fly workers. Any resources performing check operations for new versions on those workers get stuck and stop triggering builds. You then need to run fly prune-worker for each stalled worker if you want your builds to keep working.

This post covers the basics of better lifecycle management, so you can simply roll the instances in your worker pool Auto-Scaling Group (ASG) when you want fresh ones, without the annoyance of having to clear out stalled workers afterwards.

Lifecycle Hook

Hopefully you are running your Concourse workers in an Auto-Scaling Group. When your ASG removes these instances, nothing tells Concourse that they are dead. To give each worker a chance to retire itself before it goes away, you need to create an Auto-Scaling Lifecycle Hook.

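Create a lifecycle hook for the termination transition (autoscaling:EC2_INSTANCE_TERMINATING) called worker-terminating. Make sure its heartbeat timeout is longer than the 10 minutes the script below sleeps for, otherwise the ASG will terminate the instance before the script finishes.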
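If you prefer the AWS CLI to the console, the hook can be created with put-lifecycle-hook. This is a minimal sketch rather than the exact setup used here: the function name create_worker_hook, the 900-second heartbeat timeout, and the CONTINUE default result are all assumptions, chosen only so that the timeout outlasts the 10-minute sleep in the script later in this post.

```shell
#!/bin/bash
# Sketch: register the termination hook on an existing ASG.
# Usage: create_worker_hook ASG_NAME REGION [HEARTBEAT_TIMEOUT_SECONDS]
create_worker_hook() {
  local asg=$1
  local region=$2
  # Assumed default of 900s, so the hook outlives the script's 10 minute sleep.
  local timeout=${3:-900}

  aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name "worker-terminating" \
    --auto-scaling-group-name "$asg" \
    --lifecycle-transition "autoscaling:EC2_INSTANCE_TERMINATING" \
    --heartbeat-timeout "$timeout" \
    --default-result "CONTINUE" \
    --region "$region"
}
```

You would call it as create_worker_hook concourse-workers us-west-2, where concourse-workers is a made-up group name. The hook name has to match the one the termination script passes to complete-lifecycle-action.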

Then run the following script on each worker from a cron job every minute or two.

#!/bin/bash

# Need this path to allow the aws command to work under cron
PATH=$PATH:/usr/local/bin

# Cron starts this script every minute or two, but one run can sleep for 10 minutes.
# Take a non-blocking lock so overlapping runs exit instead of retiring the worker again.
# (The lock file path is an assumption; any path writable by the cron user will do.)
exec 9>/var/lock/concourse-worker-terminating.lock
flock -n 9 || exit 0

region="us-west-2"

instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id/)

lifecycleState=$(aws autoscaling describe-auto-scaling-instances --instance-ids "$instance_id" --query 'AutoScalingInstances[0].LifecycleState' --output text --region "$region")

if [ "$lifecycleState" = "Terminating:Wait" ]; then
  asg=$(aws autoscaling describe-auto-scaling-instances --instance-ids "$instance_id" --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text --region "$region")

  # We store the TSA Host parameter
  TSA_HOST="my.tsa.host"

  # Tell Concourse this worker is going away so it is not left stalled
  concourse retire-worker \
    --name "$(hostname)" \
    --tsa-host "$TSA_HOST" \
    --tsa-public-key /path/to/tsa-public-key \
    --tsa-worker-private-key /path/to/tsa-worker-private-key

  # Sleep for 10 minutes to let the builds finish. I know, not ideal but it works for now
  sleep 10m

  service concourse-worker stop

  # Let the ASG carry on with terminating the instance
  aws autoscaling complete-lifecycle-action \
    --instance-id "$instance_id" \
    --auto-scaling-group-name "$asg" \
    --lifecycle-hook-name "worker-terminating" \
    --lifecycle-action-result "CONTINUE" \
    --region "$region"
fi
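The fixed 10-minute sleep is the least satisfying part of the script. One possible refinement, sketched here under assumptions, is to poll for a "worker is done" condition and fall back to the same 10-minute cap if it never becomes true. The helper name wait_until is made up for this example, and the check you hand it is left to you, since the right signal depends on your Concourse version and setup.

```shell
#!/bin/bash
# wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Runs CMD every INTERVAL seconds until it succeeds or TIMEOUT seconds have passed.
# Returns 0 if CMD succeeded, 1 if the timeout was reached first.
wait_until() {
  local timeout=$1
  local interval=$2
  shift 2

  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))

  while true; do
    if "$@"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}
```

In the termination script, sleep 10m could then become something like wait_until 600 15 worker_is_drained, where worker_is_drained is a hypothetical function you would write yourself. Whether the wait succeeds or times out, the script carries on to stop the service and complete the lifecycle action, so the worst case matches the original behaviour.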