

@ahume
Last active December 8, 2021 20:40

Revisions

  1. ahume revised this gist Dec 28, 2019. No changes.
  2. ahume revised this gist Dec 28, 2019. No changes.
  3. ahume revised this gist Mar 1, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions concourse.md
    @@ -12,7 +12,7 @@ Comments/questions welcome below.
    * nginx-ingress-controller (TLS termination, routing)
    * kube-lego (letsencrypt certificates)
    * preemptible-killer (controlled shutdown of preemptible VM instances)
    - * worker-pruner (periodically checks for and kills stalled workers)
    + * delete-stalled-concourse-workers (periodically checks for and kills stalled workers)

    ## GKE/Kubernetes

    @@ -69,4 +69,4 @@ We have experimented with shutdown scripts on preemptible nodes, but cannot get

    ### stalled worker cleanup

    - The controller runs in the cluster and every minute checks for stalled workers via the Concourse API. If it finds any it prunes them. Can release code for this if useful.
    + We run [delete-stalled-concourse-workers](https://github.com/ahume/delete-stalled-concourse-workers) in the cluster which every minute checks for stalled workers via the Concourse API. If it finds any it prunes them.
  4. ahume revised this gist Feb 4, 2018. 1 changed file with 4 additions and 3 deletions.
    7 changes: 4 additions & 3 deletions concourse.md
    @@ -1,8 +1,8 @@
    # Concourse on Kubernetes

    - This document outlines Brandwatch's Concourse installation running on Kubernetes. The full configuration can be found at https://github.com/BrandwatchLtd/concourse-ops (internal only currently).
    + This document outlines Brandwatch's Concourse installation running on Kubernetes. The full configuration can be found at https://github.com/BrandwatchLtd/concourse-ops (internal only currently). It's a fairly new installation (1-2 weeks) and we're slowly migrating work from our existing BOSH installation to this.

    - Please ask questions in comments below.
    + Comments/questions welcome below.

    ## Summary

    @@ -16,7 +16,7 @@ Please ask questions in comments below.

    ## GKE/Kubernetes

    - * Kubernetes nodes run Ubuntu images, to allow for overlay baggageclaimDriver. We did not find any configuration that could run successfully on COS instances.
    + * Kubernetes nodes run Ubuntu images, to allow for `overlay` baggageclaimDriver. We did not find any configuration that could run successfully on COS instances.
    * Runs across 2 AZs (so we can run minimum 2 nodes in a node-pool)
    * Cluster split into two node-pools
    * node-pool for Concourse Workers (auto-scaling). n1-standard-4 machines. We’ve generally found much better behaviour from workers once they have around 4 CPUs available.
    @@ -40,6 +40,7 @@ The [Nginx Ingress Controller](https://github.com/kubernetes/ingress-nginx) is a
    * v0.9.0
    * 2 replicas
    * kube-system/default-http-backend
    + * Service bound to Google Network Load Balancer IP

    ## Prometheus

  5. ahume revised this gist Feb 4, 2018. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions concourse.md
    @@ -1,3 +1,5 @@
    + # Concourse on Kubernetes

    This document outlines Brandwatch's Concourse installation running on Kubernetes. The full configuration can be found at https://github.com/BrandwatchLtd/concourse-ops (internal only currently).

    Please ask questions in comments below.
  6. ahume revised this gist Feb 4, 2018. 1 changed file with 10 additions and 4 deletions.
    14 changes: 10 additions & 4 deletions concourse.md
    @@ -50,14 +50,20 @@ Prometheus is installed via the [Prometheus operator](https://github.com/coreos/

    The [kube-lego](https://github.com/jetstack/kube-lego) process runs in the cluster and finds Ingress objects requiring TLS certificates. It deals with letsencrypt and setting up the HTTP challenge. Installed via the helm [stable chart](https://github.com/kubernetes/charts/tree/master/stable/kube-lego).

    - ## preemptible-killer
    + ## Preemptible workarounds

    + There's a bunch of clutter related to wanting to run workers on preemptible GKE instances. Preemptible GKE instances cost approximately 30% of the price of standard instances but can be preempted (shut down) at any time, and at least once every 24h.

    + If you are happy paying for non-preemptible instances you'll likely get more worker stability without any of these workarounds. On the other hand, you never know when a node will die underneath you for other reasons, so this is a more general problem that would be good to solve.

    + ### preemptible-killer

    https://github.com/estafette/estafette-gke-preemptible-killer

    - A basic attempt to control preemptible VM shutdowns. The controller adds annotations to preemptible nodes and within 24 hours does a controlled termination of all pods and shuts down the VM. This is preferable to the VM dying underneath us with no warning, which leads to stalled workers.
    + A basic attempt to control preemptible VM shutdowns. The controller adds annotations to preemptible nodes and within 24 hours does a controlled termination of all pods and shuts down the VM. This is preferable to the VM dying underneath us with no warning, which leads to stalled workers. We will likely adapt this to force a restart of preemptible VMs just before working hours, to reduce the chance of forced restarts during working hours.

    We have experimented with shutdown scripts on preemptible nodes, but cannot get them to successfully delete worker pods during the shutdown phase. More experimentation required here, because I don’t understand why it’s not possible. We currently work around this problem with…

    - ## stalled worker cleanup
    + ### stalled worker cleanup

    - The controller runs in the cluster and every minute checks for stalled workers via the Concourse API. If it finds any it prunes them.
    + The controller runs in the cluster and every minute checks for stalled workers via the Concourse API. If it finds any it prunes them. Can release code for this if useful.
  7. ahume created this gist Feb 4, 2018.
    63 changes: 63 additions & 0 deletions concourse.md
    @@ -0,0 +1,63 @@
    This document outlines Brandwatch's Concourse installation running on Kubernetes. The full configuration can be found at https://github.com/BrandwatchLtd/concourse-ops (internal only currently).

    Please ask questions in comments below.

    ## Summary

    * Google GKE
    * ConcourseCI (from [stable/concourse](https://github.com/kubernetes/charts/tree/master/stable/concourse) chart)
    * Prometheus / Alert Manager (Metrics, monitoring, alerting)
    * nginx-ingress-controller (TLS termination, routing)
    * kube-lego (letsencrypt certificates)
    * preemptible-killer (controlled shutdown of preemptible VM instances)
    * worker-pruner (periodically checks for and kills stalled workers)

    ## GKE/Kubernetes

    * Kubernetes nodes run Ubuntu images, to allow for overlay baggageclaimDriver. We did not find any configuration that could run successfully on COS instances.
    * Runs across 2 AZs (so we can run minimum 2 nodes in a node-pool)
    * Cluster split into two node-pools
    * node-pool for Concourse Workers (auto-scaling). n1-standard-4 machines. We’ve generally found much better behaviour from workers once they have around 4 CPUs available.
    * node-pool for everything else. n1-standard-2 machines.
    * All instances are currently preemptible, so we trade off *some* stability of workers for much reduced cost (but continue to work on increasing stability).
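
    A rough sketch of how node-pools like these can be created with `gcloud` (cluster name, zone and node counts below are placeholders, not taken from the actual setup):

    ```sh
    # Illustrative only: cluster name, zone and node counts are assumptions.
    # Worker pool: preemptible, Ubuntu images (needed for the overlay driver),
    # autoscaling n1-standard-4 nodes.
    gcloud container node-pools create concourse-workers \
      --cluster my-concourse-cluster --zone europe-west1-b \
      --machine-type n1-standard-4 --image-type UBUNTU \
      --preemptible \
      --enable-autoscaling --min-nodes 2 --max-nodes 6

    # Pool for everything else (web, ingress controller, Prometheus, ...).
    gcloud container node-pools create concourse-system \
      --cluster my-concourse-cluster --zone europe-west1-b \
      --machine-type n1-standard-2 --image-type UBUNTU \
      --preemptible \
      --num-nodes 2
    ```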

    ## ConcourseCI

    Concourse is installed via the Helm charts.
    * Concourse v3.8.0 currently
    * baggageclaimDriver: overlay
    * Two web replicas
    * Between 2-6 workers (we scale up/down for work/non-work hours)
    * Service: clusterIP
    * Ingress (uses nginx-ingress-controller)
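
    The values this implies look roughly like the sketch below. Value paths are from memory of the stable/concourse chart of that era, so treat them as assumptions and check `helm inspect values stable/concourse` against your chart version.

    ```sh
    # Sketch only: release/namespace names and exact value paths are assumptions.
    helm install stable/concourse \
      --name concourse --namespace concourse \
      --set concourse.baggageclaimDriver=overlay \
      --set web.replicas=2 \
      --set worker.replicas=2 \
      --set web.service.type=ClusterIP \
      --set web.ingress.enabled=true
    ```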

    ## nginx-ingress-controller

    The [Nginx Ingress Controller](https://github.com/kubernetes/ingress-nginx) is pretty vanilla, installed via the helm [stable chart](https://github.com/kubernetes/ingress-nginx).

    * v0.9.0
    * 2 replicas
    * kube-system/default-http-backend
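
    A minimal sketch of that installation (the release name and the load-balancer IP are placeholders):

    ```sh
    # Sketch: 2 controller replicas; optionally pin the Service to a
    # pre-allocated Google Network Load Balancer IP (placeholder below).
    helm install stable/nginx-ingress \
      --name nginx-ingress --namespace kube-system \
      --set controller.replicaCount=2 \
      --set controller.service.loadBalancerIP=203.0.113.10
    ```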

    ## Prometheus

    Prometheus is installed via the [Prometheus operator](https://github.com/coreos/prometheus-operator).

    * 1 replica
    * 2 alert-managers
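
    The operator works from custom resources; a sketch of the two objects implied above (names and namespace are placeholders):

    ```sh
    # Sketch: the Prometheus operator turns these custom resources into the
    # actual Prometheus (1 replica) and Alertmanager (2 replicas) instances.
    kubectl apply --namespace monitoring -f - <<'EOF'
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: concourse
    spec:
      replicas: 1
      serviceMonitorSelector: {}
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: concourse
    spec:
      replicas: 2
    EOF
    ```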

    ## kube-lego

    The [kube-lego](https://github.com/jetstack/kube-lego) process runs in the cluster and finds Ingress objects requiring TLS certificates. It deals with letsencrypt and setting up the HTTP challenge. Installed via the helm [stable chart](https://github.com/kubernetes/charts/tree/master/stable/kube-lego).
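
    A hedged sketch of that install, plus the Ingress annotation kube-lego acts on (the email is a placeholder; value names are from memory of the stable/kube-lego chart):

    ```sh
    # Sketch: kube-lego watches Ingresses annotated with
    #   kubernetes.io/tls-acme: "true"
    # and requests/renews letsencrypt certificates for their hosts.
    helm install stable/kube-lego \
      --name kube-lego --namespace kube-system \
      --set config.LEGO_EMAIL=ops@example.com \
      --set config.LEGO_URL=https://acme-v01.api.letsencrypt.org/directory
    ```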

    ## preemptible-killer

    https://github.com/estafette/estafette-gke-preemptible-killer

    A basic attempt to control preemptible VM shutdowns. The controller adds annotations to preemptible nodes and within 24 hours does a controlled termination of all pods and shuts down the VM. This is preferable to the VM dying underneath us with no warning, which leads to stalled workers.

    We have experimented with shutdown scripts on preemptible nodes, but cannot get them to successfully delete worker pods during the shutdown phase. More experimentation required here, because I don’t understand why it’s not possible. We currently work around this problem with…
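
    For concreteness, the kind of shutdown script meant here is a GCE `shutdown-script` set via instance metadata, roughly like the sketch below (labels, namespace and on-node kubectl credentials are all assumptions, and as noted we never got this approach to work reliably):

    ```sh
    #!/bin/bash
    # Sketch of a GCE shutdown-script that tries to delete the Concourse worker
    # pods scheduled on this node before the preemptible VM disappears.
    # Assumptions: kubectl is available on the node with sufficient credentials,
    # and worker pods carry the label app=concourse-worker.
    NODE_NAME="$(hostname)"
    kubectl delete pods \
      --namespace concourse \
      --selector app=concourse-worker \
      --field-selector "spec.nodeName=${NODE_NAME}" \
      --grace-period=30
    ```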

    ## stalled worker cleanup

    The controller runs in the cluster and every minute checks for stalled workers via the Concourse API. If it finds any it prunes them.
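
    The check-and-prune loop itself is small; a minimal sketch using the `fly` CLI rather than the raw API (the target name is a placeholder, and it assumes `fly` is already logged in):

    ```sh
    # Sketch: prune every worker fly reports as stalled. Run this on a schedule
    # (e.g. a Kubernetes CronJob every minute).
    fly -t ci workers \
      | awk '/stalled/ {print $1}' \
      | while read -r worker; do
          fly -t ci prune-worker --worker "${worker}"
        done
    ```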