# Upgrading Kubernetes Cluster with Kops, and Things to Watch Out For

*Created by stevenc81 on Feb 28, 2019*

Alright! I'd like to apologize for over a year of inactivity. Very embarrassingly, I totally dropped the good habit. Anyways, today I'd like to share a not-so-advanced and much shorter walkthrough on how to upgrade Kubernetes with kops.

At Buffer, we host our own k8s (Kubernetes for short) cluster on AWS EC2 instances, since we started our journey before [AWS EKS](https://aws.amazon.com/eks/). To do this effectively, we use [kops](https://github.com/kubernetes/kops). It's an amazing tool that manages pretty much all aspects of cluster management, from creation and upgrades to updates and deletion. It has never failed us.

## How to start?

Okay, upgrading a cluster always makes people nervous, especially a production cluster. Trust me, I've been there! There is a saying: **hope is not a strategy**. So instead of hoping things will go smoothly, I always assume that shit will hit the fan if you skip testing. Plus, good luck explaining to people that the cluster is now down because someone decided to try it out and see what happens.

So what do we do? Simple! We create a new cluster running the same version as the production cluster, deploy a basic testing web application to it, and throw traffic at it during the upgrade to make sure the service isn't interrupted. Let me break down the steps for you in more detail.
We want the test run to answer a few questions:

- Does the upgrade break any running application?
- Does the new version work well with the existing EC2 instance type?
  - There are cases where an upgraded version uses a new AWS AMI that doesn't work with a particular instance type, causing a cluster-wide halt in which no container can be created.
- Between the versions we are upgrading `from` and `to`, do the underpinning services still work smoothly?
  - We look into critical networking components like `flannel` and `kube-dns`. If we spot any hiccups, we should take a step back and fully investigate before proceeding.

## Creating a new testing cluster

Creating a cluster with kops on AWS is a little tricky the first time, as some critical AWS resources have to be created up front. Here is [the guide](https://github.com/kubernetes/kops/blob/master/docs/aws.md) on how to do that. You only need to do it once, ever. After that, creating a cluster is as easy as running this command:

```
kops create cluster \
  --name steven.k8s.com \
  --cloud aws \
  --master-size m4.large \
  --master-zones=us-east-1b,us-east-1c,us-east-1d \
  --node-size m4.xlarge \
  --zones=us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f \
  --node-count=3 \
  --kubernetes-version=1.11.6 \
  --vpc=vpc-1234567 \
  --network-cidr=10.0.0.0/16 \
  --networking=flannel \
  --authorization=RBAC \
  --ssh-public-key="~/.ssh/kube_aws_rsa.pub" \
  --yes
```

It's important to note that we are upgrading a production cluster to a newer version, so we want the testing cluster to be as close to the production one as possible. Make sure `--kubernetes-version` matches the current production version and `--networking` uses the same overlay network. It's also good to keep the instance types the same to avoid compatibility issues; I used to have some issues with m5 EC2 instances for master nodes. The point is to make the dry-run upgrade behave as if it were the production cluster.

Once the cluster is created, deploy a basic web service to it. We will use it to establish some baseline test results.
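The basic web service can be as small as a stock nginx Deployment plus a Service. A sketch that emits such a manifest for piping into `kubectl apply -f -` (all names here, like `baseline-web`, are illustrative and not from the original):

```shell
#!/bin/sh
# Print a minimal Deployment + Service manifest for a throwaway test app.
# Intended use on the testing cluster: test_app_manifest | kubectl apply -f -
test_app_manifest() {
  cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: baseline-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: baseline-web
  template:
    metadata:
      labels:
        app: baseline-web
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: baseline-web
spec:
  type: LoadBalancer
  selector:
    app: baseline-web
  ports:
  - port: 80
    targetPort: 80
EOF
}

test_app_manifest
```

With `type: LoadBalancer` on AWS you get an ELB hostname to benchmark against; running more than one replica matters later, when nodes start getting replaced during the rolling update.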
## Testing the service

In this step we want to gather baseline results so we can determine whether the service is impacted during the cluster upgrade, then adjust some parameters (more on this later). I usually use [Apache Benchmark](https://httpd.apache.org/docs/2.4/programs/ab.html) for this:

`ab -n 100000 -c 10 -l http://<URL to the service>`

This throws 100000 requests with a concurrency of 10 at the service. Let it run and copy the results down. We will compare them to the results gathered during the cluster upgrade to ensure the service isn't impacted, which is the critical thing during the production upgrade.

## Dry-run the upgrade

### Cluster upgrade dry-run

```
kops edit cluster
# Update the version number
kops update cluster steven.k8s.com
```

### Cluster upgrade (w/o interval)

```
kops update cluster steven.k8s.com --yes
kops rolling-update cluster
kops rolling-update cluster --yes
```

Once the commands above are issued, the cluster starts a rolling update that takes one node off at a time. This is the moment we have been waiting for: we want to know whether the testing service keeps running during the process. While the upgrade is underway, throw some traffic at it:

`ab -n 100000 -c 10 -l http://<URL to the service>`

The results should be similar to the baseline results. If not, we may consider adding a time interval to allow more time during node creation/termination. That will definitely help reduce interruption, but it will prolong the entire process. That's a trade-off I'm happy to make to ensure stability. We can use the following commands to add a time interval.

### Cluster upgrade (w/ 10-minute interval)

```
kops update cluster steven.k8s.com --yes
kops rolling-update cluster
kops rolling-update cluster --master-interval=10m --node-interval=10m --yes -v 10
```

### Final checkup

Once we are confident that the upgrade won't impact the existing services, we are nearly ready for an actual upgrade.
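Before moving on, it helps to compare the baseline and during-upgrade `ab` reports by their headline numbers. A small helper for pulling those out (a sketch; `ab_summary` is an illustrative name, and the parsing assumes ab's standard report format):

```shell
#!/bin/sh
# Extract "Requests per second" and "Failed requests" from an ab report
# on stdin, printed as "<rps> <failed>".
ab_summary() {
  awk '
    /^Requests per second:/ { rps = $4 }
    /^Failed requests:/     { failed = $3 }
    END { print rps, failed }
  '
}

# Example with a captured report; normally you would pipe the real run:
#   ab -n 100000 -c 10 -l http://<URL to the service> | ab_summary
ab_summary <<'EOF'
Concurrency Level:      10
Complete requests:      100000
Failed requests:        12
Requests per second:    2148.37 [#/sec] (mean)
Time per request:       4.655 [ms] (mean)
EOF
# prints: 2148.37 12
```

If the requests-per-second figure drops sharply or the failed-request count climbs during the rolling update, that's the signal to add the time interval described above.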
This is just a small checklist to make sure everything important is still working after the upgrade:

- kube-flannel
- kube-dns
- kube-dns-autoscaler
- dns-controller
- kube-apiserver
- kube-controller-manager
- kube-proxy-ip
- etcd-server-events x 3
- etcd-server-ip x 3

If all the pods are operational, we should be ready for the real action.

## Actual upgrade on the production cluster

The duration of an upgrade is roughly the number of nodes multiplied by the time interval between node replacements. A more reliable time interval is 10 minutes, so a cluster of 50 nodes needs 500 minutes. Be sure to schedule that much time. Once the time is decided, and the testing steps have passed, we are ready!

### Common issues

#### Cluster not validating during upgrade

This happens when one or more nodes refuse to go into the Ready state, or when there are non-Ready pods inside the `kube-system` namespace. Once the issue is resolved, we can use this command to verify:

`kops validate cluster`

It's quite common for lingering failing pods inside `kube-system` to go unnoticed. Deleting the failing pods, or even their deployments, will fix it.

#### Upgrade getting stuck

The rolling upgrade performed by `kops` works by taking one node off at a time and replacing it with a new one running the newer version. This approach generally works, until a particular deployment requires at least one running (available) pod for basic availability and only one replica is set. Taking a node down with such a pod running on it results in an `availability budget` violation, causing the eviction to fail and the cluster upgrade to halt. To tackle this, simply delete the pod in question and the upgrade will continue evicting the node. We have typically seen this type of violation on nginx ingress controllers in various namespaces.
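It's cheap to hunt for these single-replica deployments before the upgrade starts rather than mid-rolling-update. A sketch that filters `kubectl get deployments` output (the column layout assumed here matches kubectl of that era, NAMESPACE NAME DESIRED CURRENT ...; it may differ on other versions, and the function name is made up):

```shell
#!/bin/sh
# Print "namespace/name" for every deployment with a desired replica
# count of 1, reading `kubectl get deployments --all-namespaces
# --no-headers` style output on stdin.
single_replica_deployments() {
  awk '$3 == 1 { print $1 "/" $2 }'
}

# Typical use against a live cluster:
#   kubectl get deployments --all-namespaces --no-headers | single_replica_deployments
#
# Example with captured output:
single_replica_deployments <<'EOF'
default        web             3   3   3   3   40d
ingress-nginx  nginx-ingress   1   1   1   1   200d
EOF
# prints: ingress-nginx/nginx-ingress
```

Anything this flags is a candidate to scale up (or to expect an eviction stall from) before kicking off the rolling update.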
Upgrading the cluster with the verbose logging option helps to identify this type of violation (please note the `-v 10` option):

`kops rolling-update cluster --master-interval=10m --node-interval=10m --yes -v 10`

If the upgrade halts for some other, unknown reason, I recommend draining the node with this command:

`kubectl drain --ignore-daemonsets --delete-local-data --force <NODE>`

Then you can terminate the node manually via the AWS EC2 console.

## Closing words

In my experience with kops, upgrades are generally very safe because of the nature of the rolling update. If you wish to revert to the original version mid-process, you can simply kill the process (`kops rolling-update cluster`), edit the version back, and restart the `rolling-update` command. I have only the `kops` team to thank for this smooth experience. Because of it, Buffer as a company is able to have this level of confidence in Kubernetes, and we have migrated the majority of our services to it.