### Building Containers

Avoid [common pitfalls and use best practices](https://gist.github.com/StevenACoffman/41fee08e8782b411a4a26b9700ad7af5).

### Deployment vs Pod

Pods are the fundamental Kubernetes building block for your containers, and yet you are told not to use Pods directly but through an abstraction such as a `Deployment`. Why is that, and what makes the difference?

If you deploy a Pod directly to your Kubernetes cluster, your container(s) will run, but nothing takes care of its lifecycle. Once a node goes down, or capacity on the current node is needed, the Pod is lost forever.

That's where building blocks such as `ReplicaSet` and `Deployment` come into play. A `ReplicaSet` acts as a supervisor to the Pods it watches and recreates Pods that no longer exist. `Deployments` are an even higher abstraction: they create and manage `ReplicaSets` so that the developer can declare a desired state instead of issuing imperative commands (e.g. `kubectl rolling-update`). The real advantage is that Deployments perform rolling updates automatically and always converge on a given target state, instead of forcing you to deal with imperative changes.
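As a minimal sketch of that declarative state (the `my-app` name, image and replica count are placeholders, not from the original), a Deployment that keeps three replicas of a Pod template running looks like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 3                  # the target state the Deployment keeps converging on
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0   # pin a specific tag, never :latest
```

If a node disappears, the underlying ReplicaSet recreates the missing Pods; updating the image tag triggers a rolling update instead of an imperative command.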
#### Container image tags

When creating Kubernetes Pods, it is **not advisable** to use the `latest` tag on container images. The first reason is that you can't be 100% sure which exact version of your software you are running. Let's dive a bit deeper.

Once Kubernetes creates a Pod for you, it assigns an `imagePullPolicy` to it. By default this is `IfNotPresent`, which means the container runtime only pulls the image if it is not already present on the node the Pod was assigned to. Once you use `latest` as an image tag, this default switches to `Always`, so the runtime pulls the image every time it starts a container using that image in a Pod. There are two really important reasons why this is a bad thing to do:

+ You lose control over which exact code is running in your system
+ Rolling updates/rollbacks are no longer possible

Let's dive deeper into this. Imagine you have version A of your software, tag it with `latest`, test version A in a CI system, and start a Pod on Node 1. You then tag version B of your software again with `latest`, Node 1 goes down, and your Pod is moved to Node 2 before version B was tested in CI.

+ Which version of your software will be running? Version B
+ Which version should be running? Version A
+ Can you immediately switch back to the previous version? No!

This simple scenario already shows some of the problems that will arise, and it is only the tip of the iceberg. You should always be able to see which tested version of your software is currently running, when it was deployed, and which source code (e.g. commit IDs) your image was built from, and you should know how to switch to a previous version easily, ideally with the push of a button.

A container image built from source might be tagged `GITSHA-d670460b4b4aece5915caf5c68d12f560a9fe3e4` so you can easily link the source code state to the generated artifact (which will hopefully be immutable). A container image built from a Java jar artifact might have a tag matching the Maven release version, e.g. `MVN-3.10.6.RELEASE`.

##### Use a non-root user inside the container

```dockerfile
RUN groupadd -r nodejs
RUN useradd -m -r -g nodejs nodejs
USER nodejs
```

Enforce it!

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  # specification of the pod’s containers
  # ...
  securityContext:
    runAsNonRoot: true
```

##### Make the filesystem read-only

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  - name: hello-world
    # specification of the pod’s container
    # ...
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true
```

(Note that `readOnlyRootFilesystem` is a container-level setting, so it goes in the container's `securityContext`.)

##### One process per container

##### Don’t restart on failure. Crash cleanly instead.

##### Log to stdout and stderr

##### Add "tini" or "dumb-init" to prevent zombie processes (Good news: no need to do this in K8s 1.7)

### Deployments

+ Use the “record” option for easier rollbacks

```
kubectl apply -f deployment.yaml --record
kubectl rollout history deployments my-deployment
```

+ Use plenty of descriptive labels
+ Use sidecar containers for proxies, watchers, etc.
+ Don’t use sidecars for bootstrapping!
+ Use init containers instead!
+ Don’t use `:latest` or no tag (as above)
+ Readiness and liveness probes are your friend

##### Health Checks

Readiness → Is the app ready to start serving traffic?

+ The Pod won’t be added to a Service endpoint until it passes
+ Required for a “production app” in my opinion

Liveness → Is the app still running?

+ The default is “the process is running”
+ It is possible for the process to be running but not working correctly
+ Good to define, might not be 100% necessary

These can sometimes be the same endpoint, but not always.
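A minimal sketch of both probes on a container (the image, paths, ports and timings are placeholder assumptions, not values from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  - name: hello-world
    image: registry.example.com/hello-world:1.0.0   # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:               # gates traffic: not added to Service endpoints until this passes
      httpGet:
        path: /healthz/ready      # placeholder path
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                # restarts the container if the app stops responding
      httpGet:
        path: /healthz/live       # placeholder path
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```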
### Ingress and AWS

We originally used the `LoadBalancer` service type, which uses the AWS integration in k8s to create an ELB and connect it to all nodes in the cluster. We found this had a few limitations:

+ It creates an ELB for every single service: we quickly ran into ELB limits, and it’s also costly with hundreds of services.
+ Each service ELB attaches to every single node in the cluster, which causes a large amount of unnecessary traffic from health checks. It increases as `O(m*n)`, where `m` = number of services and `n` = number of nodes.
+ It performs poorly in failure scenarios. If a single pod dies, the ELB might end up removing a significant portion of your cluster’s nodes because it doesn’t know which instance is unhealthy.

In modern k8s I would advise against using `LoadBalancer`-type services. Instead, use `NodePort`-type services and an ingress controller fronted by ELBs/NLBs. Ingress is a generic resource that defines how you want your service accessed externally; a controller runs in the cluster, watches Ingress resources, and sets up the external access. A benefit in AWS is that we could create a single shared ELB/NLB and handle virtual dispatch inside our cluster instead of outside of it, allowing us to avoid the drawbacks described above.

I would also avoid using ALBs. Their connection draining behaviour was broken when we last tried them a few months ago, which prevents zero-downtime deployments of the nginx component.

### Organizing resource configurations

Many applications require multiple resources to be created, such as a Deployment and a Service. Managing multiple resources can be simplified by grouping them together in the same file (separated by `---` in YAML). While this is convenient for the initial apply to the cluster, it is better to keep them separated, for several reasons:

+ The Deployment generally needs to be altered far more frequently than the Service.
+ You can gaplessly replace (delete and recreate) a Deployment but not a Service. Combining them makes for accidental service gaps.

### JVM in a container

Most of our applications are JVM based. Running the HotSpot JVM out of the box in a container is problematic: all the defaults, such as the number of GC threads and the sizing of memory pools, use the host system’s resources as a basis. This is not ideal, since the container is usually far more constrained than the host. In our case, we use `m4.4xlarge` nodes, which have 16 cores and 64Gi of RAM, while running pods that are usually limited to a couple of cores and a couple of Gi of RAM. It took us a while to figure out the right balance of options to get the JVM to behave well:

+ Set `-Xmx` to roughly half the size of the memory limit of the pod. There is a lot of memory overhead in HotSpot. This value also depends on the application, so it takes some work to find the right max heap to prevent an OOM kill.
+ Use `-XX:+AlwaysPreTouch` so we can more quickly confirm that our max heap setting is reasonable. Otherwise we might only get OOM killed under load.
+ Never limit CPU on a JVM container unless you really don’t care about long pause times. A CPU limit causes hard throttling of your application whenever garbage collection happens. The right choice is almost always burstable CPU: this lets the scheduler soft-clamp the application if it uses too much CPU relative to other applications on the node, while also letting it use as much CPU as is available on the machine.
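A minimal sketch of these settings, assuming a hypothetical `JAVA_OPTS` environment variable that the container’s entrypoint forwards to the JVM (the image name and the sizes are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jvm-app                      # placeholder name
spec:
  containers:
  - name: jvm-app
    image: registry.example.com/jvm-app:1.0.0     # placeholder image
    env:
    - name: JAVA_OPTS                # assumes the entrypoint passes this to the JVM
      value: "-Xmx1g -XX:+AlwaysPreTouch"          # roughly half of the 2Gi memory limit
    resources:
      requests:
        cpu: "1"                     # CPU request only: burstable, no hard throttling
        memory: 2Gi
      limits:
        memory: 2Gi                  # memory limit set, CPU limit deliberately omitted
```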
### Layer 4 connectivity between pods

Most of our applications, before we used k8s, relied on external load balancers to proxy client connections. This meant applications could be sloppy about graceful termination, since they could rely on the load balancer to drain connections gracefully. Most Java web frameworks do not shut down in a non-disruptive way. Instead, they behave much like nginx: in-flight requests are processed (although even that is/was bugged in many frameworks) and then client connections are dropped without further ado.

Once we moved these applications into k8s, they became responsible for their own graceful termination. The solution, after much trial and error, was to add a filter with a conditional that gets activated at shutdown. When activated, it adds a `connection: close` header to all responses and delays shutdown for a drain duration. Whenever a request comes in during the drain time, the client connection gets closed thanks to the header. This was relatively straightforward to implement, although the implementation varied based on the framework used. It solved all of our issues with errors on application deployment, giving us zero-downtime application deployments.

Other alternatives include using something like Linkerd or Istio to proxy all connections between pods, much like having a load balancer between each pod. We experimented with these, but none were as simple or reliable as simply having the application handle shutdown properly. Using an alternative RPC mechanism instead of HTTP/1.1 would probably take care of this as well.
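Alongside an in-app drain filter like the one described above, Kubernetes itself offers two knobs that pair well with it: `terminationGracePeriodSeconds` and a `preStop` hook. This is a minimal sketch of that complementary, cluster-side pattern (the durations and image are placeholder assumptions), not the filter implementation itself:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  terminationGracePeriodSeconds: 60   # give the app time to finish its drain period before SIGKILL
  containers:
  - name: hello-world
    image: registry.example.com/hello-world:1.0.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]    # assumes the image has a sleep binary; keeps serving briefly
                                      # so endpoints are removed before SIGTERM reaches the app
```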
        
# Governance

Best practices for creating, managing and administering namespaces.

## Namespace limits

When you decide to segregate your cluster into namespaces, you should protect against misuse of resources. You shouldn't allow your users to consume more resources than what you agreed on in advance.

Cluster administrators can set constraints to limit the number of objects or the amount of computing resources used in a namespace with quotas and limit ranges. You should check out the official documentation if you need a refresher on [limit ranges](https://kubernetes.io/docs/concepts/policy/limit-range/).

### Namespaces have LimitRange

Containers without limits can lead to resource contention with other containers and unoptimized consumption of computing resources. Kubernetes has two features for constraining resource utilisation: ResourceQuota and LimitRange.

With the LimitRange object, you can define default values for resource requests and limits for individual containers inside a namespace. Any container created inside that namespace without explicitly specified request and limit values is assigned the default values.

You should check out the official documentation if you need a refresher on [resource quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

### Namespaces have ResourceQuotas

With ResourceQuotas, you can limit the total resource consumption of all containers inside a namespace. Defining a resource quota for a namespace limits the total amount of CPU, memory or storage resources that can be consumed by all containers belonging to that namespace. You can also set quotas for other Kubernetes objects, such as the number of Pods in the current namespace.

If you're worried that someone could exploit your cluster and create 20000 ConfigMaps, using a ResourceQuota is how you can prevent that.
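A minimal sketch of both objects for a namespace (the namespace name and the numbers are placeholder assumptions):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a                 # placeholder namespace
spec:
  limits:
  - type: Container
    defaultRequest:                 # applied when a container specifies no requests
      cpu: 100m
      memory: 128Mi
    default:                        # applied when a container specifies no limits
      cpu: 500m
      memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    count/configmaps: "100"         # caps the number of ConfigMaps in the namespace
```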
## Pod security policies

When a Pod is deployed into the cluster, you should guard against:

- the container being compromised
- the container using resources on the node that it shouldn't, such as processes, the network or the file system

More generally, you should restrict what the Pod can do to the bare minimum.

### Enable Pod Security Policies

For example, you can use Kubernetes Pod Security Policies to restrict:

- access to the host process or network namespace
- running privileged containers
- the user that the container runs as
- access to the host filesystem
- Linux capabilities, Seccomp or SELinux profiles

Choosing the right policy depends on the nature of your cluster. The following article explains some of the [Kubernetes Pod Security Policy best practices](https://resources.whitesourcesoftware.com/blog-whitesource/kubernetes-pod-security-policy).

### Disable privileged containers

In a Pod, containers can run in "privileged" mode and gain almost unrestricted access to resources on the host system. While there are specific use cases where this level of access is necessary, in general it's a security risk to let your containers do this. Valid use cases for privileged Pods include using hardware on the node, such as GPUs.

You can [learn more about security contexts and privileged containers from this article](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).

### Use a read-only filesystem in containers

Running a read-only file system in your containers forces them to be immutable. Not only does this mitigate some old (and risky) practices such as hot patching, it also helps you prevent the risk of malicious processes storing or manipulating data inside a container.

Running containers with a read-only file system might sound straightforward, but it can come with some complexity. _What if you need to write logs or store files in a temporary folder?_ You can learn about the trade-offs in this article on [running containers securely in production](https://medium.com/@axbaretto/running-docker-containers-securely-in-production-98b8104ef68).

### Prevent containers from running as root

A process running in a container is no different from any other process on the host, except that it has a small piece of metadata declaring that it's in a container. Hence, root in a container is the same root (uid 0) as on the host machine.

If a user manages to break out of an application running as root in a container, they may be able to gain access to the host with the same root user. Configuring containers to use unprivileged users is the best way to prevent privilege escalation attacks. If you wish to learn more, the following [article offers detailed explanations and examples of what happens when you run your containers as root](https://medium.com/@mccode/processes-in-containers-should-not-run-as-root-2feae3f0df3b).

### Limit capabilities

Linux capabilities give processes the ability to perform some of the many privileged operations that only the root user can do by default. For example, `CAP_CHOWN` allows a process to "make arbitrary changes to file UIDs and GIDs".

Even if your process doesn't run as `root`, there's a chance that it could use those root-like features by escalating privileges. In other words, you should enable only the capabilities that you need if you don't want to be compromised.

_But which capabilities should be enabled, and why?_ The following two articles dive into the theory and the practical best practices about capabilities in the Linux kernel:

- [Linux Capabilities: Why They Exist and How They Work](https://blog.container-solutions.com/linux-capabilities-why-they-exist-and-how-they-work)
- [Linux Capabilities In Practice](https://blog.container-solutions.com/linux-capabilities-in-practice)

### Prevent privilege escalation

You should run your containers with privilege escalation turned off to prevent escalating privileges through `setuid` or `setgid` binaries.
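A minimal sketch of a container-level `securityContext` that applies the recommendations above (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  - name: hello-world
    image: registry.example.com/hello-world:1.0.0   # placeholder image
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true
      privileged: false
      allowPrivilegeEscalation: false   # blocks setuid/setgid privilege escalation
      capabilities:
        drop: ["ALL"]                    # then add back only what the app truly needs
```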
## Network policies

A Kubernetes network must adhere to three basic rules:

1. **Containers can talk to any other container in the network**, and there's no translation of addresses in the process (i.e. no NAT is involved)
1. **Nodes in the cluster can talk to any other container in the network and vice-versa**. Even in this case, there's no translation of addresses (i.e. no NAT)
1. **A container's IP address is always the same**, whether it is seen from another container or from itself.

The first rule isn't helpful if you plan to segregate your cluster into smaller chunks and have isolation between namespaces. _Imagine if a user in your cluster were able to reach any other service in the cluster._ Now, _imagine if a malicious user were to obtain access to the cluster_: they could make requests to the whole cluster.

To fix that, you can define how Pods should be allowed to communicate in the current namespace and across namespaces using Network Policies.

### Enable network policies

Kubernetes network policies specify the access permissions for groups of pods, much like security groups in the cloud are used to control access to VM instances. In other words, they create firewalls between pods running on a Kubernetes cluster. If you are not familiar with Network Policies, you can read [Securing Kubernetes Cluster Networking](https://ahmet.im/blog/kubernetes-network-policy/).

### There's a conservative NetworkPolicy in every namespace

The linked repository contains various use cases of Kubernetes Network Policies and sample YAML files to leverage in your setup. If you ever wondered [how to drop/restrict traffic to applications running on Kubernetes](https://github.com/ahmetb/kubernetes-network-policy-recipes), read on.
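As a minimal sketch of such a conservative default (the namespace name is a placeholder), a deny-all-ingress policy that selects every Pod in the namespace looks like this; more targeted policies then explicitly allow the traffic you actually want:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a                 # placeholder namespace
spec:
  podSelector: {}                   # an empty selector matches every Pod in the namespace
  policyTypes:
  - Ingress                         # no ingress rules listed, so all inbound traffic is denied
```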
## Role-Based Access Control (RBAC) policies

Role-Based Access Control (RBAC) allows you to define policies on how resources in your cluster may be accessed.

It's common practice to grant the fewest permissions needed, _but what is practical, and how do you quantify least privilege?_ Fine-grained policies provide greater security but require more effort to administer. Broader grants can give unnecessary API access to service accounts but are easier to control.

_Should you create a single policy per namespace and share it? Or perhaps it's better to have them on a more granular basis?_ There's no one-size-fits-all approach, and you should judge your requirements case by case.

_But where do you start?_ If you start with a Role with empty rules, you can add all the resources that you need one by one and still be sure that you're not giving away too much.

### Disable auto-mounting of the default ServiceAccount

Please note that [the default ServiceAccount is automatically mounted into the file system of all Pods](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#use-the-default-service-account-to-access-the-api-server). You might want to disable that and provide more granular policies.

### RBAC policies are set to the least amount of privileges necessary

It's challenging to find good advice on how to set up your RBAC rules. In [3 realistic approaches to Kubernetes RBAC](https://thenewstack.io/three-realistic-approaches-to-kubernetes-rbac/), you can find three practical scenarios and practical advice on how to get started.

### RBAC policies are granular and not shared

Zalando has a concise policy to define roles and ServiceAccounts. First, they describe their requirements:

- Users should be able to deploy, but they shouldn't be allowed to read Secrets, for example
- Admins should get full access to all resources
- Applications should not gain write access to the Kubernetes API by default
- It should be possible to write to the Kubernetes API for some uses

The four requirements translate into five separate Roles:

- ReadOnly
- PowerUser
- Operator
- Controller
- Admin

You can read about [their decision in this link](https://kubernetes-on-aws.readthedocs.io/en/latest/dev-guide/arch/access-control/adr-004-roles-and-service-accounts.html).

## Custom policies

Even if you're able to assign policies in your cluster to resources such as Secrets and Pods, there are some cases where Pod Security Policies (PSPs), Role-Based Access Control (RBAC), and Network Policies fall short.

As an example, you might want to avoid downloading containers from the public internet and prefer to approve those containers first. Perhaps you have an internal registry, and only the images in this registry can be deployed in your cluster. _How do you enforce that only **trusted containers** can be deployed in the cluster?_

There's no RBAC policy for that. Network policies won't work. _What should you do?_

You could use the [Admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) to vet resources that are submitted to the cluster.

### Allow deploying containers only from known registries

One of the most common custom policies that you might want to consider is restricting which images can be deployed in your cluster. [The following tutorial explains how you can use the Open Policy Agent to reject images that are not approved](https://blog.openpolicyagent.org/securing-the-kubernetes-api-with-open-policy-agent-ce93af0552c3#3c6e).

### Enforce uniqueness in Ingress hostnames

When a user creates an Ingress manifest, they can use any hostname in it.

```yaml|highlight=7|title=ingress.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
  - host: first.example.com
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: 80
```

However, you might want to prevent users from using **the same hostname multiple times** and overriding each other. The official documentation for the Open Policy Agent has [a tutorial on how to check Ingress resources as part of the validation webhook](https://www.openpolicyagent.org/docs/latest/kubernetes-tutorial/#4-define-a-policy-and-load-it-into-opa-via-kubernetes).

### Only use approved domain names in the Ingress hostnames

When a user creates an Ingress manifest, they can use any hostname in it.

```yaml|highlight=7|title=ingress.yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
  - host: first.example.com
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: 80
```

However, you might want to prevent users from using **invalid hostnames**. The official documentation for the Open Policy Agent has [a tutorial on how to check Ingress resources as part of the validation webhook](https://www.openpolicyagent.org/docs/latest/kubernetes-tutorial/#4-define-a-policy-and-load-it-into-opa-via-kubernetes).