### Topology and Affinity notes

```bash
oc label nodes master-0 node=node0 zone=zoneA --overwrite
oc label nodes master-1 node=node1 zone=zoneB --overwrite
oc label nodes master-2 node=node2 zone=zoneC --overwrite
```

```
+--------+        +--------+        +--------+
|        |        |        |        |        |
| ZONE A |        | ZONE B |        | ZONE C |
|        |        |        |        |        |
+--------+        +--------+        +--------+
 |_ api-ext-0      |_ api-ext-1      |_ api-ext-2
 |_ api-int-0      |_ api-int-1      |_ api-int-2
```

```yaml
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        service: glance
```

In this case we can observe how the `TopologySpreadConstraints` aligns with the default `preferredAntiAffinityRules`: the system is stable and the scheduler is able to schedule `Pods` in each zone.

The `topologySpreadConstraints` is applied to Pods that match the specified `labelSelector`, regardless of whether they are replicas of the same Pod or different Pods. A key point to understand is that in Kubernetes there is no real concept of "replicas of the same Pod" at the scheduler level: what we commonly call "replicas" are individual Pods that share the same labels and are created by a controller (a Deployment, StatefulSet, etc.). Each Pod is scheduled independently, even if it was created as part of the same set of replicas.

In the example above, the `topologySpreadConstraints` applies to **ALL** 6 Pods, because they all match the `labelSelector` `service: glance`. The scheduler tries to spread them across the nodes according to the constraint, treating them as a single group of 6 Pods, not as two separate groups (external and internal API) of 3 replicas each.

When we define a `TopologySpreadConstraints`, `maxSkew` plays an important role. The Kubernetes scheduler evaluates the constraint by computing, for each topology domain (a zone, in our case), the skew:

```
skew = podsInZone - minPodsInAnyZone
```

Where:

- **podsInZone**: the number of matching Pods in the zone being evaluated
- **minPodsInAnyZone**: the number of matching Pods in the least populated eligible zone (the global minimum)

For example, with `7 pods` and `3 zones`:

```
If distribution is [3,2,2], max skew is 3 - 2 = 1
If distribution is [4,2,1], max skew is 4 - 1 = 3
```

The `maxSkew` parameter represents **the maximum allowed difference between the most populated zone and the least populated one**. If we set `maxSkew: 1`:

```
- [3,2,2] would be allowed (skew 1 <= 1)
- [4,2,1] would not be allowed (skew 3 > 1)
```

In summary:

- the scheduler tries to minimize skew while respecting other constraints
- a higher `maxSkew` allows a more uneven distribution
- a lower `maxSkew` enforces a more balanced distribution
- `maxSkew: 1` is a common choice to reach a reasonable balance
- `whenUnsatisfiable: DoNotSchedule` prevents exceeding `maxSkew`
- `whenUnsatisfiable: ScheduleAnyway` allows exceeding it if necessary (see the sketch below)
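For comparison, here is a minimal sketch of the same spread rule relaxed with `whenUnsatisfiable: ScheduleAnyway`; the CR name is purely illustrative, everything else mirrors the `glance-default-spread-pods` example above. With this setting the scheduler still tries to honor `maxSkew`, but it will place a Pod even when the constraint cannot be satisfied (for example, while a zone is temporarily unschedulable).

```yaml
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  # illustrative name: same spread rule as glance-default-spread-pods,
  # but expressed as a soft (best-effort) constraint
  name: glance-default-spread-pods-soft
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    # ScheduleAnyway: prefer to respect maxSkew, but never block scheduling
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        service: glance
```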
To spread different types of Pods independently, we need to use different labels and a different topology spread constraint for each type. For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`, called `czone`), and spread the resulting Pods across `zoneA`, `zoneB` and `zoneC`. To select only the `czone` glanceAPI we rely on `matchExpressions`, which fits well a context where we do not necessarily propagate the same label keys from the top level CR to the resulting Pods. The following example selects the `glance-czone-edge-api` Pods and spreads them across the existing Kubernetes nodes.

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
```

To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef` pointing to `glance-czone-spread-pods` has been applied.

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |_ czone-edge-0     |_ czone-edge-1     |_ czone-edge-2
```

However, a quite common scenario is the exact opposite model, where `Pods` are scheduled in a specific zone. In our example, the idea is to take all the `Pods` that belong to `glanceAPI: glance-czone-edge` and schedule all of them in `zoneC` (which corresponds to `master-2` in a three node environment). To achieve this goal, we create the following topology CR:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneC
```

The above establishes a **nodeAffinity** to make sure we schedule the `czone` glanceAPI Pods in `zoneC`. In other words, we **require** the `czone` `Pods` to be scheduled on a node that carries the `zone: zoneC` label, and this condition is stronger than the `preferredAntiAffinityRules` applied by default to the StatefulSet. Note that in this case we do not need any `TopologySpreadConstraint`, because we are not really interested in the Pod distribution: we are trying to achieve isolation between AZs.

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
                                         |_ czone-edge-0
                                         |_ czone-edge-1
                                         |_ czone-edge-2
```

The picture above can be checked with the following:

```
Every 2.0s: oc get pods -l service=glance -o wide

NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-czone-edge-api-0         3/3     Running   0          3m13s   10.128.1.172   master-2
glance-czone-edge-api-1         3/3     Running   0          3m25s   10.128.1.171   master-2
glance-czone-edge-api-2         3/3     Running   0          3m37s   10.128.1.170   master-2
glance-default-external-api-0   3/3     Running   0          71m     10.129.0.72    master-0
glance-default-external-api-1   3/3     Running   0          72m     10.128.1.152   master-2
glance-default-external-api-2   3/3     Running   0          72m     10.130.0.208   master-1
glance-default-internal-api-0   3/3     Running   0          72m     10.128.1.153   master-2
glance-default-internal-api-1   3/3     Running   0          72m     10.129.0.70    master-0
glance-default-internal-api-2   3/3     Running   0          72m     10.130.0.207   master-1
```

We can use the same approach to apply `nodeAffinity` to the `bzone` and `azone` glanceAPIs and observe Pods being scheduled on the nodes that belong to `zoneB` and `zoneA`.
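Since the `nodeAffinity` shown above keys on the `zone` node label, it is worth double-checking the labels applied at the beginning of this section before creating the remaining CRs. This is plain `oc` usage, nothing Glance specific:

```bash
# Print the nodes together with the node/zone labels assigned earlier;
# -L (--label-columns) adds one output column per label key
oc get nodes -L node -L zone
```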
We create the following CRs:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-bzone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneB
```

and:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-azone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneA
```

```
$ oc get topology
NAME
glance-azone-node-affinity
glance-bzone-node-affinity
glance-czone-node-affinity
glance-czone-spread-pods
glance-default-spread-pods
```

```
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-azone-edge-api-0         3/3     Running   0          39s     10.129.0.161   master-0
glance-azone-edge-api-1         3/3     Running   0          51s     10.129.0.160   master-0
glance-azone-edge-api-2         3/3     Running   0          64s     10.129.0.159   master-0
glance-bzone-edge-api-0         3/3     Running   0          15m     10.130.0.239   master-1
glance-bzone-edge-api-1         3/3     Running   0          15m     10.130.0.238   master-1
glance-bzone-edge-api-2         3/3     Running   0          15m     10.130.0.237   master-1
glance-czone-edge-api-0         3/3     Running   0          124m    10.128.1.172   master-2
glance-czone-edge-api-1         3/3     Running   0          124m    10.128.1.171   master-2
glance-czone-edge-api-2         3/3     Running   0          124m    10.128.1.170   master-2
glance-default-external-api-0   3/3     Running   0          3h12m   10.129.0.72    master-0
glance-default-external-api-1   3/3     Running   0          3h13m   10.128.1.152   master-2
glance-default-external-api-2   3/3     Running   0          3h13m   10.130.0.208   master-1
glance-default-internal-api-0   3/3     Running   0          3h13m   10.128.1.153   master-2
glance-default-internal-api-1   3/3     Running   0          3h13m   10.129.0.70    master-0
glance-default-internal-api-2   3/3     Running   0          3h14m   10.130.0.207   master-1
```

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|  (m0)  |          |  (m1)  |          |  (m2)  |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |                   |                   |
 |_ azone-edge-0     |_ bzone-edge-0     |_ czone-edge-0
 |_ azone-edge-1     |_ bzone-edge-1     |_ czone-edge-1
 |_ azone-edge-2     |_ bzone-edge-2     |_ czone-edge-2
```
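To tie the pieces together, the snippet below sketches how the `Topology` CRs created in this section could be referenced from the individual glanceAPIs through `topologyRef`. The `Glance` spec is heavily abbreviated and the exact field layout (the `glanceAPIs` keys, `type`, `replicas`, `topologyRef`) is an assumption based on the description above; it should be validated against the glance-operator API rather than taken as verified output.

```yaml
apiVersion: glance.openstack.org/v1beta1
kind: Glance
metadata:
  name: glance
  namespace: openstack
spec:
  # ... storage, database and service related fields omitted ...
  glanceAPIs:
    default:
      replicas: 3
      # assumed wiring: spread the default API across the three zones
      topologyRef:
        name: glance-default-spread-pods
    azone:
      type: edge
      replicas: 3
      # assumed wiring: pin the azone API to zoneA nodes
      topologyRef:
        name: glance-azone-node-affinity
    bzone:
      type: edge
      replicas: 3
      topologyRef:
        name: glance-bzone-node-affinity
    czone:
      type: edge
      replicas: 3
      topologyRef:
        name: glance-czone-node-affinity
```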