
@fmount
Last active January 28, 2025 16:08

Revisions

  1. fmount revised this gist Jan 28, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion affinity.md
    @@ -166,7 +166,7 @@ Pods in `zoneC`.
    In other words, we **require** `czone` `Pods` to be scheduled on a node that
    has `zoneC` as label, and this condition is stronger than the
    `preferredAntiAffinityRules` applied by default to the statefulSet.
    Note that in this case we do not need any `TopologySpreadConstraint`, because
    Note that in this case we do not need any `TopologySpreadConstraints`, because
    we're not really interested in the Pods distribution, but we're trying to
    achieve isolation between AZs.

  2. fmount revised this gist Jan 28, 2025. 1 changed file with 45 additions and 5 deletions.
    50 changes: 45 additions & 5 deletions affinity.md
    @@ -25,7 +25,7 @@ metadata:
    name: glance-default-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    topologySpreadConstraints:
    - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    @@ -34,22 +34,62 @@ spec:
    service: glance
    ```
    In this case, we can observe how the `TopologySpreadConstraint` matches with
    In this case, we can observe how the `TopologySpreadConstraints` matches with
    the `preferredAntiAffinityRules`: the system is stable and the scheduler is
    able to schedule `Pods` in each zone.
    The topologySpreadConstraint is applied to Pods that match the specified
    labelSelector, regardless of whether they are replicas of the same Pod or
    The `topologySpreadConstraints` is applied to Pods that match the specified
    `labelSelector`, regardless of whether they are replicas of the same Pod or
    different Pods. A key point to understand is that in Kubernetes, there's no
    real concept of "replicas of the same Pod" at the scheduler level - what we
    commonly call "replicas" are actually individual Pods that share the same
    labels and are created by a controller (like a Deployment, StatefulSet, etc.).
    Each Pod is scheduled independently, even if they were created as part of the
    same set of replicas.
    The `topologySpreadConstraint` would apply to **ALL** 6 pods because they all
    The `topologySpreadConstraints` would apply to **ALL** 6 pods because they all
    match the `labelSelector` `service: glance`.
    The scheduler would try to spread all these pods across the nodes according to
    the constraint, treating them as a single group of 6 pods that need to be
    spread, not as separate groups of 2 Pods with 3 replicas.
    When we define `TopologySpreadConstraints`, `maxSkew` plays an important role.
    In general, the Kubernetes scheduler calculates pod spreading through this `maxSkew`
    parameter as follows:

    ```
    skew = max(|actualPodsInZone - avgPodsPerZone|)
    ```

    Where:

    - **actualPodsInZone**: the number of pods in a specific zone
    - **avgPodsPerZone**: total pods / number of zones

    For example, with `7 pods` and `3 zones`:

    ```
    avgPodsPerZone = 7/3 ≈ 2.33
    If distribution is [3,2,2], max skew is |3-2.33| = 0.67
    If distribution is [4,2,1], max skew is |4-2.33| = 1.67
    ```

    The `maxSkew` parameter represents **the maximum allowed difference from the average**.

    If we set `maxSkew: 1`:

    ```
    - [3,2,2] would be allowed (skew 0.67 < 1)
    - [4,2,1] would not be allowed (skew 1.67 > 1)
    ```

    In summary:

    - Scheduler tries to minimize skew while respecting other constraints
    - Higher maxSkew allows more uneven distribution
    - Lower maxSkew enforces more balanced distribution
    - maxSkew: 1 is a common choice to reach a reasonable balance
    - whenUnsatisfiable: DoNotSchedule prevents exceeding maxSkew
    - whenUnsatisfiable: ScheduleAnyway allows exceeding if necessary
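
    To make the arithmetic above concrete, here is a minimal shell sketch that
    computes the skew of a candidate distribution, following the average-based
    formula described in this section (the `max_skew` helper is an illustrative
    name, not part of any tooling, and this is not the scheduler's own code):

    ```bash
    # Illustrative helper: max deviation from the average for a Pod distribution, e.g. "3 2 2"
    max_skew() {
      awk -v dist="$*" 'BEGIN {
        n = split(dist, pods, " "); total = 0
        for (i = 1; i <= n; i++) total += pods[i]
        avg = total / n; max = 0
        for (i = 1; i <= n; i++) {
          d = pods[i] - avg; if (d < 0) d = -d
          if (d > max) max = d
        }
        printf "%.2f\n", max
      }'
    }

    max_skew 3 2 2   # 0.67 -> fine with maxSkew: 1
    max_skew 4 2 1   # 1.67 -> violates maxSkew: 1 (DoNotSchedule would block it)
    ```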


    To spread different types of Pods independently, we would need to use different
    labels and different topology spread constraints for each type.
    For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
  3. fmount revised this gist Jan 8, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion affinity.md
    @@ -1,4 +1,4 @@
    ### Layout 0
    ### Topology and Affinity notes

    ```bash
    oc label nodes master-0 node=node0 zone=zoneA --overwrite
  4. fmount revised this gist Jan 8, 2025. 1 changed file with 0 additions and 1 deletion.
    1 change: 0 additions & 1 deletion affinity.md
    @@ -250,7 +250,6 @@ glance-default-internal-api-2 3/3 Running 0 3h14m 10.130.0.20
    | | | | | |
    +--------+ +--------+ +--------+
    |_ api-ext-0 |_ api-ext-1 |_ api-ext-2
    | | |
    |_ api-int-0 |_ api-int-1 |_ api-int-2
    | | |
    |_ azone-edge-0 |_ bzone-edge-0 |_ czone-edge-0
  5. fmount revised this gist Jan 8, 2025. 1 changed file with 13 additions and 0 deletions.
    13 changes: 13 additions & 0 deletions affinity.md
    @@ -82,6 +82,19 @@ spec:

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone-spread-pods` has been applied.

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
     |_ czone-edge-0  |_ czone-edge-1  |_ czone-edge-2

    ```
    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
  6. fmount revised this gist Jan 8, 2025. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions affinity.md
    @@ -22,7 +22,7 @@ oc label nodes master-2 node=node2 zone=zoneC --overwrite
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-default
    name: glance-default-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    @@ -203,7 +203,7 @@ glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default
    glance-default-spread-pods
    ```


  7. fmount revised this gist Jan 8, 2025. 1 changed file with 5 additions and 6 deletions.
    11 changes: 5 additions & 6 deletions affinity.md
    @@ -65,7 +65,7 @@ across the existing kubernetes nodes.
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-czone
    name: glance-czone-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    @@ -81,8 +81,7 @@ spec:
    ```

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone` has been applied.

    called `glance-czone-spread-pods` has been applied.
    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
    @@ -182,7 +181,7 @@ and:
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-bzone-node-affinity
    name: glance-azone-node-affinity
    namespace: openstack
    spec:
    affinity:
    @@ -193,7 +192,7 @@ spec:
    - key: zone
    operator: In
    values:
    - zoneB
    - zoneA
    ```
    ```
    @@ -202,8 +201,8 @@ $ oc get topology
    NAME
    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default
    ```

  8. fmount created this gist Jan 8, 2025.
    248 changes: 248 additions & 0 deletions affinity.md
    @@ -0,0 +1,248 @@
    ### Layout 0

    ```bash
    oc label nodes master-0 node=node0 zone=zoneA --overwrite
    oc label nodes master-1 node=node1 zone=zoneB --overwrite
    oc label nodes master-2 node=node2 zone=zoneC --overwrite
    ```
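
    As a quick sanity check of the labels applied above, the standard
    `kubectl`-style `--label-columns` (`-L`) option of `oc get` can be used
    (not required by the workflow, just a verification step):

    ```bash
    # Show the node/zone labels as extra columns to confirm the zone layout
    oc get nodes -L node,zone
    ```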

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
    ```

    ```yaml
    ---
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-default
      namespace: openstack
    spec:
      topologySpreadConstraint:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            service: glance
    ```
    In this case, we can observe how the `TopologySpreadConstraint` matches the
    `preferredAntiAffinityRules`: the system is stable and the scheduler is able to
    schedule `Pods` in each zone.
    The topologySpreadConstraint is applied to Pods that match the specified
    labelSelector, regardless of whether they are replicas of the same Pod or
    different Pods. A key point to understand is that in Kubernetes, there's no
    real concept of "replicas of the same Pod" at the scheduler level - what we
    commonly call "replicas" are actually individual Pods that share the same
    labels and are created by a controller (like a Deployment, StatefulSet, etc.).
    Each Pod is scheduled independently, even if they were created as part of the
    same set of replicas.
    The `topologySpreadConstraint` would apply to **ALL** 6 pods because they all
    match the `labelSelector` `service: glance`.
    The scheduler would try to spread all these pods across the nodes according to
    the constraint, treating them as a single group of 6 pods that need to be
    spread, not as separate groups of 2 Pods with 3 replicas.
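
    A quick way to see this grouping on a live cluster is to count the `service: glance`
    Pods per node (with the zone labels above, one node corresponds to one zone; the
    `awk` column assumes the `-o wide` layout shown in the listings later in this document):

    ```bash
    # Count glance Pods per node: NODE is the 7th column of `-o wide` output
    oc get pods -l service=glance -o wide --no-headers | awk '{print $7}' | sort | uniq -c
    ```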
    To spread different types of Pods independently, we would need to use different
    labels and different topology spread constraints for each type.
    For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
    called `czone`), and spread the resulting Pods across `zoneA`, `zoneB` and `zoneC`.
    To select only the `czone` glanceAPI, we rely on `matchExpressions`, which fits
    well in a context where we do not necessarily propagate the same label keys from
    the top level CR to the resulting Pods.
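
    Before writing the selector, it can help to double-check which labels the
    resulting Pods actually carry, for example with the standard `--show-labels`
    option:

    ```bash
    # Dump Pod labels to confirm the glanceAPI key/value we want to match on
    oc get pods -l service=glance --show-labels
    ```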
    The following example selects the `glance-czone-edge-api` Pods and spreads them
    across the existing Kubernetes nodes.


    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-czone
      namespace: openstack
    spec:
      topologySpreadConstraint:
      - labelSelector:
          matchExpressions:
          - key: glanceAPI
            operator: In
            values:
            - czone
        maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
    ```

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone` has been applied.

    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
    glance-czone-edge` and schedule all of them in `ZoneC` (which corresponds to
    `master-2` in a three-node environment).
    To achieve this goal, we create the following topology CR:


    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-czone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneC
    ```

    The above establishes a **nodeAffinity**, to make sure we schedule `czone` glanceAPI
    Pods in `zoneC`.
    In other words, we **require** `czone` `Pods` to be scheduled on a node that
    has `zoneC` as label, and this condition is stronger than the
    `preferredAntiAffinityRules` applied by default to the statefulSet.
    Note that in this case we do not need any `TopologySpreadConstraint`, because
    we're not really interested in the Pods distribution, but we're trying to
    achieve isolation between AZs.
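
    To double-check what actually lands on the Pods, the rendered affinity stanza
    can be inspected directly; for example (the Pod name is taken from the listing
    below, and the jsonpath output is the raw affinity object):

    ```bash
    # Inspect the affinity rendered on one of the czone Pods: it should contain both
    # the required nodeAffinity above and the default preferred podAntiAffinity
    oc get pod glance-czone-edge-api-0 -o jsonpath='{.spec.affinity}'
    ```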


    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
                                       |_ czone-edge-0
                                       |_ czone-edge-1
                                       |_ czone-edge-2

    ```
    The picture above can be checked with the following:
    ```
    Every 2.0s: oc get pods -l service=glance -o wide

    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    glance-czone-edge-api-0 3/3 Running 0 3m13s 10.128.1.172 master-2 <none> <none>
    glance-czone-edge-api-1 3/3 Running 0 3m25s 10.128.1.171 master-2 <none> <none>
    glance-czone-edge-api-2 3/3 Running 0 3m37s 10.128.1.170 master-2 <none> <none>

    glance-default-external-api-0 3/3 Running 0 71m 10.129.0.72 master-0 <none> <none>
    glance-default-external-api-1 3/3 Running 0 72m 10.128.1.152 master-2 <none> <none>
    glance-default-external-api-2 3/3 Running 0 72m 10.130.0.208 master-1 <none> <none>

    glance-default-internal-api-0 3/3 Running 0 72m 10.128.1.153 master-2 <none> <none>
    glance-default-internal-api-1 3/3 Running 0 72m 10.129.0.70 master-0 <none> <none>
    glance-default-internal-api-2 3/3 Running 0 72m 10.130.0.207 master-1 <none> <none>
    ```
    We can use the same approach to apply `nodeAffinity` to `bzone` and `azone`
    glanceAPIs and observe Pods being scheduled on the nodes that belong to
    `zoneB` and `zoneA`.
    We create the following CRs:
    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-bzone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneB
    ```

    and:

    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-bzone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneB
    ```
    ```
    $ oc get topology

    NAME
    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone
    glance-czone-node-affinity
    glance-default
    ```


    ```
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    glance-azone-edge-api-0 3/3 Running 0 39s 10.129.0.161 master-0 <none> <none>
    glance-azone-edge-api-1 3/3 Running 0 51s 10.129.0.160 master-0 <none> <none>
    glance-azone-edge-api-2 3/3 Running 0 64s 10.129.0.159 master-0 <none> <none>
    glance-bzone-edge-api-0 3/3 Running 0 15m 10.130.0.239 master-1 <none> <none>
    glance-bzone-edge-api-1 3/3 Running 0 15m 10.130.0.238 master-1 <none> <none>
    glance-bzone-edge-api-2 3/3 Running 0 15m 10.130.0.237 master-1 <none> <none>
    glance-czone-edge-api-0 3/3 Running 0 124m 10.128.1.172 master-2 <none> <none>
    glance-czone-edge-api-1 3/3 Running 0 124m 10.128.1.171 master-2 <none> <none>
    glance-czone-edge-api-2 3/3 Running 0 124m 10.128.1.170 master-2 <none> <none>
    glance-default-external-api-0 3/3 Running 0 3h12m 10.129.0.72 master-0 <none> <none>
    glance-default-external-api-1 3/3 Running 0 3h13m 10.128.1.152 master-2 <none> <none>
    glance-default-external-api-2 3/3 Running 0 3h13m 10.130.0.208 master-1 <none> <none>
    glance-default-internal-api-0 3/3 Running 0 3h13m 10.128.1.153 master-2 <none> <none>
    glance-default-internal-api-1 3/3 Running 0 3h13m 10.129.0.70 master-0 <none> <none>
    glance-default-internal-api-2 3/3 Running 0 3h14m 10.130.0.207 master-1 <none> <none>
    ```

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |  (m0)  |       |  (m1)  |       |  (m2)  |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |                |                |
     |_ api-int-0     |_ api-int-1     |_ api-int-2
     |                |                |
     |_ azone-edge-0  |_ bzone-edge-0  |_ czone-edge-0
     |_ azone-edge-1  |_ bzone-edge-1  |_ czone-edge-1
     |_ azone-edge-2  |_ bzone-edge-2  |_ czone-edge-2
    ```