Last active
January 28, 2025 16:08
Revisions
fmount revised this gist
Jan 28, 2025. 1 changed file with 1 addition and 1 deletion.

@@ -166,7 +166,7 @@

Pods in `zoneC`. In other words, we **require** `czone` `Pods` to be scheduled on a node labeled `zoneC`, and this condition is stronger than the `preferredAntiAffinityRules` applied by default to the StatefulSet.
Note that in this case we do not need any `TopologySpreadConstraints`, because we are not interested in how the Pods are distributed: the goal is isolation between AZs.
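
To make the "required" node affinity above more concrete, here is a minimal Python sketch of how such a rule can be evaluated against node labels. It assumes the usual Kubernetes semantics (`nodeSelectorTerms` are ORed, the `matchExpressions` inside one term are ANDed). The function names are hypothetical and not part of any Kubernetes client library, and the `Gt`/`Lt` operators are deliberately left out.

```python
def expression_matches(labels, expr):
    """Evaluate one matchExpressions entry against a dict of node labels."""
    key = expr["key"]
    op = expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    # Gt and Lt exist for node selectors but are not covered by this sketch.
    raise ValueError(f"unsupported operator: {op}")


def node_matches(labels, node_selector_terms):
    """Return True if at least one term has all of its expressions satisfied."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        # An empty term is treated as matching nothing.
        if exprs and all(expression_matches(labels, e) for e in exprs):
            return True
    return False


# The term used by the czone node-affinity Topology in these notes.
ZONE_C_TERMS = [
    {"matchExpressions": [{"key": "zone", "operator": "In", "values": ["zoneC"]}]}
]

if __name__ == "__main__":
    print(node_matches({"node": "node2", "zone": "zoneC"}, ZONE_C_TERMS))
    print(node_matches({"node": "node0", "zone": "zoneA"}, ZONE_C_TERMS))
```

With the labels applied by the `oc label nodes` commands, only `master-2` (`zone=zoneC`) satisfies the term, which is why every `czone` Pod lands there regardless of the preferred anti-affinity.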
fmount revised this gist
Jan 28, 2025. 1 changed file with 45 additions and 5 deletions.

@@ -25,7 +25,7 @@ metadata:

      name: glance-default-spread-pods
      namespace: openstack
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule

@@ -34,22 +34,62 @@ spec:

            service: glance
    ```

In this case, we can observe that the `TopologySpreadConstraints` is consistent with the
`preferredAntiAffinityRules`: the system is stable and the scheduler is able to
schedule `Pods` in each zone.
The `topologySpreadConstraints` is applied to Pods that match the specified
`labelSelector`, regardless of whether they are replicas of the same Pod or
different Pods. A key point to understand is that in Kubernetes there is no real
concept of "replicas of the same Pod" at the scheduler level: what we commonly
call "replicas" are individual Pods that share the same labels and are
created by a controller (like a Deployment, StatefulSet, etc.). Each Pod is
scheduled independently, even if it was created as part of the same set of
replicas.
The `topologySpreadConstraints` would apply to **ALL** 6 pods because they all
match the `labelSelector` `service: glance`.
The scheduler would try to spread all these pods across the nodes according to
the constraint, treating them as a single group of 6 pods that need to be
spread, not as separate groups of 2 Pods with 3 replicas.

When we define a `TopologySpreadConstraints`, `maxSkew` plays an important role.
The Kubernetes scheduler computes the skew of a zone as follows:

```
skew = podsInZone - minPodsInAnyZone
```

Where:
- **podsInZone**: the number of matching pods in a specific zone
- **minPodsInAnyZone**: the global minimum, i.e. the lowest number of matching pods found in any eligible zone

For example, with `7 pods` and `3 zones`:

```
If distribution is [3,2,2], min is 2 and max skew is 3-2 = 1
If distribution is [4,2,1], min is 1 and max skew is 4-1 = 3
```

The `maxSkew` parameter represents **the maximum allowed difference between the
number of matching pods in a zone and the global minimum**. If we set `maxSkew: 1`:

```
- [3,2,2] would be allowed (skew 1 <= 1)
- [4,2,1] would not be allowed (skew 3 > 1)
```

In summary:
- Scheduler tries to minimize skew while respecting other constraints
- Higher maxSkew allows more uneven distribution
- Lower maxSkew enforces more balanced distribution
- maxSkew: 1 is a common choice to reach a reasonable balance
- whenUnsatisfiable: DoNotSchedule prevents exceeding maxSkew
- whenUnsatisfiable: ScheduleAnyway allows exceeding if necessary

To spread different types of Pods independently, we would need to use different
labels and different topology spread constraints for each type.
For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
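
As a rough illustration of the `maxSkew` check, the Python sketch below follows the definition given in the Kubernetes documentation, where the skew of a zone is its number of matching Pods minus the lowest count found in any eligible zone. It is a simplified model, not the scheduler's code: the helper names are hypothetical, and it ignores `minDomains`, node eligibility and every other scheduling constraint.

```python
def skew_per_zone(pods_per_zone):
    """Return each zone's skew: matching Pods in the zone minus the global minimum."""
    global_min = min(pods_per_zone.values())
    return {zone: count - global_min for zone, count in pods_per_zone.items()}


def placement_allowed(pods_per_zone, target_zone, max_skew=1):
    """Model whenUnsatisfiable: DoNotSchedule for one incoming Pod.

    The incoming Pod is counted in the target zone and compared against
    the current global minimum across all zones.
    """
    global_min = min(pods_per_zone.values())
    return pods_per_zone[target_zone] + 1 - global_min <= max_skew


if __name__ == "__main__":
    balanced = {"zoneA": 3, "zoneB": 2, "zoneC": 2}
    uneven = {"zoneA": 4, "zoneB": 2, "zoneC": 1}
    print(skew_per_zone(balanced))  # {'zoneA': 1, 'zoneB': 0, 'zoneC': 0}
    print(skew_per_zone(uneven))    # {'zoneA': 3, 'zoneB': 1, 'zoneC': 0}
    # An eighth Pod may go to zoneB or zoneC, but not to zoneA.
    for zone in balanced:
        print(zone, placement_allowed(balanced, zone))
```

With `maxSkew: 1`, a `[3,2,2]` layout is valid while `[4,2,1]` is not, and a new Pod added to `[3,2,2]` is only accepted in one of the two zones that currently hold 2 Pods.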
fmount revised this gist
Jan 8, 2025. 1 changed file with 1 addition and 1 deletion.

@@ -1,4 +1,4 @@

### Topology and Affinity notes

    ```bash
    oc label nodes master-0 node=node0 zone=zoneA --overwrite
fmount revised this gist
Jan 8, 2025. 1 changed file with 0 additions and 1 deletion.

@@ -250,7 +250,6 @@ glance-default-internal-api-2 3/3 Running 0 3h14m 10.130.0.20

    |        |      |        |      |        |
    +--------+      +--------+      +--------+
      |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
      |_ api-int-0    |_ api-int-1    |_ api-int-2
      |               |               |
      |_ azone-edge-0 |_ bzone-edge-0 |_ czone-edge-0
fmount revised this gist
Jan 8, 2025. 1 changed file with 13 additions and 0 deletions.

@@ -82,6 +82,19 @@ spec:

To achieve the above, a glanceAPI called `czone` has been created, and a
`topologyRef` called `glance-czone-spread-pods` has been applied.

```
+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
  |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
  |_ api-int-0    |_ api-int-1    |_ api-int-2
  |_ czone-edge-0 |_ czone-edge-1 |_ czone-edge-2
```

However, a quite common scenario is the exact opposite model, where `Pods` are
scheduled in a specific zone.
In our example, the idea is to take all `Pods` that belong to `glanceAPI:
fmount revised this gist
Jan 8, 2025. 1 changed file with 2 additions and 2 deletions.

@@ -22,7 +22,7 @@ oc label nodes master-2 node=node2 zone=zoneC --overwrite

    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-default-spread-pods
      namespace: openstack
    spec:
      topologySpreadConstraint:

@@ -203,7 +203,7 @@

    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default-spread-pods
    ```
fmount revised this gist
Jan 8, 2025. 1 changed file with 5 additions and 6 deletions.

@@ -65,7 +65,7 @@ across the existing kubernetes nodes.

    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-czone-spread-pods
      namespace: openstack
    spec:
      topologySpreadConstraint:

@@ -81,8 +81,7 @@ spec:

    ```

To achieve the above, a glanceAPI called `czone` has been created, and a
`topologyRef` called `glance-czone-spread-pods` has been applied.
However, a quite common scenario is the exact opposite model, where `Pods` are
scheduled in a specific zone.
In our example, the idea is to take all `Pods` that belong to `glanceAPI:

@@ -182,7 +181,7 @@ and:

    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-azone-node-affinity
      namespace: openstack
    spec:
      affinity:

@@ -193,7 +192,7 @@ spec:

              - key: zone
                operator: In
                values:
                - zoneA
    ```

    ```

@@ -202,8 +201,8 @@ $ oc get topology

    NAME
    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default
    ```
fmount created this gist
Jan 8, 2025

@@ -0,0 +1,248 @@

### Layout 0

```bash
oc label nodes master-0 node=node0 zone=zoneA --overwrite
oc label nodes master-1 node=node1 zone=zoneB --overwrite
oc label nodes master-2 node=node2 zone=zoneC --overwrite
```

```
+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
  |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
  |_ api-int-0    |_ api-int-1    |_ api-int-2
```

```yaml
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default
  namespace: openstack
spec:
  topologySpreadConstraint:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        service: glance
```

In this case, we can observe that the `TopologySpreadConstraint` is consistent with the
`preferredAntiAffinityRules`: the system is stable and the scheduler is able to
schedule `Pods` in each zone.
The `topologySpreadConstraint` is applied to Pods that match the specified
`labelSelector`, regardless of whether they are replicas of the same Pod or
different Pods. A key point to understand is that in Kubernetes there is no real
concept of "replicas of the same Pod" at the scheduler level: what we commonly
call "replicas" are individual Pods that share the same labels and are
created by a controller (like a Deployment, StatefulSet, etc.). Each Pod is
scheduled independently, even if it was created as part of the same set of
replicas.
The `topologySpreadConstraint` would apply to **ALL** 6 pods because they all
match the `labelSelector` `service: glance`.
The scheduler would try to spread all these pods across the nodes according to
the constraint, treating them as a single group of 6 pods that need to be
spread, not as separate groups of 2 Pods with 3 replicas.
To spread different types of Pods independently, we would need to use different
labels and different topology spread constraints for each type.
For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
called `czone`), and spread the resulting Pods across `zoneA`, `zoneB` and
`zoneC`.
To select only the `czone` glanceAPI, we rely on `matchExpressions`, which fits
well in a context where we do not necessarily propagate the same label keys from
the top-level CR to the resulting Pods.
The following example selects the `glance-czone-edge-api` Pods and spreads them
across the existing Kubernetes nodes.

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone
  namespace: openstack
spec:
  topologySpreadConstraint:
  - labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
```

To achieve the above, a glanceAPI called `czone` has been created, and a
`topologyRef` called `glance-czone` has been applied.
However, a quite common scenario is the exact opposite model, where `Pods` are
scheduled in a specific zone.
In our example, the idea is to take all `Pods` that belong to `glanceAPI:
glance-czone-edge` and schedule all of them in `ZoneC` (which corresponds to
`master-2` in a three-node environment).
To achieve this goal, we create the following topology CR:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneC
```

The above establishes a **nodeAffinity**, to make sure we schedule `czone` glanceAPI
Pods in `zoneC`. In other words, we **require** `czone` `Pods` to be
scheduled on a node labeled `zoneC`, and this condition is
stronger than the `preferredAntiAffinityRules` applied by default to the
StatefulSet.
Note that in this case we do not need any `TopologySpreadConstraint`, because
we are not interested in how the Pods are distributed: the goal is
isolation between AZs.

```
+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|        |      |        |      |        |
+--------+      +--------+      +--------+
  |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
  |_ api-int-0    |_ api-int-1    |_ api-int-2
                                  |_ czone-edge-0
                                  |_ czone-edge-1
                                  |_ czone-edge-2
```

The picture above can be checked with the following:

```
Every 2.0s: oc get pods -l service=glance -o wide

NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-czone-edge-api-0         3/3     Running   0          3m13s   10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          3m25s   10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          3m37s   10.128.1.170   master-2   <none>           <none>
glance-default-external-api-0   3/3     Running   0          71m     10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          72m     10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          72m     10.130.0.208   master-1   <none>           <none>
glance-default-internal-api-0   3/3     Running   0          72m     10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          72m     10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          72m     10.130.0.207   master-1   <none>           <none>
```

We can use the same approach to apply `nodeAffinity` to `bzone` and `azone`
glanceAPIs and observe Pods being scheduled on the nodes that belong to `zoneB`
and `zoneA`.
We create the following CRs:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-bzone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneB
```

and:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-azone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneA
```

```
$ oc get topology
NAME
glance-azone-node-affinity
glance-bzone-node-affinity
glance-czone
glance-czone-node-affinity
glance-default
```

```
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-azone-edge-api-0         3/3     Running   0          39s     10.129.0.161   master-0   <none>           <none>
glance-azone-edge-api-1         3/3     Running   0          51s     10.129.0.160   master-0   <none>           <none>
glance-azone-edge-api-2         3/3     Running   0          64s     10.129.0.159   master-0   <none>           <none>
glance-bzone-edge-api-0         3/3     Running   0          15m     10.130.0.239   master-1   <none>           <none>
glance-bzone-edge-api-1         3/3     Running   0          15m     10.130.0.238   master-1   <none>           <none>
glance-bzone-edge-api-2         3/3     Running   0          15m     10.130.0.237   master-1   <none>           <none>
glance-czone-edge-api-0         3/3     Running   0          124m    10.128.1.172   master-2   <none>           <none>
glance-czone-edge-api-1         3/3     Running   0          124m    10.128.1.171   master-2   <none>           <none>
glance-czone-edge-api-2         3/3     Running   0          124m    10.128.1.170   master-2   <none>           <none>
glance-default-external-api-0   3/3     Running   0          3h12m   10.129.0.72    master-0   <none>           <none>
glance-default-external-api-1   3/3     Running   0          3h13m   10.128.1.152   master-2   <none>           <none>
glance-default-external-api-2   3/3     Running   0          3h13m   10.130.0.208   master-1   <none>           <none>
glance-default-internal-api-0   3/3     Running   0          3h13m   10.128.1.153   master-2   <none>           <none>
glance-default-internal-api-1   3/3     Running   0          3h13m   10.129.0.70    master-0   <none>           <none>
glance-default-internal-api-2   3/3     Running   0          3h14m   10.130.0.207   master-1   <none>           <none>
```

```
+--------+      +--------+      +--------+
|        |      |        |      |        |
| ZONE A |      | ZONE B |      | ZONE C |
|  (m0)  |      |  (m1)  |      |  (m2)  |
|        |      |        |      |        |
+--------+      +--------+      +--------+
  |_ api-ext-0    |_ api-ext-1    |_ api-ext-2
  |               |               |
  |_ api-int-0    |_ api-int-1    |_ api-int-2
  |               |               |
  |_ azone-edge-0 |_ bzone-edge-0 |_ czone-edge-0
  |_ azone-edge-1 |_ bzone-edge-1 |_ czone-edge-1
  |_ azone-edge-2 |_ bzone-edge-2 |_ czone-edge-2
```
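
The final layout can also be checked programmatically. The Python sketch below parses the plain-text output of `oc get pods -o wide` and reports any edge Pod that is not running in the zone named in its own prefix. It is only a sketch under stated assumptions: the helper names are hypothetical, the node-to-zone mapping is copied from the `oc label nodes` commands in these notes, and the whitespace-based parsing assumes the `RESTARTS` column is a single token, as in the output shown here.

```python
from collections import defaultdict

# Assumption: mapping taken from the `oc label nodes` commands in these notes.
NODE_ZONE = {"master-0": "zoneA", "master-1": "zoneB", "master-2": "zoneC"}

SAMPLE = """
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
glance-azone-edge-api-0 3/3 Running 0 39s 10.129.0.161 master-0 <none> <none>
glance-bzone-edge-api-0 3/3 Running 0 15m 10.130.0.239 master-1 <none> <none>
glance-czone-edge-api-0 3/3 Running 0 124m 10.128.1.172 master-2 <none> <none>
glance-default-external-api-1 3/3 Running 0 3h13m 10.128.1.152 master-2 <none> <none>
"""


def parse_pods(wide_output):
    """Parse `oc get pods -o wide` text into {pod_name: node_name}."""
    pods = {}
    for line in wide_output.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        pods[fields[0]] = fields[6]  # NODE is the seventh column
    return pods


def pods_per_zone(pods, node_zone=NODE_ZONE):
    """Group Pod names by the zone of the node they run on."""
    zones = defaultdict(list)
    for name, node in pods.items():
        zones[node_zone[node]].append(name)
    return dict(zones)


def misplaced_edge_pods(pods, node_zone=NODE_ZONE):
    """Return edge Pods whose zone differs from the one named in the Pod name."""
    misplaced = []
    for name, node in pods.items():
        parts = name.split("-")  # e.g. glance-czone-edge-api-0
        if len(parts) >= 3 and parts[2] == "edge":
            expected = "zone" + parts[1][0].upper()  # czone -> zoneC
            if node_zone[node] != expected:
                misplaced.append(name)
    return misplaced


if __name__ == "__main__":
    pods = parse_pods(SAMPLE)
    print(pods_per_zone(pods))
    print(misplaced_edge_pods(pods))
```

An empty list from `misplaced_edge_pods` means that every `azone`, `bzone` and `czone` edge Pod is isolated in its own zone, while the default external and internal API Pods remain free to spread across all three.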