
@fmount
Last active January 28, 2025 16:08

Revisions

  1. fmount revised this gist Jan 28, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion affinity.md
    @@ -166,7 +166,7 @@ Pods in `zoneC`.
    In other words, we **require** `czone` `Pods` to be scheduled on a node that
    has `zoneC` as label, and this condition is stronger than the
    `preferredAntiAffinityRules` applied by default to the statefulSet.
    Note that in this case we do not need any `TopologySpreadConstraint`, because
    Note that in this case we do not need any `TopologySpreadConstraints`, because
    we're not really interested in the Pods distribution, but we're trying to
    achieve isolation between AZs.

  2. fmount revised this gist Jan 28, 2025. 1 changed file with 45 additions and 5 deletions.
    50 changes: 45 additions & 5 deletions affinity.md
    @@ -25,7 +25,7 @@ metadata:
    name: glance-default-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    topologySpreadConstraints:
    - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    @@ -34,22 +34,62 @@ spec:
    service: glance
    ```
    In this case, we can observe how the `TopologySpreadConstraint` matches with
    In this case, we can observe how the `TopologySpreadConstraints` matches with
    the `preferredAntiAffinityRules`: the system is stable and the scheduler is
    able to schedule `Pods` in each zone.
    The topologySpreadConstraint is applied to Pods that match the specified
    labelSelector, regardless of whether they are replicas of the same Pod or
    The `topologySpreadConstraints` is applied to Pods that match the specified
    `labelSelector`, regardless of whether they are replicas of the same Pod or
    different Pods. A key point to understand is that in Kubernetes, there's no
    real concept of "replicas of the same Pod" at the scheduler level - what we
    commonly call "replicas" are actually individual Pods that share the same
    labels and are created by a controller (like a Deployment, StatefulSet, etc.).
    Each Pod is scheduled independently, even if they were created as part of the
    same set of replicas.
    The `topologySpreadConstraint` would apply to **ALL** 6 pods because they all
    The `topologySpreadConstraints` would apply to **ALL** 6 pods because they all
    match the `labelSelector` `service: glance`.
    The scheduler would try to spread all these pods across the nodes according to
    the constraint, treating them as a single group of 6 pods that need to be
    spread, not as separate groups of 2 Pods with 3 replicas.
    When we define `TopologySpreadConstraints`, `maxSkew` plays an important role.
    In general, the Kubernetes scheduler calculates pod spreading through this `maxSkew`
    parameter as follows:

    ```
    skew = max(|actualPodsInZone - avgPodsPerZone|)
    ```

    Where:

    - **actualPodsInZone**: the number of pods in a specific zone
    - **avgPodsPerZone**: total pods / number of zones

    For example, with `7 pods` and `3 zones`:

    ```
    avgPodsPerZone = 7/3 ≈ 2.33
    If distribution is [3,2,2], max skew is |3-2.33| = 0.67
    If distribution is [4,2,1], max skew is |4-2.33| = 1.67
    ```

    The `maxSkew` parameter represents **the maximum allowed difference from the average**.

    If we set `maxSkew: 1`:

    ```
    - [3,2,2] would be allowed (skew 0.67 < 1)
    - [4,2,1] would not be allowed (skew 1.67 > 1)
    ```

    In summary:

    - Scheduler tries to minimize skew while respecting other constraints
    - Higher maxSkew allows more uneven distribution
    - Lower maxSkew enforces more balanced distribution
    - maxSkew: 1 is a common choice to reach a reasonable balance
    - whenUnsatisfiable: DoNotSchedule prevents exceeding maxSkew
    - whenUnsatisfiable: ScheduleAnyway allows exceeding if necessary
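
    To make the arithmetic above concrete, here is a minimal shell sketch that
    computes the skew of a candidate distribution, following the average-based
    formula described in this section (the `max_skew` helper is an illustrative
    name, not part of any tooling, and this is not the scheduler's own code):

    ```bash
    # Illustrative helper: max deviation from the average for a Pod distribution, e.g. "3 2 2"
    max_skew() {
      awk -v dist="$*" 'BEGIN {
        n = split(dist, pods, " "); total = 0
        for (i = 1; i <= n; i++) total += pods[i]
        avg = total / n; max = 0
        for (i = 1; i <= n; i++) {
          d = pods[i] - avg; if (d < 0) d = -d
          if (d > max) max = d
        }
        printf "%.2f\n", max
      }'
    }

    max_skew 3 2 2   # 0.67 -> fine with maxSkew: 1
    max_skew 4 2 1   # 1.67 -> violates maxSkew: 1 (DoNotSchedule would block it)
    ```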


    To spread different types of Pods independently, we would need to use different
    labels and different topology spread constraints for each type.
    For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
  3. fmount revised this gist Jan 8, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion affinity.md
    @@ -1,4 +1,4 @@
    ### Layout 0
    ### Topology and Affinity notes

    ```bash
    oc label nodes master-0 node=node0 zone=zoneA --overwrite
  4. fmount revised this gist Jan 8, 2025. 1 changed file with 0 additions and 1 deletion.
    1 change: 0 additions & 1 deletion affinity.md
    @@ -250,7 +250,6 @@ glance-default-internal-api-2 3/3 Running 0 3h14m 10.130.0.20
    | | | | | |
    +--------+ +--------+ +--------+
    |_ api-ext-0 |_ api-ext-1 |_ api-ext-2
    | | |
    |_ api-int-0 |_ api-int-1 |_ api-int-2
    | | |
    |_ azone-edge-0 |_ bzone-edge-0 |_ czone-edge-0
  5. fmount revised this gist Jan 8, 2025. 1 changed file with 13 additions and 0 deletions.
    13 changes: 13 additions & 0 deletions affinity.md
    @@ -82,6 +82,19 @@ spec:

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone-spread-pods` has been applied.

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
     |_ czone-edge-0  |_ czone-edge-1  |_ czone-edge-2

    ```
    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
  6. fmount revised this gist Jan 8, 2025. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions affinity.md
    @@ -22,7 +22,7 @@ oc label nodes master-2 node=node2 zone=zoneC --overwrite
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-default
    name: glance-default-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    @@ -203,7 +203,7 @@ glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default
    glance-default-spread-pods
    ```


  7. fmount revised this gist Jan 8, 2025. 1 changed file with 5 additions and 6 deletions.
    11 changes: 5 additions & 6 deletions affinity.md
    @@ -65,7 +65,7 @@ across the existing kubernetes nodes.
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-czone
    name: glance-czone-spread-pods
    namespace: openstack
    spec:
    topologySpreadConstraint:
    @@ -81,8 +81,7 @@ spec:
    ```

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone` has been applied.

    called `glance-czone-spread-pods` has been applied.
    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
    @@ -182,7 +181,7 @@ and:
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
    name: glance-bzone-node-affinity
    name: glance-azone-node-affinity
    namespace: openstack
    spec:
    affinity:
    @@ -193,7 +192,7 @@ spec:
    - key: zone
    operator: In
    values:
    - zoneB
    - zoneA
    ```
    ```
    @@ -202,8 +201,8 @@ $ oc get topology
    NAME
    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone
    glance-czone-node-affinity
    glance-czone-spread-pods
    glance-default
    ```

  8. fmount created this gist Jan 8, 2025.
    248 changes: 248 additions & 0 deletions affinity.md
    @@ -0,0 +1,248 @@
    ### Layout 0

    ```bash
    oc label nodes master-0 node=node0 zone=zoneA --overwrite
    oc label nodes master-1 node=node1 zone=zoneB --overwrite
    oc label nodes master-2 node=node2 zone=zoneC --overwrite
    ```
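
    As a quick sanity check of the labels applied above, the standard
    `kubectl`-style `--label-columns` (`-L`) option of `oc get` can be used
    (not required by the workflow, just a verification step):

    ```bash
    # Show the node/zone labels as extra columns to confirm the zone layout
    oc get nodes -L node,zone
    ```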

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
    ```

    ```yaml
    ---
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-default
      namespace: openstack
    spec:
      topologySpreadConstraint:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            service: glance
    ```
    In this case, we can observe how the `TopologySpreadConstraint` matches the
    `preferredAntiAffinityRules`: the system is stable and the scheduler is able to
    schedule `Pods` in each zone.
    The topologySpreadConstraint is applied to Pods that match the specified
    labelSelector, regardless of whether they are replicas of the same Pod or
    different Pods. A key point to understand is that in Kubernetes, there's no
    real concept of "replicas of the same Pod" at the scheduler level - what we
    commonly call "replicas" are actually individual Pods that share the same
    labels and are created by a controller (like a Deployment, StatefulSet, etc.).
    Each Pod is scheduled independently, even if they were created as part of the
    same set of replicas.
    The `topologySpreadConstraint` would apply to **ALL** 6 pods because they all
    match the `labelSelector` `service: glance`.
    The scheduler would try to spread all these pods across the nodes according to
    the constraint, treating them as a single group of 6 pods that need to be
    spread, not as separate groups of 2 Pods with 3 replicas.
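
    A quick way to see this grouping on a live cluster is to count the `service: glance`
    Pods per node (with the zone labels above, one node corresponds to one zone; the
    `awk` column assumes the `-o wide` layout shown in the listings later in this document):

    ```bash
    # Count glance Pods per node: NODE is the 7th column of `-o wide` output
    oc get pods -l service=glance -o wide --no-headers | awk '{print $7}' | sort | uniq -c
    ```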
    To spread different types of Pods independently, we would need to use different
    labels and different topology spread constraints for each type.
    For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`,
    called `czone`), and spread the resulting Pods across `zoneA`, `zoneB` and `zoneC`.
    To select only the `czone` glanceAPI, we rely on `matchExpressions`, which fits
    well in a context where we do not necessarily propagate the same label keys from
    the top level CR to the resulting Pods.
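
    Before writing the selector, it can help to double-check which labels the
    resulting Pods actually carry, for example with the standard `--show-labels`
    option:

    ```bash
    # Dump Pod labels to confirm the glanceAPI key/value we want to match on
    oc get pods -l service=glance --show-labels
    ```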
    The following example selects the `glance-czone-edge-api` Pods and spreads them
    across the existing Kubernetes nodes.


    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-czone
      namespace: openstack
    spec:
      topologySpreadConstraint:
      - labelSelector:
          matchExpressions:
          - key: glanceAPI
            operator: In
            values:
            - czone
        maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
    ```

    To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef`
    called `glance-czone` has been applied.

    However, a quite common scenario is to reach the exact opposite model, where
    `Pods` are scheduled in a particular zone.
    In our example, the idea is to take all `Pods` that belong to `glanceAPI:
    glance-czone-edge` and schedule all of them in `ZoneC` (which corresponds to
    `master-2` in a three-node environment).
    To achieve this goal, we create the following topology CR:


    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-czone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneC
    ```

    The above establishes a **nodeAffinity**, to make sure we schedule `czone` glanceAPI
    Pods in `zoneC`.
    In other words, we **require** `czone` `Pods` to be scheduled on a node that
    has `zoneC` as label, and this condition is stronger than the
    `preferredAntiAffinityRules` applied by default to the statefulSet.
    Note that in this case we do not need any `TopologySpreadConstraint`, because
    we're not really interested in the Pods distribution, but we're trying to
    achieve isolation between AZs.
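
    To double-check what actually lands on the Pods, the rendered affinity stanza
    can be inspected directly; for example (the Pod name is taken from the listing
    below, and the jsonpath output is the raw affinity object):

    ```bash
    # Inspect the affinity rendered on one of the czone Pods: it should contain both
    # the required nodeAffinity above and the default preferred podAntiAffinity
    oc get pod glance-czone-edge-api-0 -o jsonpath='{.spec.affinity}'
    ```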


    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |_ api-int-0     |_ api-int-1     |_ api-int-2
                                       |_ czone-edge-0
                                       |_ czone-edge-1
                                       |_ czone-edge-2

    ```
    The picture above can be checked with the following:
    ```
    Every 2.0s: oc get pods -l service=glance -o wide

    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    glance-czone-edge-api-0 3/3 Running 0 3m13s 10.128.1.172 master-2 <none> <none>
    glance-czone-edge-api-1 3/3 Running 0 3m25s 10.128.1.171 master-2 <none> <none>
    glance-czone-edge-api-2 3/3 Running 0 3m37s 10.128.1.170 master-2 <none> <none>

    glance-default-external-api-0 3/3 Running 0 71m 10.129.0.72 master-0 <none> <none>
    glance-default-external-api-1 3/3 Running 0 72m 10.128.1.152 master-2 <none> <none>
    glance-default-external-api-2 3/3 Running 0 72m 10.130.0.208 master-1 <none> <none>

    glance-default-internal-api-0 3/3 Running 0 72m 10.128.1.153 master-2 <none> <none>
    glance-default-internal-api-1 3/3 Running 0 72m 10.129.0.70 master-0 <none> <none>
    glance-default-internal-api-2 3/3 Running 0 72m 10.130.0.207 master-1 <none> <none>
    ```
    We can use the same approach to apply `nodeAffinity` to `bzone` and `azone`
    glanceAPIs and observe Pods being scheduled on the nodes that belong to
    `zoneB` and `zoneA`.
    We create the following CRs:
    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-bzone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneB
    ```

    and:

    ```yaml
    apiVersion: topology.openstack.org/v1beta1
    kind: Topology
    metadata:
      name: glance-bzone-node-affinity
      namespace: openstack
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - zoneB
    ```
    ```
    $ oc get topology

    NAME
    glance-azone-node-affinity
    glance-bzone-node-affinity
    glance-czone
    glance-czone-node-affinity
    glance-default
    ```


    ```
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    glance-azone-edge-api-0 3/3 Running 0 39s 10.129.0.161 master-0 <none> <none>
    glance-azone-edge-api-1 3/3 Running 0 51s 10.129.0.160 master-0 <none> <none>
    glance-azone-edge-api-2 3/3 Running 0 64s 10.129.0.159 master-0 <none> <none>
    glance-bzone-edge-api-0 3/3 Running 0 15m 10.130.0.239 master-1 <none> <none>
    glance-bzone-edge-api-1 3/3 Running 0 15m 10.130.0.238 master-1 <none> <none>
    glance-bzone-edge-api-2 3/3 Running 0 15m 10.130.0.237 master-1 <none> <none>
    glance-czone-edge-api-0 3/3 Running 0 124m 10.128.1.172 master-2 <none> <none>
    glance-czone-edge-api-1 3/3 Running 0 124m 10.128.1.171 master-2 <none> <none>
    glance-czone-edge-api-2 3/3 Running 0 124m 10.128.1.170 master-2 <none> <none>
    glance-default-external-api-0 3/3 Running 0 3h12m 10.129.0.72 master-0 <none> <none>
    glance-default-external-api-1 3/3 Running 0 3h13m 10.128.1.152 master-2 <none> <none>
    glance-default-external-api-2 3/3 Running 0 3h13m 10.130.0.208 master-1 <none> <none>
    glance-default-internal-api-0 3/3 Running 0 3h13m 10.128.1.153 master-2 <none> <none>
    glance-default-internal-api-1 3/3 Running 0 3h13m 10.129.0.70 master-0 <none> <none>
    glance-default-internal-api-2 3/3 Running 0 3h14m 10.130.0.207 master-1 <none> <none>
    ```

    ```
    +--------+       +--------+       +--------+
    |        |       |        |       |        |
    | ZONE A |       | ZONE B |       | ZONE C |
    |  (m0)  |       |  (m1)  |       |  (m2)  |
    |        |       |        |       |        |
    +--------+       +--------+       +--------+
     |_ api-ext-0     |_ api-ext-1     |_ api-ext-2
     |                |                |
     |_ api-int-0     |_ api-int-1     |_ api-int-2
     |                |                |
     |_ azone-edge-0  |_ bzone-edge-0  |_ czone-edge-0
     |_ azone-edge-1  |_ bzone-edge-1  |_ czone-edge-1
     |_ azone-edge-2  |_ bzone-edge-2  |_ czone-edge-2
    ```