### Topology and Affinity notes

```bash
oc label nodes master-0 node=node0 zone=zoneA --overwrite
oc label nodes master-1 node=node1 zone=zoneB --overwrite
oc label nodes master-2 node=node2 zone=zoneC --overwrite
```

```
+--------+        +--------+        +--------+
|        |        |        |        |        |
| ZONE A |        | ZONE B |        | ZONE C |
|        |        |        |        |        |
+--------+        +--------+        +--------+
 |_ api-ext-0      |_ api-ext-1      |_ api-ext-2
 |_ api-int-0      |_ api-int-1      |_ api-int-2
```

```yaml
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-default-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        service: glance
```

In this case we can observe how the `TopologySpreadConstraints` aligns with the default `preferredAntiAffinityRules`: the system is stable and the scheduler is able to schedule `Pods` in each zone.

The `topologySpreadConstraints` is applied to Pods that match the specified `labelSelector`, regardless of whether they are replicas of the same Pod or different Pods. A key point to understand is that in Kubernetes there is no real concept of "replicas of the same Pod" at the scheduler level: what we commonly call "replicas" are individual Pods that share the same labels and are created by a controller (a Deployment, StatefulSet, etc.). Each Pod is scheduled independently, even if it was created as part of the same set of replicas.

In the example above, the `topologySpreadConstraints` applies to **ALL** 6 Pods, because they all match the `labelSelector` `service: glance`. The scheduler tries to spread them across the nodes according to the constraint, treating them as a single group of 6 Pods, not as two separate groups (external and internal API) of 3 replicas each.

When we define a `TopologySpreadConstraints`, `maxSkew` plays an important role. The Kubernetes scheduler evaluates the constraint by computing, for each topology domain (a zone, in our case), the skew:

```
skew = podsInZone - minPodsInAnyZone
```

Where:

- **podsInZone**: the number of matching Pods in the zone being evaluated
- **minPodsInAnyZone**: the number of matching Pods in the least populated eligible zone (the global minimum)

For example, with `7 pods` and `3 zones`:

```
If distribution is [3,2,2], max skew is 3 - 2 = 1
If distribution is [4,2,1], max skew is 4 - 1 = 3
```

The `maxSkew` parameter represents **the maximum allowed difference between the most populated zone and the least populated one**. If we set `maxSkew: 1`:

```
- [3,2,2] would be allowed (skew 1 <= 1)
- [4,2,1] would not be allowed (skew 3 > 1)
```

In summary:

- the scheduler tries to minimize skew while respecting other constraints
- a higher `maxSkew` allows a more uneven distribution
- a lower `maxSkew` enforces a more balanced distribution
- `maxSkew: 1` is a common choice to reach a reasonable balance
- `whenUnsatisfiable: DoNotSchedule` prevents exceeding `maxSkew`
- `whenUnsatisfiable: ScheduleAnyway` allows exceeding it if necessary (see the sketch below)
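For comparison, here is a minimal sketch of the same spread rule relaxed with `whenUnsatisfiable: ScheduleAnyway`; the CR name is purely illustrative, everything else mirrors the `glance-default-spread-pods` example above. With this setting the scheduler still tries to honor `maxSkew`, but it will place a Pod even when the constraint cannot be satisfied (for example, while a zone is temporarily unschedulable).

```yaml
---
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  # illustrative name: same spread rule as glance-default-spread-pods,
  # but expressed as a soft (best-effort) constraint
  name: glance-default-spread-pods-soft
  namespace: openstack
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    # ScheduleAnyway: prefer to respect maxSkew, but never block scheduling
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        service: glance
```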
To spread different types of Pods independently, we need to use different labels and a different topology spread constraint for each type. For example, we can select only a subset of `Pods` (e.g. a specific `GlanceAPI`, called `czone`), and spread the resulting Pods across `zoneA`, `zoneB` and `zoneC`. To select only the `czone` glanceAPI we rely on `matchExpressions`, which fits well a context where we do not necessarily propagate the same label keys from the top level CR to the resulting Pods. The following example selects the `glance-czone-edge-api` Pods and spreads them across the existing Kubernetes nodes.

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-spread-pods
  namespace: openstack
spec:
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: glanceAPI
        operator: In
        values:
        - czone
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
```

To achieve the above, a glanceAPI called `czone` has been created, and a `topologyRef` pointing to `glance-czone-spread-pods` has been applied.

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |_ czone-edge-0     |_ czone-edge-1     |_ czone-edge-2
```

However, a quite common scenario is the exact opposite model, where `Pods` are scheduled in a specific zone. In our example, the idea is to take all the `Pods` that belong to `glanceAPI: glance-czone-edge` and schedule all of them in `zoneC` (which corresponds to `master-2` in a three node environment). To achieve this goal, we create the following topology CR:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-czone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneC
```

The above establishes a **nodeAffinity** to make sure we schedule the `czone` glanceAPI Pods in `zoneC`. In other words, we **require** the `czone` `Pods` to be scheduled on a node that carries the `zone: zoneC` label, and this condition is stronger than the `preferredAntiAffinityRules` applied by default to the StatefulSet. Note that in this case we do not need any `TopologySpreadConstraint`, because we are not really interested in the Pod distribution: we are trying to achieve isolation between AZs.

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
                                         |_ czone-edge-0
                                         |_ czone-edge-1
                                         |_ czone-edge-2
```

The picture above can be checked with the following:

```
Every 2.0s: oc get pods -l service=glance -o wide

NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-czone-edge-api-0         3/3     Running   0          3m13s   10.128.1.172   master-2
glance-czone-edge-api-1         3/3     Running   0          3m25s   10.128.1.171   master-2
glance-czone-edge-api-2         3/3     Running   0          3m37s   10.128.1.170   master-2
glance-default-external-api-0   3/3     Running   0          71m     10.129.0.72    master-0
glance-default-external-api-1   3/3     Running   0          72m     10.128.1.152   master-2
glance-default-external-api-2   3/3     Running   0          72m     10.130.0.208   master-1
glance-default-internal-api-0   3/3     Running   0          72m     10.128.1.153   master-2
glance-default-internal-api-1   3/3     Running   0          72m     10.129.0.70    master-0
glance-default-internal-api-2   3/3     Running   0          72m     10.130.0.207   master-1
```

We can use the same approach to apply `nodeAffinity` to the `bzone` and `azone` glanceAPIs and observe Pods being scheduled on the nodes that belong to `zoneB` and `zoneA`.
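Since the `nodeAffinity` shown above keys on the `zone` node label, it is worth double-checking the labels applied at the beginning of this section before creating the remaining CRs. This is plain `oc` usage, nothing Glance specific:

```bash
# Print the nodes together with the node/zone labels assigned earlier;
# -L (--label-columns) adds one output column per label key
oc get nodes -L node -L zone
```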
We create the following CRs:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-bzone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneB
```

and:

```yaml
apiVersion: topology.openstack.org/v1beta1
kind: Topology
metadata:
  name: glance-azone-node-affinity
  namespace: openstack
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - zoneA
```

```
$ oc get topology
NAME
glance-azone-node-affinity
glance-bzone-node-affinity
glance-czone-node-affinity
glance-czone-spread-pods
glance-default-spread-pods
```

```
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
glance-azone-edge-api-0         3/3     Running   0          39s     10.129.0.161   master-0
glance-azone-edge-api-1         3/3     Running   0          51s     10.129.0.160   master-0
glance-azone-edge-api-2         3/3     Running   0          64s     10.129.0.159   master-0
glance-bzone-edge-api-0         3/3     Running   0          15m     10.130.0.239   master-1
glance-bzone-edge-api-1         3/3     Running   0          15m     10.130.0.238   master-1
glance-bzone-edge-api-2         3/3     Running   0          15m     10.130.0.237   master-1
glance-czone-edge-api-0         3/3     Running   0          124m    10.128.1.172   master-2
glance-czone-edge-api-1         3/3     Running   0          124m    10.128.1.171   master-2
glance-czone-edge-api-2         3/3     Running   0          124m    10.128.1.170   master-2
glance-default-external-api-0   3/3     Running   0          3h12m   10.129.0.72    master-0
glance-default-external-api-1   3/3     Running   0          3h13m   10.128.1.152   master-2
glance-default-external-api-2   3/3     Running   0          3h13m   10.130.0.208   master-1
glance-default-internal-api-0   3/3     Running   0          3h13m   10.128.1.153   master-2
glance-default-internal-api-1   3/3     Running   0          3h13m   10.129.0.70    master-0
glance-default-internal-api-2   3/3     Running   0          3h14m   10.130.0.207   master-1
```

```
+--------+          +--------+          +--------+
|        |          |        |          |        |
| ZONE A |          | ZONE B |          | ZONE C |
|  (m0)  |          |  (m1)  |          |  (m2)  |
|        |          |        |          |        |
+--------+          +--------+          +--------+
 |_ api-ext-0        |_ api-ext-1        |_ api-ext-2
 |_ api-int-0        |_ api-int-1        |_ api-int-2
 |                   |                   |
 |_ azone-edge-0     |_ bzone-edge-0     |_ czone-edge-0
 |_ azone-edge-1     |_ bzone-edge-1     |_ czone-edge-1
 |_ azone-edge-2     |_ bzone-edge-2     |_ czone-edge-2
```
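To tie the pieces together, the snippet below sketches how the `Topology` CRs created in this section could be referenced from the individual glanceAPIs through `topologyRef`. The `Glance` spec is heavily abbreviated and the exact field layout (the `glanceAPIs` keys, `type`, `replicas`, `topologyRef`) is an assumption based on the description above; it should be validated against the glance-operator API rather than taken as verified output.

```yaml
apiVersion: glance.openstack.org/v1beta1
kind: Glance
metadata:
  name: glance
  namespace: openstack
spec:
  # ... storage, database and service related fields omitted ...
  glanceAPIs:
    default:
      replicas: 3
      # assumed wiring: spread the default API across the three zones
      topologyRef:
        name: glance-default-spread-pods
    azone:
      type: edge
      replicas: 3
      # assumed wiring: pin the azone API to zoneA nodes
      topologyRef:
        name: glance-azone-node-affinity
    bzone:
      type: edge
      replicas: 3
      topologyRef:
        name: glance-bzone-node-affinity
    czone:
      type: edge
      replicas: 3
      topologyRef:
        name: glance-czone-node-affinity
```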