@enixdark
Forked from srampal/ebpf-k8s-services.md
Created December 8, 2022 04:40

Revisions

  1. @srampal srampal revised this gist Sep 14, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -6,7 +6,7 @@ Goals and Priorities

    * Build an eBpf based implementation of Kubernetes Services (ClusterIP, NodePort, LoadBalancer) to replace Kube-proxy/ iptables and CNI based implementations of Kubernetes services.
    * The goal is not "use as much eBpf" as possible but rather to use eBpf selectively and opportunistically and also to leverage standard kernel datapaths as much as possible unless there is a good reason to do otherwise.
    * Since iptables packages are being deprecated in RHEL, it is necessary to have an implementation of kube-proxy that does not depend on iptables. See [iptables deprecation](https://access.redhat.com/solutions/6739041)
    * Since iptables packages are being deprecated in the Linux kernel and RHEL, it is necessary to have an implementation of kube-proxy that does not depend on iptables. See [iptables deprecation](https://access.redhat.com/solutions/6739041)
    * The primary design requirement is to retain the end user experience for stability and debuggability when replacing the kube-proxy/ iptables based datapath. This requirement is more important than flat-out data plane performance if that comes at the cost of stability, debuggability and familiarity for end users.

    Approaches Evaluated
  2. @srampal srampal revised this gist Sep 13, 2022. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -16,6 +16,8 @@ Approaches Evaluated
    * A2: Leverage conntrack module and nat tables from Linux kernel but use new eBpf tc/ xdp programs to set these up
    * A3: Use Socket based load balancing and data path techniques to bypass kernel conntrack, netfilter and nat datapaths.

    Details of these approaches are documented separately, but very briefly: approaches A1 and A2 try to mirror the data path logic of the iptables based Kube-Proxy implementation without actually using iptables. Approach A3, in contrast, would rely on a socket level L4 proxy function implemented in eBpf for terminating and re-initiating connections from external clients to destination NodePorts.

    Based on analysis of pros/ cons of these options and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.

    As a side note, the [Cilium project](https://github.com/cilium/cilium) uses a design similar to approach A3 for implementing ClusterIP services using eBpf and a design similar to approach A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the Cilium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.
    @@ -70,4 +72,3 @@ struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *);
    Note: Review these and change the xdp versions to the skb/ tc versions based on code updates in progress in the upstream kernel.

    The logic of these functions is self-explanatory (CRUD of CT/ connection tracking entries and NAT mappings in the Linux kernel). The current timeline is for most of these helpers to be available in kernel v6.0 and the remainder in kernel v6.1, both of which should GA before the end of CY 2022. Additional details to be added after completion of Phase 1.

  3. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -69,4 +69,5 @@ struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *);

    Note: Review these and change the xdp versions to the skb/ tc versions based on code updates in progress in the upstream kernel.

    The logic of these functions is self-explanatory (CRUD of CT/ connection tracking entries and NAT mappings in the Linux kernel). The current timeline is for most of these helpers to be available in kernel v6.0 and the remainder in kernel v6.1, both of which should GA before the end of CY 2022.
    The logic of these functions is self-explanatory (CRUD of CT/ connection tracking entries and NAT mappings in the Linux kernel). The current timeline is for most of these helpers to be available in kernel v6.0 and the remainder in kernel v6.1, both of which should GA before the end of CY 2022. Additional details to be added after completion of Phase 1.

  4. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -18,7 +18,7 @@ Approaches Evaluated

    Based on analysis of pros/ cons of these options and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.

    As a side note, the Cilium project uses a design similar to approach A3 for implementing ClusterIP services using eBpf and a design similar to approach A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the Cilium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.
    As a side note, the [Cilium project](https://github.com/cilium/cilium) uses a design similar to approach A3 for implementing ClusterIP services using eBpf and a design similar to approach A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the Cilium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.

    Phase 1
    -------
  5. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -18,12 +18,12 @@ Approaches Evaluated

    Based on analysis of pros/ cons of these options and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.

    As a side note, the Cilium project uses an approach similar to A3 for implementing ClusterIP services using eBpf and an approach similar to A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the CIlium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.
    As a side note, the Cilium project uses a design similar to approach A3 for implementing ClusterIP services using eBpf and a design similar to approach A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the Cilium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.

    Phase 1
    -------

    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release. In Phase 1, we will leverage the Kube-Proxy NG (aka KPNG) project as the baseline controller for watching and processing K8s services. However, this project will not be completely tied to KPNG or to a specific KPNG backend, and in Phase 2 we will make a call whether to continue with KPNG or not, depending on the upstream readiness of KPNG and the appropriate KPNG backend at that time.
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release. In Phase 1, we will leverage the [Kube-Proxy NG](https://github.com/kubernetes-sigs/kpng) (aka KPNG) project as the baseline controller for watching and processing K8s services. However, this project will not be completely tied to KPNG or to a specific KPNG backend, and in Phase 2 we will make a call whether to continue with KPNG or not, depending on the upstream readiness of KPNG and the appropriate KPNG backend at that time.

    Functional Design for Conntrack based NodePort Service
    ------------------------------------------------------
  6. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -23,7 +23,7 @@ As a side note, the Cilium project uses an approach similar to A3 for implementi
    Phase 1
    -------

    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release. In Phase 1, we will leverage the Kube-Proxy NG (aka KPNG) project as the baseline controller for watching and processing K8s services. However, this project will not be completely tied to KPNG or to a specific KPNG backend, and in Phase 2 we will make a call whether to continue with KPNG or not, depending on the upstream readiness of KPNG and the appropriate KPNG backend at that time.

    Functional Design for Conntrack based NodePort Service
    ------------------------------------------------------
  7. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 2 additions and 158 deletions.
    160 changes: 2 additions & 158 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -66,163 +66,7 @@ bpf_xdp_ct_alloc(struct xdp_md *, struct bpf_sock_tuple *,
    u32, struct bpf_ct_opts *, u32);
    struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *);
    ```
    Note: Review these and change the xdp versions to the skb/ tc versions based on code updates in progress in the upstream kernel.

    Note: Review these and change the xdp versions to the skb/ tc versions based on code updates in progress in the upstream kernel.

    An h1 header
    ============

    Paragraphs are separated by a blank line.

    2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists
    look like:

    * this one
    * that one
    * the other one

    Note that --- not considering the asterisk --- the actual text
    content starts at 4-columns in.

    > Block quotes are
    > written like so.
    >
    > They can span multiple paragraphs,
    > if you like.
    Use 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it's all
    in chapters 12--14"). Three dots ... will be converted to an ellipsis.
    Unicode is supported. ☺



    An h2 header
    ------------

    Here's a numbered list:

    1. first item
    2. second item
    3. third item

    Note again how the actual text starts at 4 columns in (4 characters
    from the left side). Here's a code sample:

    # Let me re-iterate ...
    for i in 1 .. 10 { do-something(i) }

    As you probably guessed, indented 4 spaces. By the way, instead of
    indenting the block, you can use delimited blocks, if you like:

    ~~~
    define foobar() {
    print "Welcome to flavor country!";
    }
    ~~~

    (which makes copying & pasting easier). You can optionally mark the
    delimited block for Pandoc to syntax highlight it:

    ~~~python
    import time
    # Quick, count to ten!
    for i in range(10):
    # (but not *too* quick)
    time.sleep(0.5)
    print i
    ~~~



    ### An h3 header ###

    Now a nested list:

    1. First, get these ingredients:

    * carrots
    * celery
    * lentils

    2. Boil some water.

    3. Dump everything in the pot and follow
    this algorithm:

    find wooden spoon
    uncover pot
    stir
    cover pot
    balance wooden spoon precariously on pot handle
    wait 10 minutes
    goto first step (or shut off burner when done)

    Do not bump wooden spoon or it will fall.

    Notice again how text always lines up on 4-space indents (including
    that last line which continues item 3 above).

    Here's a link to [a website](http://foo.bar), to a [local
    doc](local-doc.html), and to a [section heading in the current
    doc](#an-h2-header). Here's a footnote [^1].

    [^1]: Footnote text goes here.

    Tables can look like this:

    size material color
    ---- ------------ ------------
    9 leather brown
    10 hemp canvas natural
    11 glass transparent

    Table: Shoes, their sizes, and what they're made of

    (The above is the caption for the table.) Pandoc also supports
    multi-line tables:

    -------- -----------------------
    keyword text
    -------- -----------------------
    red Sunsets, apples, and
    other red or reddish
    things.

    green Leaves, grass, frogs
    and other things it's
    not easy being.
    -------- -----------------------

    A horizontal rule follows.

    ***

    Here's a definition list:

    apples
    : Good for making applesauce.
    oranges
    : Citrus!
    tomatoes
    : There's no "e" in tomatoe.

    Again, text is indented 4 spaces. (Put a blank line between each
    term/definition pair to spread things out more.)

    Here's a "line block":

    | Line one
    | Line too
    | Line tree

    and images can be specified like so:

    ![example image](example-image.jpg "An exemplary image")

    Inline math equations go in like so: $\omega = d\phi / dt$. Display
    math should get its own line and be put in in double-dollarsigns:

    $$I = \int \rho R^{2} dV$$

    And note that you can backslash-escape any punctuation characters
    which you wish to be displayed literally, ex.: \`foo\`, \*bar\*, etc.
    The logic of these functions is self-explanatory (CRUD of CT/ connection tracking entries and NAT mappings in the Linux kernel). The current timeline is for most of these helpers to be available in kernel v6.0 and the remainder in kernel v6.1, both of which should GA before the end of CY 2022.
  8. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 29 additions and 2 deletions.
    31 changes: 29 additions & 2 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -18,6 +18,8 @@ Approaches Evaluated

    Based on analysis of pros/ cons of these options and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.

    As a side note, the Cilium project uses an approach similar to A3 for implementing ClusterIP services using eBpf and an approach similar to A1 for NodePort services. The eBpf helper functions needed for approach A2 are only recently getting developed and the current direction for this project is to use the early pre-GA versions of that infrastructure. These options were not available to the CIlium project. This also fits in line with our requirement to leverage the Linux kernel as much as possible and focus on stability, debuggability and familiarity for end users. However we will continue to track all 3 options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end goal requirements and priorities.

    Phase 1
    -------

    @@ -33,13 +35,38 @@ The figures below illustrate the K8s NodePort service. A Kubernetes service has
    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png "Figure 1b: Nodeport service Multi-node cluster")


    Approach A2 for NodePort services
    ---------------------------------

    The figures below illustrate the Linux Kernel and Netfilter data path for traffic received from external clients and destined to Kubernetes NodePort services.

    ![nodeport-ebpf-3](https://user-images.githubusercontent.com/8584400/189737902-7175da4d-9c8e-4a53-9d17-5eb1ac8563ad.png)

    ![nodeport-ebpf-4](https://user-images.githubusercontent.com/8584400/189737979-cea640ac-a810-4cdb-955c-18b47ccbfb51.png)



    We can see that traffic normally would need to go through a Conntrack module in order to implement stateful tracking and NAT operations. For approach A2, we could choose to have eBpf programs at the xdp or tc-ingress hook points as shown in the figure. Initially we will use tc-ingress based hooks and eventually add xdp (and possibly even socket lookup) based eBpf programs.

    With a tc-ingress based approach, the basic logic of the tc-eBpf program using approach A2 will be to intercept incoming traffic, check if it is destined to a NodePort service, and check if the Linux kernel has already created connection tracking state for this connection. If this traffic is for a new connection, the eBpf program will perform a load balancing operation to select a backend and then call the kernel to create connection tracking entries as well as NAT mappings according to its selected load balanced backend. The new eBpf helper functions will be used to perform these updates in the Linux kernel. The packet will then be allowed to use the normal Linux kernel datapath and will get load balanced and DNATed (and possibly SNATed) according to the CT and NAT mapping entries set up by the tc-ingress eBpf program. Similarly, reverse NAT entries will also be set up so that return traffic for the same connection is matched to the same connection state and un-NAT'ed accordingly.
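
As a rough illustration of the tc-ingress logic described above, the sketch below models it in plain userspace C rather than eBpf. Everything in it is an assumption made for illustration: the array standing in for the kernel conntrack table, the hash based backend choice, the backend addresses, and the names `flow`, `ct_entry`, `handle_ingress` and `NODEPORT` (31000, borrowed from the NodePort example in this document). It does not call the real conntrack helper functions and it omits the reverse NAT entries.

```c
#include <stddef.h>
#include <stdint.h>

#define NODEPORT     31000  /* hypothetical NodePort, taken from the example */
#define MAX_CT       64     /* size of the toy conntrack table */
#define NUM_BACKENDS 3

struct flow     { uint32_t src_ip; uint16_t src_port; uint32_t dst_ip; uint16_t dst_port; };
struct backend  { uint32_t ip; uint16_t port; };
struct ct_entry { int in_use; struct flow orig; struct backend dnat; };

enum verdict { NOT_NODEPORT, CT_HIT, CT_CREATED, CT_FULL };

/* Stand-ins for kernel conntrack state and the service's backend pods. */
static struct ct_entry ct_table[MAX_CT];
static const struct backend backends[NUM_BACKENDS] = {
    { 0x0a00010au, 8080 }, { 0x0a00010bu, 8080 }, { 0x0a00020au, 8080 },
};

static int flow_equal(const struct flow *a, const struct flow *b)
{
    return a->src_ip == b->src_ip && a->src_port == b->src_port &&
           a->dst_ip == b->dst_ip && a->dst_port == b->dst_port;
}

/* Models "does the kernel already track this connection?" */
static struct ct_entry *ct_lookup(const struct flow *f)
{
    for (size_t i = 0; i < MAX_CT; i++)
        if (ct_table[i].in_use && flow_equal(&ct_table[i].orig, f))
            return &ct_table[i];
    return NULL;
}

/* Models creating a CT entry together with its DNAT mapping. */
static struct ct_entry *ct_insert(const struct flow *f, const struct backend *b)
{
    for (size_t i = 0; i < MAX_CT; i++) {
        if (!ct_table[i].in_use) {
            ct_table[i].in_use = 1;
            ct_table[i].orig = *f;
            ct_table[i].dnat = *b;
            return &ct_table[i];
        }
    }
    return NULL; /* table full */
}

/* Toy load balancer: hash of the client address and port. */
static const struct backend *select_backend(const struct flow *f)
{
    return &backends[(f->src_ip ^ f->src_port) % NUM_BACKENDS];
}

/* Per-packet decision; *out is the backend the packet would be DNATed to. */
static enum verdict handle_ingress(const struct flow *f, struct backend *out)
{
    if (f->dst_port != NODEPORT)
        return NOT_NODEPORT;          /* not service traffic, leave untouched */

    struct ct_entry *e = ct_lookup(f);
    if (e) {                          /* existing connection: reuse mapping */
        *out = e->dnat;
        return CT_HIT;
    }

    e = ct_insert(f, select_backend(f)); /* new connection: pick and record */
    if (!e)
        return CT_FULL;
    *out = e->dnat;
    return CT_CREATED;
}
```

Calling `handle_ingress` twice with the same flow yields `CT_CREATED` and then `CT_HIT` with the same backend, which is the per-connection stickiness that the conntrack state is meant to provide.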

    The new eBpf helper functions that will be used to implement this functionality include:

    ```
    struct nf_conn *
    bpf_xdp_ct_lookup(struct xdp_md *, struct bpf_sock_tuple *, u32,
    struct bpf_ct_opts *, u32);
    void bpf_ct_release(struct nf_conn *);
    void bpf_ct_set_timeout(struct nf_conn___init *, u32);
    int bpf_ct_set_status(const struct nf_conn___init *, u32 );
    int bpf_ct_change_timeout(struct nf_conn *, u32);
    int bpf_ct_set_nat_info(struct nf_conn___init *, union nf_inet_addr *,
    __be16 *, enum nf_nat_manip_type);
    struct nf_conn___init *
    bpf_xdp_ct_alloc(struct xdp_md *, struct bpf_sock_tuple *,
    u32, struct bpf_ct_opts *, u32);
    struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *);
    ```
    Note: Review these and change the xdp versions to the skb/ tc versions based on code updates in progress in the upstream kernel.


    An h1 header
  9. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -23,11 +23,17 @@ Phase 1

    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.

    Functional Design for Conntrack based NodePort Service
    ------------------------------------------------------

    The figures below illustrate the K8s NodePort service. A Kubernetes service has been exposed externally on port 31000 of all nodes of a Kubernetes cluster. In the first figure, we have a single node k8s cluster and all backend pods of this service are located on that one node. Traffic from each of the external clients a.a.a.a and b.b.b.b is load balanced (and NAT'ed) to one of these backend pods. In the second figure, we have a multi-node k8s cluster and backends are distributed across the nodes. In this case, traffic may come into the cluster via one node and the load balancing decision may pick a backend pod that resides on a different node, in which case this traffic will need to get re-routed back out to the right node in order to reach the selected backend pod. There are several additional details here related to load balancing policy, handling of external traffic and return traffic, etc., which we do not list here but are documented elsewhere, including in the upstream Kubernetes documentation.
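
The multi-node case above can be sketched in a few lines of C. This is only a toy model under stated assumptions: the pod names, the node numbering and the round-robin policy are all hypothetical (the actual load balancing policy is documented elsewhere), and `pick_backend` and `must_forward` are illustrative names, not part of any existing API.

```c
#include <stddef.h>

/* A backend pod and the node it runs on (both hypothetical). */
struct backend_pod { const char *name; int node_id; };

static const struct backend_pod pods[] = {
    { "pod-1", 0 }, { "pod-2", 1 }, { "pod-3", 1 },
};
#define NUM_PODS (sizeof(pods) / sizeof(pods[0]))

static size_t next_pod; /* round-robin cursor, assumed policy */

/* Pick the next backend regardless of which node received the traffic. */
static const struct backend_pod *pick_backend(void)
{
    const struct backend_pod *p = &pods[next_pod];
    next_pod = (next_pod + 1) % NUM_PODS;
    return p;
}

/* Non-zero when the chosen pod lives on a node other than the ingress node,
 * i.e. the traffic has to be re-routed out to that node. */
static int must_forward(int ingress_node, const struct backend_pod *p)
{
    return p->node_id != ingress_node;
}
```

With traffic arriving on node 0, the first pick stays local while the second and third picks land on node 1 and would need to be forwarded there.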

    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png "Figure 1a: Nodeport service 1 node cluster")

    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png "Figure 1b: Nodeport service Multi-node cluster")



    ![nodeport-ebpf-3](https://user-images.githubusercontent.com/8584400/189737902-7175da4d-9c8e-4a53-9d17-5eb1ac8563ad.png)

    ![nodeport-ebpf-4](https://user-images.githubusercontent.com/8584400/189737979-cea640ac-a810-4cdb-955c-18b47ccbfb51.png)
  10. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -24,9 +24,9 @@ Phase 1
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.


    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png)
    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png "Figure 1a: Nodeport service 1 node cluster")

    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png)
    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png "Figure 1b: Nodeport service Multi-node cluster")

    ![nodeport-ebpf-3](https://user-images.githubusercontent.com/8584400/189737902-7175da4d-9c8e-4a53-9d17-5eb1ac8563ad.png)

  11. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -24,7 +24,7 @@ Phase 1
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.


    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png, "Figure 1a: Basic Nodeport service 1-node")
    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png)

    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png)

  12. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -24,7 +24,7 @@ Phase 1
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.


    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png)
    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png, "Figure 1a: Basic Nodeport service 1-node")

    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png)

  13. @srampal srampal revised this gist Sep 12, 2022. 1 changed file with 4 additions and 7 deletions.
    11 changes: 4 additions & 7 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -24,16 +24,13 @@ Phase 1
    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.


    ![nodeport-ebpf-1](https://user-images.githubusercontent.com/8584400/189737740-3d34574e-bec0-4882-8a2a-b153b87a6ade.png)

    ![nodeport-ebpf-2](https://user-images.githubusercontent.com/8584400/189737841-75c6f7e5-d51d-44e4-b6c1-2c70c07e9e41.png)

    ![nodeport-1](./nodeport-ebpf-1.png "K8s NodePort Datapath w/ 1 node")
    ![nodeport-2](./nodeport-ebpf-2.png "K8s NodePort Datapath w/ 2-node cluster")
    ![nodeport-3](./nodeport-ebpf-3.png "K8s NodePort Diagram 3")
    ![nodeport-4](./nodeport-ebpf-4.png "K8s NodePort Diagram 4")



    ![nodeport-ebpf-3](https://user-images.githubusercontent.com/8584400/189737902-7175da4d-9c8e-4a53-9d17-5eb1ac8563ad.png)

    ![nodeport-ebpf-4](https://user-images.githubusercontent.com/8584400/189737979-cea640ac-a810-4cdb-955c-18b47ccbfb51.png)



  14. @srampal srampal revised this gist Sep 9, 2022. 1 changed file with 6 additions and 0 deletions.
    6 changes: 6 additions & 0 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -26,6 +26,12 @@ Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socke



    ![nodeport-1](./nodeport-ebpf-1.png "K8s NodePort Datapath w/ 1 node")
    ![nodeport-2](./nodeport-ebpf-2.png "K8s NodePort Datapath w/ 2-node cluster")
    ![nodeport-3](./nodeport-ebpf-3.png "K8s NodePort Diagram 3")
    ![nodeport-4](./nodeport-ebpf-4.png "K8s NodePort Diagram 4")





  15. @srampal srampal revised this gist Sep 9, 2022. No changes.
  16. @srampal srampal revised this gist Sep 9, 2022. 1 changed file with 8 additions and 3 deletions.
    11 changes: 8 additions & 3 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -12,11 +12,16 @@ Goals and Priorities
    Approaches Evaluated
    --------------------

    * A1: Write a complete new data path including new Connection tracking (conntrack), NAT and load balance modules in eBpf
    * A2: Leverage conntrack module and nat tables from Linux kernel but use eBpf to set these up
    * A1: Write a complete new data path including new Connection tracking (conntrack), NAT and load balance functions/ programs in eBpf
    * A2: Leverage conntrack module and nat tables from Linux kernel but use new eBpf tc/ xdp programs to set these up
    * A3: Use Socket based load balancing and data path techniques to bypass kernel conntrack, netfilter and nat datapaths.

    Based on analysis and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.
    Based on analysis of pros/ cons of these options and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.

    Phase 1
    -------

    Prototype a Kube-proxy replacement implementation using KubeProxy-NG + BPF socket connect based datapath for ClusterIP services (approach A3) and tc-bpf + kernel conntrack/ nat based implementation for NodePort services (i.e. approach A2). Since this phase will rely on new bpf helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and gather learning/ experience for the Phase 2 implementation and eventual release.
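
Editorial illustration (not part of the gist's revision history): the socket connect based approach for ClusterIP services (A3) can be pictured as a lookup done once, when a socket connects, rather than per packet. The following is a minimal userspace sketch of that idea in Python, not the prototype's code. The service table, the addresses, and the function name `translate_on_connect` are all hypothetical; a real implementation would presumably keep the table in a BPF map and do the rewrite in a BPF program attached at the connect hook.

```python
import random
from typing import Dict, List, Optional, Tuple

Endpoint = Tuple[str, int]  # (IPv4 address, port)

# Hypothetical service table: ClusterIP endpoint -> backend pod endpoints.
# Assumption: in a real datapath this would be a BPF map filled in by the
# proxy's control plane; here it is a plain dictionary.
SERVICES: Dict[Endpoint, List[Endpoint]] = {
    ("10.96.0.10", 80): [("10.244.1.5", 8080), ("10.244.2.7", 8080)],
}


def translate_on_connect(dst: Endpoint,
                         rng: Optional[random.Random] = None) -> Endpoint:
    """Return the endpoint a connecting socket should actually use.

    If dst is a known service address, pick one backend at random and
    return it; the application then talks to the pod directly, so no
    per-packet NAT or conntrack entry is needed for this connection.
    Any other destination is returned unchanged.
    """
    backends = SERVICES.get(dst)
    if not backends:
        return dst  # not a service address: leave the connection alone
    chooser = rng if rng is not None else random
    return chooser.choice(backends)


if __name__ == "__main__":
    print(translate_on_connect(("10.96.0.10", 80)))   # one of the two backends
    print(translate_on_connect(("192.0.2.1", 443)))   # unchanged
```

Random selection is only one possible policy; it is used here because it needs no per-service state. The sketch does not model the NodePort path, which the text above assigns to tc-bpf plus kernel conntrack/ nat.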



  17. @srampal srampal revised this gist Sep 9, 2022. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -7,7 +7,7 @@ Goals and Priorities
    * Build an eBpf based implementation of Kubernetes Services (ClusterIP, NodePort, LoadBalancer) to replace Kube-proxy/ iptables and CNI based implementations of Kubernetes services.
    * The goal is not "use as much eBpf" as possible but rather to use eBpf selectively and opportunistically and also to leverage standard kernel datapaths as much as possible unless there is a good reason to do otherwise.
    * Since iptables packages are being deprecated in RHEL, it is necessary to have an implementation of kube-proxy that does not depend on iptables. See [iptables deprecation](https://access.redhat.com/solutions/6739041)
    * Primary design requirement is to retain end user experience for stability and debuggability when replacing the kube-proxy/ iptables based datapath. This requirement is more important than flat-out data plane performance if that comes at the cost of stability and manageability.
    * Primary design requirement is to retain end user experience for stability and debuggability when replacing the kube-proxy/ iptables based datapath. This requirement is more important than flat-out data plane performance if that comes at the cost of stability, debuggability and familiarity for end users.

    Approaches Evaluated
    --------------------
  18. @srampal srampal revised this gist Sep 9, 2022. 1 changed file with 20 additions and 0 deletions.
    20 changes: 20 additions & 0 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,23 @@
    Design and Implementation of K8s Services Proxy using eBpf
    ==========================================================

    Goals and Priorities
    --------------------

    * Build an eBpf based implementation of Kubernetes Services (ClusterIP, NodePort, LoadBalancer) to replace Kube-proxy/ iptables and CNI based implementations of Kubernetes services.
    * The goal is not "use as much eBpf" as possible but rather to use eBpf selectively and opportunistically and also to leverage standard kernel datapaths as much as possible unless there is a good reason to do otherwise.
    * Since iptables packages are being deprecated in RHEL, it is necessary to have an implementation of kube-proxy that does not depend on iptables. See [iptables deprecation](https://access.redhat.com/solutions/6739041)
    * Primary design requirement is to retain end user experience for stability and debuggability when replacing the kube-proxy/ iptables based datapath. This requirement is more important than flat-out data plane performance if that comes at the cost of stability and manageability.

    Approaches Evaluated
    --------------------

    * A1: Write a complete new data path including new Connection tracking (conntrack), NAT and load balance modules in eBpf
    * A2: Leverage conntrack module and nat tables from Linux kernel but use eBpf to set these up
    * A3: Use Socket based load balancing and data path techniques to bypass kernel conntrack, netfilter and nat datapaths.

    Based on analysis and the desired prioritization of user experience and stability, we are currently planning on using approach A2 for this work although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.




  19. @srampal srampal revised this gist Sep 9, 2022. 1 changed file with 10 additions and 0 deletions.
    10 changes: 10 additions & 0 deletions ebpf-k8s-services.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,13 @@










    An h1 header
    ============

  20. @srampal srampal renamed this gist Sep 9, 2022. 1 changed file with 0 additions and 0 deletions.
  21. @srampal srampal created this gist Sep 9, 2022.
    157 changes: 157 additions & 0 deletions Design and Implementation of K8s Services Proxy using eBpf
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,157 @@
    An h1 header
    ============

    Paragraphs are separated by a blank line.

    2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists
    look like:

    * this one
    * that one
    * the other one

    Note that --- not considering the asterisk --- the actual text
    content starts at 4-columns in.

    > Block quotes are
    > written like so.
    >
    > They can span multiple paragraphs,
    > if you like.

    Use 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it's all
    in chapters 12--14"). Three dots ... will be converted to an ellipsis.
    Unicode is supported. ☺



    An h2 header
    ------------

    Here's a numbered list:

    1. first item
    2. second item
    3. third item

    Note again how the actual text starts at 4 columns in (4 characters
    from the left side). Here's a code sample:

    # Let me re-iterate ...
    for i in 1 .. 10 { do-something(i) }

    As you probably guessed, indented 4 spaces. By the way, instead of
    indenting the block, you can use delimited blocks, if you like:

    ~~~
    define foobar() {
    print "Welcome to flavor country!";
    }
    ~~~

    (which makes copying & pasting easier). You can optionally mark the
    delimited block for Pandoc to syntax highlight it:

    ~~~python
    import time
    # Quick, count to ten!
    for i in range(10):
        # (but not *too* quick)
        time.sleep(0.5)
        print(i)
    ~~~



    ### An h3 header ###

    Now a nested list:

    1. First, get these ingredients:

    * carrots
    * celery
    * lentils

    2. Boil some water.

    3. Dump everything in the pot and follow
    this algorithm:

    find wooden spoon
    uncover pot
    stir
    cover pot
    balance wooden spoon precariously on pot handle
    wait 10 minutes
    goto first step (or shut off burner when done)

    Do not bump wooden spoon or it will fall.

    Notice again how text always lines up on 4-space indents (including
    that last line which continues item 3 above).

    Here's a link to [a website](http://foo.bar), to a [local
    doc](local-doc.html), and to a [section heading in the current
    doc](#an-h2-header). Here's a footnote [^1].

    [^1]: Footnote text goes here.

    Tables can look like this:

    size material color
    ---- ------------ ------------
    9 leather brown
    10 hemp canvas natural
    11 glass transparent

    Table: Shoes, their sizes, and what they're made of

    (The above is the caption for the table.) Pandoc also supports
    multi-line tables:

    -------- -----------------------
    keyword text
    -------- -----------------------
    red Sunsets, apples, and
    other red or reddish
    things.

    green Leaves, grass, frogs
    and other things it's
    not easy being.
    -------- -----------------------

    A horizontal rule follows.

    ***

    Here's a definition list:

    apples
    : Good for making applesauce.
    oranges
    : Citrus!
    tomatoes
    : There's no "e" in tomatoe.

    Again, text is indented 4 spaces. (Put a blank line between each
    term/definition pair to spread things out more.)

    Here's a "line block":

    | Line one
    | Line too
    | Line tree

    and images can be specified like so:

    ![example image](example-image.jpg "An exemplary image")

    Inline math equations go in like so: $\omega = d\phi / dt$. Display
    math should get its own line and be put in in double-dollarsigns:

    $$I = \int \rho R^{2} dV$$

    And note that you can backslash-escape any punctuation characters
    which you wish to be displayed literally, ex.: \`foo\`, \*bar\*, etc.