    # dpif, the DataPath InterFace.

    In Open vSwitch terminology, a "datapath" is a flow-based software switch.
    A datapath has no intelligence of its own. Rather, it relies entirely on
    its client to set up flows. The datapath layer is core to the Open vSwitch
    software switch: one could say, without much exaggeration, that everything
    in ovs-vswitchd above dpif exists only to make the correct decisions
    interacting with dpif.

    Typically, the client of a datapath is the software switch module in
    "ovs-vswitchd", but other clients can be written. The "ovs-dpctl" utility
    is also a (simple) client.


    Overview
    ========

    The terms written in quotes below are defined in later sections.

    When a datapath "port" receives a packet, it extracts the headers (the
    "flow"). If the datapath's "flow table" contains a "flow entry" matching
    the packet, then it executes the "actions" in the flow entry and increments
    the flow's statistics. If there is no matching flow entry, the datapath
    instead appends the packet to an "upcall" queue.
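
    To make this concrete, here is a small, self-contained C sketch of that
    per-packet decision. It is purely illustrative: the types and helper names
    are invented for the example (they are not part of the dpif interface), and
    the "flow" is reduced to two header fields.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy "flow": two header fields standing in for the full L2-L4 key. */
    struct flow {
        uint32_t ip_dst;
        uint16_t tp_dst;
    };

    /* Toy flow entry: key, mask (1-bits are matched exactly), one "action". */
    struct flow_entry {
        struct flow flow;
        struct flow mask;
        int out_port;                   /* -1 means drop. */
        uint64_t n_packets;
    };

    static struct flow_entry table[16];
    static size_t n_entries;

    static bool
    masked_match(const struct flow *key, const struct flow_entry *e)
    {
        return (key->ip_dst & e->mask.ip_dst) == (e->flow.ip_dst & e->mask.ip_dst)
            && (key->tp_dst & e->mask.tp_dst) == (e->flow.tp_dst & e->mask.tp_dst);
    }

    /* The per-packet decision described above: on a match, execute the entry's
     * actions and bump its statistics; on a miss, queue an upcall instead. */
    static void
    datapath_receive(const struct flow *key)
    {
        for (size_t i = 0; i < n_entries; i++) {
            if (masked_match(key, &table[i])) {
                table[i].n_packets++;
                printf("hit: output to port %d\n", table[i].out_port);
                return;
            }
        }
        printf("miss: queue an upcall for the client\n");
    }
    ```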


    Ports
    =====

    A datapath has a set of ports that are analogous to the ports on an Ethernet
    switch. At the datapath level, each port has the following information
    associated with it:

    - A name, a short string that must be unique within the host. This is
    typically a name that would be familiar to the system administrator,
    e.g. "eth0" or "vif1.1", but it is otherwise arbitrary.

    - A 32-bit port number that must be unique within the datapath but is
    otherwise arbitrary. The port number is the most important identifier
    for a port in the datapath interface.

    - A type, a short string that identifies the kind of port. On a Linux
    host, typical types are "system" (for a network device such as eth0),
    "internal" (for a simulated port used to connect to the TCP/IP stack),
    and "gre" (for a GRE tunnel).

    - A Netlink PID for each upcall reading thread (see "Upcall Queuing and
    Ordering" below).

    The dpif interface has functions for adding and deleting ports. When a
    datapath implements these (e.g. as the Linux and netdev datapaths do), then
    Open vSwitch's ovs-vswitchd daemon can directly control what ports are used
    for switching. Some datapaths might not implement them, or implement them
    with restrictions on the types of ports that can be added or removed
    (e.g. on ESX), on systems where port membership can only be changed by some
    external entity.

    Each datapath must have a port, sometimes called the "local port", whose
    name is the same as the datapath itself, with port number 0. The local port
    cannot be deleted.

    Ports are available as "struct netdev"s. To obtain a "struct netdev *" for
    a port named 'name' with type 'port_type', in a datapath of type
    'datapath_type', call netdev_open(name, dpif_port_open_type(datapath_type,
    port_type)). The netdev can be used to get and set important data related to
    the port, such as:

    - MTU (netdev_get_mtu(), netdev_set_mtu()).

    - Ethernet address (netdev_get_etheraddr(), netdev_set_etheraddr()).

    - Statistics such as the number of packets and bytes transmitted and
    received (netdev_get_stats()).

    - Carrier status (netdev_get_carrier()).

    - Speed (netdev_get_features()).

    - QoS queue configuration (netdev_get_queue(), netdev_set_queue() and
    related functions).

    - Arbitrary port-specific configuration parameters (netdev_get_config(),
    netdev_set_config()). An example of such a parameter is the IP
    endpoint for a GRE tunnel.
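
    For example, the following sketch opens the netdev behind a datapath port
    and reads its MTU, using netdev_open(), netdev_get_mtu(), and
    netdev_close() as declared in lib/netdev.h (error handling is abbreviated,
    and OVS functions conventionally return 0 or a positive errno value):

    ```c
    #include "dpif.h"
    #include "netdev.h"

    /* Open the netdev for datapath port 'name' and fetch its MTU. */
    static int
    port_mtu(const char *datapath_type, const char *port_type,
             const char *name, int *mtup)
    {
        struct netdev *netdev;
        int error;

        error = netdev_open(name, dpif_port_open_type(datapath_type, port_type),
                            &netdev);
        if (error) {
            return error;
        }

        error = netdev_get_mtu(netdev, mtup);
        netdev_close(netdev);
        return error;
    }
    ```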


    Flow Table
    ==========

    The flow table is a collection of "flow entries". Each flow entry contains:

    - A "flow", that is, a summary of the headers in an Ethernet packet. The
    flow must be unique within the flow table. Flows are fine-grained
    entities that include L2, L3, and L4 headers. A single TCP connection
    consists of two flows, one in each direction.

    In Open vSwitch userspace, "struct flow" is the typical way to describe
    a flow, but the datapath interface uses a different data format to
    allow ABI forward- and backward-compatibility. datapath/README.md
    describes the rationale and design. Refer to OVS_KEY_ATTR_* and
    "struct ovs_key_*" in include/odp-netlink.h for details.
    lib/odp-util.h defines several functions for working with these flows.

    - A "mask" that, for each bit in the flow, specifies whether the datapath
    should consider the corresponding flow bit when deciding whether a
    given packet matches the flow entry. The original datapath design did
    not support masks: every flow entry was an exact match. With the
    addition of a mask, the interface supports datapaths with a spectrum of
    wildcard matching capabilities, from those that only support exact
    matches to those that support bitwise wildcarding on the entire flow
    key, as well as datapaths with capabilities somewhere in between.

    Datapaths do not provide a way to query their wildcarding capabilities,
    nor is it expected that the client should attempt to probe for the
    details of their support. Instead, a client installs flows with masks
    that wildcard as many bits as acceptable. The datapath then actually
    wildcards as many of those bits as it can and changes the wildcard bits
    that it does not support into exact match bits. A datapath that can
    wildcard any bit, for example, would install the supplied mask, an
    exact-match only datapath would install an exact-match mask regardless
    of what mask the client supplied, and a datapath in the middle of the
    spectrum would selectively change some wildcard bits into exact match
    bits. (A short sketch of this narrowing logic appears at the end of
    this section.)

    Regardless of the requested or installed mask, the datapath retains the
    original flow supplied by the client. (It does not, for example, "zero
    out" the wildcarded bits.) This allows the client to unambiguously
    identify the flow entry in later flow table operations.

    The flow table does not have priorities; that is, all flow entries have
    equal priority. Detecting overlapping flow entries is expensive in
    general, so the datapath is not required to do it. It is primarily the
    client's responsibility not to install flow entries whose flow and mask
    combinations overlap.

    - A list of "actions" that tell the datapath what to do with packets
    within a flow. Some examples of actions are OVS_ACTION_ATTR_OUTPUT,
    which transmits the packet out a port, and OVS_ACTION_ATTR_SET, which
    modifies packet headers. Refer to OVS_ACTION_ATTR_* and "struct
    ovs_action_*" in include/odp-netlink.h for details. lib/odp-util.h
    defines several functions for working with datapath actions.

    The actions list may be empty. This indicates that nothing should be
    done to matching packets, that is, they should be dropped.

    (In case you are familiar with OpenFlow, datapath actions are analogous
    to OpenFlow actions.)

    - Statistics: the number of packets and bytes that the flow has
    processed, the last time that the flow processed a packet, and the
    union of all the TCP flags in packets processed by the flow. (The
    latter is 0 if the flow is not a TCP flow.)
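
    In the OVS source tree these statistics are carried in
    "struct dpif_flow_stats" (lib/dpif.h); its shape is roughly the following,
    though the header is the authoritative definition:

    ```c
    #include <stdint.h>

    /* Approximate shape of the per-flow statistics described above. */
    struct dpif_flow_stats {
        uint64_t n_packets;     /* Packets processed by the flow. */
        uint64_t n_bytes;       /* Bytes processed by the flow. */
        long long int used;     /* Last time the flow processed a packet, ms. */
        uint16_t tcp_flags;     /* Union of TCP flags seen; 0 for non-TCP. */
    };
    ```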

    The datapath's client manages the flow table, primarily in reaction to
    "upcalls" (see below).


    Upcalls
    =======

    A datapath sometimes needs to notify its client that a packet was received.
    The datapath mechanism to do this is called an "upcall".

    Upcalls are used in two situations:

    - When a packet is received, but there is no matching flow entry in its
    flow table (a flow table "miss"), this causes an upcall of type
    DPIF_UC_MISS. These are called "miss" upcalls.

    - A datapath action of type OVS_ACTION_ATTR_USERSPACE causes an upcall of
    type DPIF_UC_ACTION. These are called "action" upcalls.

    An upcall contains an entire packet. There is no attempt to, e.g., copy
    only as much of the packet as normally needed to make a forwarding decision.
    Such an optimization is doable, but experimental prototypes showed it to be
    of little benefit because an upcall typically contains the first packet of a
    flow, which is usually short (e.g. a TCP SYN). Also, the entire packet can
    sometimes really be needed.

    After a client reads a given upcall, the datapath is finished with it, that
    is, the datapath doesn't maintain any lingering state past that point.

    The latency from the time that a packet arrives at a port to the time that
    it is received from dpif_recv() is critical in some benchmarks. For
    example, if this latency is 1 ms, then a netperf TCP_CRR test, which opens
    and closes TCP connections one at a time as quickly as it can, cannot
    possibly achieve more than 500 transactions per second, since every
    connection consists of two flows with 1-ms latency to set up each one.

    To receive upcalls, a client has to enable them with dpif_recv_set(). A
    datapath should generally support being opened multiple times (e.g. so that
    one may run "ovs-dpctl show" or "ovs-dpctl dump-flows" while "ovs-vswitchd"
    is also running) but need not support more than one of these clients
    enabling upcalls at once.
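
    A minimal sketch of a client doing this, using dpif_open(),
    dpif_recv_set(), and dpif_close() from lib/dpif.h. The datapath name
    "ovs-system" and type "system" are assumed here (they are the usual ones
    for the Linux kernel datapath); error handling is abbreviated.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    #include "dpif.h"

    /* Open the (already existing) kernel datapath and enable upcall
     * reception.  Returns NULL on failure. */
    static struct dpif *
    open_and_listen(void)
    {
        struct dpif *dpif;

        if (dpif_open("ovs-system", "system", &dpif)) {
            return NULL;
        }
        if (dpif_recv_set(dpif, true)) {
            dpif_close(dpif);
            return NULL;
        }
        return dpif;
    }
    ```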


    Upcall Queuing and Ordering
    ---------------------------

    The datapath's client reads upcalls one at a time by calling dpif_recv().
    When more than one upcall is pending, the order in which the datapath
    presents upcalls to its client is important. The datapath's client does not
    directly control this order, so the datapath implementer must take care
    during design.

    The minimal behavior, suitable for initial testing of a datapath
    implementation, is that all upcalls are appended to a single queue, which is
    delivered to the client in order.

    The datapath should ensure that a high rate of upcalls from one particular
    port cannot cause upcalls from other sources to be dropped or unreasonably
    delayed. Otherwise, one port conducting a port scan or otherwise initiating
    high-rate traffic spanning many flows could suppress other traffic.
    Ideally, the datapath should present upcalls from each port in a "round
    robin" manner, to ensure fairness.

    The client has no control over "miss" upcalls and no insight into the
    datapath's implementation, so the datapath is entirely responsible for
    queuing and delivering them. On the other hand, the datapath has
    considerable freedom of implementation. One good approach is to maintain a
    separate queue for each port, to prevent any given port's upcalls from
    interfering with other ports' upcalls. If this is impractical, then another
    reasonable choice is to maintain some fixed number of queues and assign each
    port to one of them. Ports assigned to the same queue can then interfere
    with each other, but not with ports assigned to different queues. Other
    approaches are also possible.
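
    The "fixed number of queues" approach can be sketched as follows.
    Everything here (struct upcall, queue_pop(), the choice of hash) is
    illustrative and not part of the dpif interface.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define N_QUEUES 16

    /* Assign each port to one of a fixed number of upcall queues. */
    static size_t
    queue_for_port(uint32_t port_no)
    {
        return port_no % N_QUEUES;          /* Or a better hash of the port. */
    }

    struct upcall;
    struct upcall *queue_pop(size_t queue_idx);   /* Returns NULL if empty. */

    /* Drain the queues round-robin so that a flood of upcalls in one queue
     * cannot starve the others. */
    static struct upcall *
    next_upcall(void)
    {
        static size_t next = 0;
        for (size_t i = 0; i < N_QUEUES; i++) {
            size_t q = (next + i) % N_QUEUES;
            struct upcall *upcall = queue_pop(q);
            if (upcall) {
                next = (q + 1) % N_QUEUES;
                return upcall;
            }
        }
        return NULL;
    }
    ```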

    The client has some control over "action" upcalls: it can specify a 32-bit
    "Netlink PID" as part of the action. This terminology comes from the Linux
    datapath implementation, which uses a protocol called Netlink in which a PID
    designates a particular socket and the upcall data is delivered to the
    socket's receive queue. Generically, though, a Netlink PID identifies a
    queue for upcalls. The basic requirements on the datapath are:

    - The datapath must provide a Netlink PID associated with each port. The
    client can retrieve the PID with dpif_port_get_pid().

    - The datapath must provide a "special" Netlink PID not associated with
    any port. dpif_port_get_pid() also provides this PID. (ovs-vswitchd
    uses this PID to queue special packets that must not be lost even if a
    port is otherwise busy, such as packets used for tunnel monitoring.)

    The minimal behavior of dpif_port_get_pid() and the treatment of the Netlink
    PID in "action" upcalls is that dpif_port_get_pid() returns a constant value
    and all upcalls are appended to a single queue.

    The preferred behavior is:

    - Each port has a PID that identifies the queue used for "miss" upcalls
    on that port. (Thus, if each port has its own queue for "miss"
    upcalls, then each port has a different Netlink PID.)

    - "miss" upcalls for a given port and "action" upcalls that specify that
    port's Netlink PID add their upcalls to the same queue. The upcalls
    are delivered to the datapath's client in the order that the packets
    were received, regardless of whether the upcalls are "miss" or "action"
    upcalls.

    - Upcalls that specify the "special" Netlink PID are queued separately.

    Multiple threads may want to read upcalls simultaneously from a single
    datapath. To support multiple threads well, one extends the above preferred
    behavior:

    - Each port has multiple PIDs. The datapath distributes "miss" upcalls
    across the PIDs, ensuring that a given flow is mapped in a stable way
    to a single PID.

    - For "action" upcalls, the thread can specify its own Netlink PID or
    other threads' Netlink PID of the same port for offloading purpose
    (e.g. in a "round robin" manner).


    Packet Format
    =============

    The datapath interface works with packets in a particular form. This is the
    form taken by packets received via upcalls (i.e. by dpif_recv()). Packets
    supplied to the datapath for processing (i.e. to dpif_execute()) also take
    this form.

    A VLAN tag is represented by an 802.1Q header. If the layer below the
    datapath interface uses another representation, then the datapath interface
    must perform conversion.

    The datapath interface requires all packets to fit within the MTU. Some
    operating systems internally process packets larger than MTU, with features
    such as TSO and UFO. When such a packet passes through the datapath
    interface, it must be broken into multiple packets, each no larger than the MTU, for
    presentation as upcalls. (This does not happen often, because an upcall
    typically contains the first packet of a flow, which is usually short.)

    Some operating system TCP/IP stacks maintain packets in an unchecksummed or
    partially checksummed state until transmission. The datapath interface
    requires all host-generated packets to be fully checksummed (e.g. IP and TCP
    checksums must be correct). On such an OS, the datapath interface must fill
    in these checksums.

    Packets passed through the datapath interface must be at least 14 bytes
    long, that is, they must have a complete Ethernet header. They are not
    required to be padded to the minimum Ethernet length.
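
    The following illustrative helpers capture two of the rules above: the
    14-byte minimum and the 802.1Q representation of a VLAN tag (TPID 0x8100
    immediately after the source MAC address).

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define ETH_HEADER_LEN 14

    /* Packets must carry at least a complete Ethernet header. */
    static bool
    packet_long_enough(size_t len)
    {
        return len >= ETH_HEADER_LEN;
    }

    /* A VLAN-tagged packet carries an 802.1Q header: TPID 0x8100 at bytes
     * 12-13, followed by the 2-byte TCI and then the original EtherType. */
    static bool
    packet_has_8021q_tag(const uint8_t *data, size_t len)
    {
        return len >= ETH_HEADER_LEN + 4
               && data[12] == 0x81 && data[13] == 0x00;
    }
    ```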


    Typical Usage
    =============

    Typically, the client of a datapath begins by configuring the datapath with
    a set of ports. Afterward, the client runs in a loop polling for upcalls to
    arrive.

    For each upcall received, the client examines the enclosed packet and
    figures out what should be done with it. For example, if the client
    implements a MAC-learning switch, then it searches the forwarding database
    for the packet's destination MAC and VLAN and determines the set of ports to
    which it should be sent. In any case, the client composes a set of datapath
    actions to properly dispatch the packet and then directs the datapath to
    execute those actions on the packet (e.g. with dpif_execute()).

    Most of the time, the actions that the client executed on the packet apply
    to every packet with the same flow. For example, the flow includes both
    destination MAC and VLAN ID (and much more), so this is true for the
    MAC-learning switch example above. In such a case, the client can also
    direct the datapath to treat any further packets in the flow in the same
    way, using dpif_flow_put() to add a new flow entry.
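
    Put together, the client's main loop looks roughly like the skeleton
    below. All of the helper names and types are placeholders for this sketch;
    in ovs-vswitchd the corresponding steps are performed with dpif_recv(),
    dpif_execute(), and dpif_flow_put() (see lib/dpif.h for their actual
    signatures).

    ```c
    #include <stdbool.h>

    struct dpif;
    struct upcall;                         /* Packet plus its extracted flow. */
    struct decision {                      /* What the client chose to do. */
        bool applies_to_whole_flow;
    };

    /* Placeholder helpers for this sketch. */
    bool receive_upcall(struct dpif *, struct upcall **);   /* cf. dpif_recv() */
    void wait_for_upcall(struct dpif *);               /* cf. dpif_recv_wait() */
    struct decision decide(const struct upcall *);     /* e.g. MAC learning. */
    void execute_on_packet(struct dpif *, const struct upcall *,
                           const struct decision *);   /* cf. dpif_execute() */
    void install_flow_entry(struct dpif *, const struct upcall *,
                            const struct decision *);  /* cf. dpif_flow_put() */

    static void
    client_loop(struct dpif *dpif)
    {
        for (;;) {
            struct upcall *upcall;

            if (!receive_upcall(dpif, &upcall)) {
                wait_for_upcall(dpif);
                continue;
            }

            struct decision d = decide(upcall);

            /* Dispatch the packet that triggered the upcall... */
            execute_on_packet(dpif, upcall, &d);

            /* ...and, when the same treatment applies to every packet in the
             * flow, install a flow entry so later packets stay in the
             * datapath. */
            if (d.applies_to_whole_flow) {
                install_flow_entry(dpif, upcall, &d);
            }
        }
    }
    ```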

    Other tasks the client might need to perform, in addition to reacting to
    upcalls, include:

    - Periodically polling flow statistics, perhaps to supply to its own
    clients.

    - Deleting flow entries from the datapath that haven't been used
    recently, to save memory.

    - Updating flow entries whose actions should change. For example, if a
    MAC learning switch learns that a MAC has moved, then it must update
    the actions of flow entries that sent packets to the MAC at its old
    location.

    - Adding and removing ports to achieve a new configuration.


    Thread-safety
    =============

    Most of the dpif functions are fully thread-safe: they may be called from
    any number of threads on the same or different dpif objects. The exceptions
    are:

    - dpif_port_poll() and dpif_port_poll_wait() are conditionally
    thread-safe: they may be called from different threads only on
    different dpif objects.

    - dpif_flow_dump_next() is conditionally thread-safe: It may be called
    from different threads with the same 'struct dpif_flow_dump', but all
    other parameters must be different for each thread.

    - dpif_flow_dump_done() is conditionally thread-safe: All threads that
    share the same 'struct dpif_flow_dump' must have finished using it.
    This function must then be called exactly once for a particular
    dpif_flow_dump to finish the corresponding flow dump operation.

    - Functions that operate on 'struct dpif_port_dump' are conditionally
    thread-safe with respect to those objects. That is, one may dump ports
    from any number of threads at once, but each thread must use its own
    struct dpif_port_dump.
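
    For example, port dumping with a thread-private 'struct dpif_port_dump'
    satisfies the rule above. This sketch uses dpif_port_dump_start(),
    dpif_port_dump_next(), and dpif_port_dump_done() as declared in
    lib/dpif.h; error handling is abbreviated.

    ```c
    #include <stdio.h>

    #include "dpif.h"

    /* Each calling thread uses its own 'dump' on its own stack, which is
     * exactly what the conditional thread-safety rule requires. */
    static void
    list_ports(const struct dpif *dpif)
    {
        struct dpif_port_dump dump;
        struct dpif_port port;

        dpif_port_dump_start(&dump, dpif);
        while (dpif_port_dump_next(&dump, &port)) {
            printf("port %s (type %s)\n", port.name, port.type);
        }
        dpif_port_dump_done(&dump);
    }
    ```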