eBPF Offload Native Mode XDP on Intel® Ethernet Linux Driver¶
Intel® ECI enables Linux* eXpress Data Path (XDP) Native mode (that is, XDP_FLAGS_DRV_MODE) on the Intel® Ethernet Linux drivers of several industrial-grade Ethernet controllers:

- [Ethernet PCI 8086:7aac and 8086:7aad] 12th Gen Intel® Core™ S-Series [Alder Lake] Ethernet GbE Time-Sensitive Network Controller
- [Ethernet PCI 8086:a0ac] 11th Gen Intel® Core™ U-Series and P-Series [Tiger Lake] Ethernet GbE Time-Sensitive Network Controller
- [Ethernet PCI 8086:4b32 and 8086:4ba0] Intel® Atom® x6000 Series [Elkhart Lake] Ethernet GbE Time-Sensitive Network Controller
- [Ethernet PCI 8086:15f2] Intel® Ethernet Controller I225-LM for Time-Sensitive Networking (TSN)
- [Ethernet PCI 8086:157b, 8086:1533, …] Intel® Ethernet Controller I210-T1 for Time-Sensitive Networking (TSN)
Install Linux BPFTool¶
Install from individual Deb package
Make sure that an ECI Linux intel kernel image package with the eBPF XDP features enabled is [installed,local]:

$ sudo apt search linux-image-intel
For example, a Debian* 11 (Bullseye) distribution set up with ECI Deb packages repository will list the following:
Sorting... Done
Full Text Search... Done
linux-image-intel/now 5.10.115-bullseye-r0 amd64 [installed,local]
  intel Linux kernel, version 5.10.115-intel-ese-standard-lts+
linux-image-intel-acrn-sos/now 5.10.115-bullseye-r0 amd64 [installed,local]
  intel-acrn-sos Linux kernel, version 5.10.115-linux-intel-acrn-sos+
linux-image-intel-acrn-sos-dbg/unknown 5.10.115-bullseye-1 amd64
  Linux kernel debugging symbols for 5.10.115-linux-intel-acrn-sos+
linux-image-intel-rt/now 5.10.115-rt67-bullseye-r0 amd64 [installed,local]
  intel-rt Linux kernel, version 5.10.115-rt67-intel-ese-standard-lts-rt+
linux-image-intel-rt-dbg/unknown 5.10.115-rt67-bullseye-1 amd64
  Linux kernel debugging symbols for 5.10.115-rt67-intel-ese-standard-lts-rt+
linux-image-intel-xenomai/now 5.10.100-bullseye-r0 amd64 [installed,local]
  intel-xenomai Linux kernel, version 5.10.100-intel-ese-standard-lts-dovetail+
linux-image-intel-xenomai-dbg/unknown 5.10.100-bullseye-1 amd64
  Linux kernel debugging symbols for 5.10.100-intel-ese-standard-lts-dovetail+
Otherwise, make sure that the running Linux distribution kernel matches the following configuration:
$ zcat /proc/config.gz | grep -e .*BPF.* -e .*XDP.*
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
# CONFIG_BPF_LSM is not set
CONFIG_BPF_SYSCALL=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_BPF_JIT_DEFAULT_ON=y
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
# CONFIG_BPF_PRELOAD is not set
CONFIG_XDP_SOCKETS=y
# CONFIG_XDP_SOCKETS_DIAG is not set
CONFIG_IPV6_SEG6_BPF=y
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_BPFILTER is not set
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
# CONFIG_BPF_STREAM_PARSER is not set
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
# CONFIG_TEST_BPF is not set
Install bpftool-5.1x, provided by ECI and corresponding to the exact Linux Intel tree, or install the Linux distribution mainline bpftool/stable:

$ sudo apt search bpftool

For example, a Debian 11 (Bullseye) distribution set up with the ECI Deb packages repository will list the following:

Sorting... Done
Full Text Search... Done
bpftool/stable 5.10.127-1 amd64
  Inspection and simple manipulation of BPF programs and maps
bpftool-5.10/unknown 5.10.100-bullseye-1 amd64
  Inspection and simple manipulation of BPF programs and maps
Note: Intel® ECI ensures that Linux XDP support always matches the Linux Intel LTS branches 2020/lts or 2021/lts (that is, no delta between kernel/bpf and tools/bpf).

The bpftool built from the Linux Intel 2020/lts tree is recommended for a Debian 11 (Bullseye) installation:

$ sudo apt install bpftool-5.10

Likewise, the bpftool built from the Linux Intel 2021/lts tree is recommended for an Ubuntu* 22.04 (Jammy) installation:

$ sudo apt install bpftool-5.15
Linux eXpress Data Path (XDP)¶
The Native mode XDP support in the Intel® Linux Ethernet drivers offers a standardized Linux API for low-overhead Ethernet Layer 2 packet processing (encoding, decoding, filtering, and so on) of industrial protocols such as UADP ETH, EtherCAT, or Profinet-RT/IRT, without any prior knowledge of the Intel® Ethernet Controller architecture.
Intel® ECI can leverage Native mode XDP through both the Linux eBPF program offload to XDP and the Traffic Control (TC) BPF classifier (cls_bpf) in industrial networking usage models.
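For reference, a cls_bpf classifier is written much like an XDP program, but it operates on struct __sk_buff and returns TC verdicts. The minimal sketch below is illustrative only (it is not part of the ECI samples and assumes the libbpf helper headers are available); it simply lets every frame pass:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int cls_pass_all(struct __sk_buff *skb)
{
    /* TC_ACT_OK lets the frame continue through the TC pipeline;
     * TC_ACT_SHOT would drop it instead. */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Once compiled to a BPF object, such a classifier is typically attached with tc, for example tc qdisc add dev <iface> clsact followed by tc filter add dev <iface> ingress bpf da obj <file>.o sec classifier.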
The Linux QDisc-based AF_PACKET socket presents performance limitations. The following table compares the two design approaches.
| | AF_PACKET Socket with QDisc | AF_XDP Socket/eBPF Offload |
| --- | --- | --- |
| Linux Network Stack (TCP/IP, UDP/IP) | Yes | BPF runtime program/library, idg_xdp_ring direct DMA |
| OSI Layer L4 (Protocol) to L7 (Application) | Yes | No |
| Number of net packets copied from kernel to user space | Several skb_data memcpy | None in UMEM zero-copy mode; a few in UMEM copy mode |
| IEEE 802.1Q-2018 Enhancements for Scheduled Traffic (EST), Frame Preemption (FPE) | Standardized API for hardware offload | Customized hardware offload |
| Deterministic Ethernet network cycle-time requirement | Moderate | Tight |
| Per-packet TxTime constraints | Yes, AF_PACKET SO_TXTIME cmsg | Yes, AF_XDP (xdp_desc txtime) |
| IEEE 802.1AS L2/PTP RX and TX hardware offload | Yes, L2/PTP | Yes, L2/PTP |
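To illustrate the per-packet TxTime row above: the AF_PACKET/QDisc approach conveys a launch time through the SO_TXTIME socket option plus an SCM_TXTIME control message per packet. The following sketch shows the generic kernel API only; it is not ECI-specific code, error handling is trimmed, and the caller is assumed to supply a socket, a destination, and an absolute CLOCK_TAI timestamp:

#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

static int send_at(int fd, const void *buf, size_t len,
                   const struct sockaddr *dst, socklen_t dlen,
                   uint64_t txtime_ns)
{
    struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };
    char control[CMSG_SPACE(sizeof(txtime_ns))] = {};
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = {
        .msg_name = (void *)dst, .msg_namelen = dlen,
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control, .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cm;

    /* Enable launch-time semantics on the socket (normally done once). */
    if (setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg)) < 0)
        return -1;

    /* Attach the absolute transmit time (nanoseconds, CLOCK_TAI) to this packet. */
    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_TXTIME;
    cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
    memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

    return sendmsg(fd, &msg, 0);
}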
The Intel® Ethernet Controller Linux drivers can handle the most commonly used XDP actions, allowing Ethernet L2-level packets traversing the networking stack to be reflected, filtered, or redirected with the lowest latency overhead (for example, DMA-accelerated transfer with limited memcpy (XDP_COPY) or without any memcpy (XDP_ZEROCOPY)):
eBPF programs classify/modify traffic and return XDP actions:
XDP_PASS
XDP_DROP
XDP_TX
XDP_REDIRECT
XDP_ABORT
Note: cls_bpf in Linux Traffic Control (TC) works in the same manner in kernel space.
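To make these return codes concrete, the following minimal sketch drops UDP frames destined to one port and passes everything else. It is not one of the ECI samples, the filtered destination port 9999 is an arbitrary placeholder, and IP options are ignored for brevity:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_udp_9999(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    struct udphdr *udp;

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_UDP)
        return XDP_PASS;

    udp = (void *)(iph + 1);               /* assumes no IP options */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))
        return XDP_DROP;                   /* frame discarded in the driver */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";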
The following table summarizes the eBPF offload Native mode XDP support available on the Intel® Linux Ethernet controllers.
| eBPF Offload v5.10.y/v5.15.y APIs | Intel® I210/igb.ko | Intel® GbE/stmmac.ko | Intel® I225-LM/igc.ko |
| --- | --- | --- | --- |
| XDP program features | | | |
| XDP_DROP | Yes | Yes | Yes |
| XDP_PASS | Yes | Yes | Yes |
| XDP_TX | Yes | Yes | Yes |
| XDP_REDIRECT | Yes | Yes | Yes |
| XDP_ABORTED | Yes | Yes | Yes |
| Packet read access | Yes | Yes | Yes |
| Conditional statements | Yes | Yes | Yes |
| xdp_adjust_head() | Yes | Yes | Yes |
| bpf_get_prandom_u32() | Yes | Yes | Yes |
| perf_event_output() | Yes | Yes | Yes |
| Partial offload | Yes | Yes | Yes |
| RSS rx_queue_index select | Yes | Yes | Yes |
| bpf_adjust_tail() | Yes | Yes | Yes |
| XDP maps features | | | |
| Offload ownership for maps | Yes | Yes | Yes |
| Hash maps | Yes | Yes | Yes |
| Array maps | Yes | Yes | Yes |
| bpf_map_lookup_elem() | Yes | Yes | Yes |
| bpf_map_update_elem() | Yes | Yes | Yes |
| bpf_map_delete_elem() | Yes | Yes | Yes |
| Atomic sync_fetch_and_add | untested | untested | untested |
| Map sharing between ports | untested | untested | untested |
| uarch optimization features | | | |
| Localized packet cache | untested | untested | untested |
| 32 bit BPF support | untested | untested | untested |
| Localized maps | untested | untested | untested |
xdpdump: A libbpf XDP API Example BPF Program¶
User space programs can interact with the offloaded program in the same way as with normal eBPF programs. The kernel will try to offload the program if a non-null ifindex is supplied to the bpf() Linux system call used for loading the program.
Maps can be accessed from user space using the standard eBPF map lookup and update commands.
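For instance, a libbpf-based user space tool can read back one entry with the generic map commands. This is a sketch only; the map name stats_map is a hypothetical placeholder rather than a map defined by the xdpdump sample:

#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <linux/types.h>
#include <stdio.h>

int read_counter(struct bpf_object *obj, __u32 key)
{
    /* Locate the map by the name given in its SEC("maps") definition. */
    struct bpf_map *map = bpf_object__find_map_by_name(obj, "stats_map");
    int fd = map ? bpf_map__fd(map) : -1;
    __u64 value = 0;

    if (fd < 0)
        return -1;

    /* Thin wrapper around the BPF_MAP_LOOKUP_ELEM command of bpf(2). */
    if (bpf_map_lookup_elem(fd, &key, &value))
        return -1;

    printf("key %u -> %llu packets\n", key, (unsigned long long)value);
    return 0;
}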
BPF helpers (declared in include/uapi/linux/bpf.h) are used to add functionality that would otherwise be difficult to implement:
Key XDP map helpers:
bpf_map_lookup_elem
bpf_map_update_elem
bpf_map_delete_elem
bpf_redirect_map
Head Extend:
bpf_xdp_adjust_head
bpf_xdp_adjust_meta
Others:
bpf_perf_event_output
bpf_ktime_get_ns
bpf_trace_printk
bpf_tail_call
bpf_redirect
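As a compact illustration of the map helpers on the kernel side, the following sketch (illustrative only, not one of the packaged samples) counts processed frames in an array map that user space can then read with the lookup command shown earlier:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") stats_map = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(__u32),
    .value_size = sizeof(__u64),
    .max_entries = 1,
};

SEC("xdp")
int count_frames(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *value;

    value = bpf_map_lookup_elem(&stats_map, &key);
    if (value)
        __sync_fetch_and_add(value, 1);   /* atomic add on the shared counter */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";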
This section provides the steps for creating a very basic IPv4/IPv6 UDP packet-processing eBPF program, leveraging only XDP_PASS and XDP_DROP actions:
Create the eBPF offload XDP <..._kern>.c program that:

Defines a SEC("maps") XDP event map of bpf_map_def type, so that user space can query the map object with bpf_map_lookup_elem() API calls, which are subsequently relayed to the igb driver. In this example, the eBPF program communicates with user space through the kernel's perf tracing events.
struct bpf_map_def SEC("maps") perf_map = {
    .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .key_size = sizeof(__u32),
    .value_size = sizeof(__u32),
    .max_entries = MAX_CPU,
};
Declares all sub-functions as static __always_inline:

static __always_inline bool parse_udp(void *data, __u64 off, void *data_end,
                                      struct pkt_meta *pkt)
{
    struct udphdr *udp;

    udp = data + off;
    if (udp + 1 > data_end)
        return false;

    pkt->port16[0] = udp->source;
    pkt->port16[1] = udp->dest;
    return true;
}
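The entry function shown next also calls parse_ip4() and parse_ip6(), which are not reproduced in this excerpt. A plausible parse_ip4() written in the same style (an assumption, not the sample's exact code) would be:

static __always_inline bool parse_ip4(void *data, __u64 off, void *data_end,
                                      struct pkt_meta *pkt)
{
    struct iphdr *iph;

    iph = data + off;
    if (iph + 1 > data_end)
        return false;

    /* Record addresses and the L4 protocol for later port parsing. */
    pkt->src = iph->saddr;
    pkt->dst = iph->daddr;
    pkt->l4_proto = iph->protocol;
    return true;
}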
Declares the SEC("xdp") program entry point, taking xdp_md *ctx as input and returning the appropriate XDP action as output (XDP_PASS in the following example). When ingress packets enter the XDP program, packet metadata is extracted and stored in a data structure. The XDP program sends this metadata, along with the packet contents, to the ring buffer referenced by the eBPF perf event map, using the current CPU index as the key.
1SEC("xdp") 2int process_packet(struct xdp_md *ctx) 3{ 4 void *data_end = (void *)(long)ctx->data_end; 5 void *data = (void *)(long)ctx->data; 6 struct ethhdr *eth = data; 7 struct pkt_meta pkt = {}; 8 __u32 off; 9 10 /* parse packet for IP Addresses and Ports */ 11 off = sizeof(struct ethhdr); 12 if (data + off > data_end) 13 return XDP_PASS; 14 15 pkt.l3_proto = bpf_htons(eth->h_proto); 16 17 if (pkt.l3_proto == ETH_P_IP) { 18 if (!parse_ip4(data, off, data_end, &pkt)) 19 return XDP_PASS; 20 off += sizeof(struct iphdr); 21 } else if (pkt.l3_proto == ETH_P_IPV6) { 22 if (!parse_ip6(data, off, data_end, &pkt)) 23 return XDP_PASS; 24 off += sizeof(struct ipv6hdr); 25 } 26 27 if (data + off > data_end) 28 return XDP_PASS; 29 30 /* obtain port numbers for UDP and TCP traffic */ 31 if (if (pkt.l4_proto == IPPROTO_UDP) { 32 if (!parse_udp(data, off, data_end, &pkt)) 33 return XDP_PASS; 34 off += sizeof(struct udphdr); 35 } else { 36 pkt.port16[0] = 0; 37 pkt.port16[1] = 0; 38 } 39 40 pkt.pkt_len = data_end - data; 41 pkt.data_len = data_end - data - off; 42 43 bpf_perf_event_output(ctx, &perf_map, 44 (__u64)pkt.pkt_len << 32 | BPF_F_CURRENT_CPU, 45 &pkt, sizeof(pkt)); 46 return XDP_PASS; 47}
Compile the XDP program into an eBPF object using the following commands on the build system or in a Yocto build recipe:

$ clang -O2 -S \
    -D __BPF_TRACING__ \
    -I$(LIBBPF_DIR)/root/usr/include/ \
    -Wall \
    -Wno-unused-value \
    -Wno-pointer-sign \
    -Wno-compare-distinct-pointer-types \
    -Werror \
    -emit-llvm -c -g <..._kern>.c -o <..._kern>.S
$ llc -march=bpf -filetype=obj -o <..._kern>.o <..._kern>.S
Create the <.._user.c> user space main() program that:

Initiates a bpf_prog_load_xattr() API call to load the LLVM-compiled <..._kern>.o eBPF XDP program, then attaches it in either XDP_FLAGS_SKB_MODE or XDP_FLAGS_DRV_MODE through the xdp_flags input parameter of the bpf_set_link_xdp_fd() API.
static void usage(const char *prog)
{
    fprintf(stderr,
        "%s -i interface [OPTS]\n\n"
        "OPTS:\n"
        " -h help\n"
        " -N Native Mode (XDPDRV)\n"
        " -S SKB Mode (XDPGENERIC)\n"
        " -x Show packet payload\n",
        prog);
}

int main(int argc, char **argv)
{
    static struct perf_event_mmap_page *mem_buf[MAX_CPU];
    struct bpf_prog_load_attr prog_load_attr = {
        .prog_type = BPF_PROG_TYPE_XDP,
        .file = "xdpdump_kern.o",
    };
    struct bpf_map *perf_map;
    struct bpf_object *obj;
    int sys_fds[MAX_CPU];
    int perf_map_fd;
    int prog_fd;
    int n_cpus;
    int opt;

    xdp_flags = XDP_FLAGS_DRV_MODE; /* default to DRV */
    n_cpus = get_nprocs();
    dump_payload = 0;

    if (optind == argc) {
        usage(basename(argv[0]));
        return -1;
    }

    while ((opt = getopt(argc, argv, "hi:NSx")) != -1) {
        switch (opt) {
        case 'h':
            usage(basename(argv[0]));
            return 0;
        case 'i':
            ifindex = if_nametoindex(optarg);
            break;
        case 'N':
            xdp_flags = XDP_FLAGS_DRV_MODE;
            break;
        case 'S':
            xdp_flags = XDP_FLAGS_SKB_MODE;
            break;
        case 'x':
            dump_payload = 1;
            break;
        default:
            printf("incorrect usage\n");
            usage(basename(argv[0]));
            return -1;
        }
    }

    if (ifindex == 0) {
        printf("error, invalid interface\n");
        return -1;
    }

    /* use libbpf to load program */
    if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) {
        printf("error with loading file\n");
        return -1;
    }

    if (prog_fd < 1) {
        printf("error creating prog_fd\n");
        return -1;
    }

    signal(SIGINT, unload_prog);
    signal(SIGTERM, unload_prog);

    /* use libbpf to link program to interface with corresponding flags */
    if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
        printf("error setting fd onto xdp\n");
        return -1;
    }

    perf_map = bpf_object__find_map_by_name(obj, "perf_map");
    perf_map_fd = bpf_map__fd(perf_map);

    if (perf_map_fd < 0) {
        printf("error cannot find map\n");
        return -1;
    }

    /* Initialize perf rings */
    if (setup_perf_poller(perf_map_fd, sys_fds, n_cpus, &mem_buf[0]))
        return -1;

    event_poller(mem_buf, sys_fds, n_cpus);

    return 0;
}
Registers the per-CPU perf event file descriptors into the perf events map, whose perf_map_fd file handle was retrieved with the bpf_object__find_map_by_name() API, using bpf_map_update_elem():
int setup_perf_poller(int perf_map_fd, int *sys_fds, int cpu_total,
                      struct perf_event_mmap_page **mem_buf)
{
    struct perf_event_attr attr = {
        .sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME,
        .type = PERF_TYPE_SOFTWARE,
        .config = PERF_COUNT_SW_BPF_OUTPUT,
        .wakeup_events = 1,
    };
    int mmap_size;
    int pmu;
    int n;

    mmap_size = getpagesize() * (PAGE_CNT + 1);

    for (n = 0; n < cpu_total; n++) {
        /* create perf fd for each thread */
        pmu = sys_perf_event_open(&attr, -1, n, -1, 0);
        if (pmu < 0) {
            printf("error setting up perf fd\n");
            return 1;
        }
        /* enable PERF events on the fd */
        ioctl(pmu, PERF_EVENT_IOC_ENABLE, 0);

        /* give fd a memory buf to write to */
        mem_buf[n] = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, pmu, 0);
        if (mem_buf[n] == MAP_FAILED) {
            printf("error creating mmap\n");
            return 1;
        }
        /* point eBPF map entries to fd */
        assert(!bpf_map_update_elem(perf_map_fd, &n, &pmu, BPF_ANY));
        sys_fds[n] = pmu;
    }
    return 0;
}
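The sys_perf_event_open() call above is a thin wrapper around the perf_event_open(2) syscall, for which glibc provides no wrapper. A typical implementation (assumed here, not copied from the sample) is:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int sys_perf_event_open(struct perf_event_attr *attr,
                               pid_t pid, int cpu, int group_fd,
                               unsigned long flags)
{
    /* No glibc wrapper exists for perf_event_open, so invoke it directly. */
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}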
Defines the event_poller() loop, which polls the perf event rings for perf_event_header events and their epoch timestamp. Incoming ring entries are read through the bpf_perf_event_read_simple() API, which dispatches each entry to the event_received() callback and, from there, to event_printer().
struct pkt_meta {
    union {
        __be32 src;
        __be32 srcv6[4];
    };
    union {
        __be32 dst;
        __be32 dstv6[4];
    };
    __u16 port16[2];
    __u16 l3_proto;
    __u16 l4_proto;
    __u16 data_len;
    __u16 pkt_len;
    __u32 seq;
};

struct perf_event_sample {
    struct perf_event_header header;
    __u64 timestamp;
    __u32 size;
    struct pkt_meta meta;
    __u8 pkt_data[64];
};

static enum bpf_perf_event_ret event_received(void *event, void *printfn)
{
    int (*print_fn)(struct perf_event_sample *) = printfn;
    struct perf_event_sample *sample = event;

    if (sample->header.type == PERF_RECORD_SAMPLE)
        return print_fn(sample);
    else
        return LIBBPF_PERF_EVENT_CONT;
}

int event_poller(struct perf_event_mmap_page **mem_buf, int *sys_fds,
                 int cpu_total)
{
    struct pollfd poll_fds[MAX_CPU];
    void *buf = NULL;
    size_t len = 0;
    int total_size;
    int pagesize;
    int res;
    int n;

    /* Create pollfd struct to contain poller info */
    for (n = 0; n < cpu_total; n++) {
        poll_fds[n].fd = sys_fds[n];
        poll_fds[n].events = POLLIN;
    }

    pagesize = getpagesize();
    total_size = PAGE_CNT * pagesize;
    for (;;) {
        /* Poll fds for events, 250ms timeout */
        poll(poll_fds, cpu_total, 250);

        for (n = 0; n < cpu_total; n++) {
            if (poll_fds[n].revents) { /* events found */
                res = bpf_perf_event_read_simple(mem_buf[n],
                                                 total_size,
                                                 pagesize,
                                                 &buf, &len,
                                                 event_received,
                                                 event_printer);
                if (res != LIBBPF_PERF_EVENT_CONT)
                    break;
            }
        }
    }
    free(buf);
}
In this example, when an event is received, the event_received() callback prints the perf event's metadata pkt_meta and epoch timestamp to the terminal. You can also specify whether the packet contents pkt_data should be dumped in hexadecimal format.

void meta_print(struct pkt_meta meta, __u64 timestamp)
{
    char src_str[INET6_ADDRSTRLEN];
    char dst_str[INET6_ADDRSTRLEN];
    char l3_str[32];
    char l4_str[32];

    switch (meta.l3_proto) {
    case ETH_P_IP:
        strcpy(l3_str, "IP");
        inet_ntop(AF_INET, &meta.src, src_str, INET_ADDRSTRLEN);
        inet_ntop(AF_INET, &meta.dst, dst_str, INET_ADDRSTRLEN);
        break;
    case ETH_P_IPV6:
        strcpy(l3_str, "IP6");
        inet_ntop(AF_INET6, &meta.srcv6, src_str, INET6_ADDRSTRLEN);
        inet_ntop(AF_INET6, &meta.dstv6, dst_str, INET6_ADDRSTRLEN);
        break;
    case ETH_P_ARP:
        strcpy(l3_str, "ARP");
        break;
    default:
        sprintf(l3_str, "%04x", meta.l3_proto);
    }

    switch (meta.l4_proto) {
    case IPPROTO_TCP:
        sprintf(l4_str, "TCP seq %d", ntohl(meta.seq));
        break;
    case IPPROTO_UDP:
        strcpy(l4_str, "UDP");
        break;
    case IPPROTO_ICMP:
        strcpy(l4_str, "ICMP");
        break;
    default:
        strcpy(l4_str, "");
    }

    printf("%lld.%06lld %s %s:%d > %s:%d %s, length %d\n",
           timestamp / NS_IN_SEC, (timestamp % NS_IN_SEC) / 1000,
           l3_str,
           src_str, ntohs(meta.port16[0]),
           dst_str, ntohs(meta.port16[1]),
           l4_str, meta.data_len);
}

int event_printer(struct perf_event_sample *sample)
{
    int i;

    meta_print(sample->meta, sample->timestamp);

    if (dump_payload) { /* print payload hex */
        printf("\t");
        for (i = 0; i < sample->meta.pkt_len; i++) {
            printf("%02x", sample->pkt_data[i]);

            if ((i + 1) % 16 == 0)
                printf("\n\t");
            else if ((i + 1) % 2 == 0)
                printf(" ");
        }
        printf("\n");
    }
    return LIBBPF_PERF_EVENT_CONT;
}
Compile and link the program against libbpf and libelf using the following GCC command:

gcc -lbpf -lelf -I$(LIBBPF_DIR)/root/usr/include/ -I../headers/ -L$(LIBBPF_DIR) -c <.._user>.c -o <.._user>
RX Receive Side Scaling (RSS) Queue

The Intel® Ethernet Controller Linux drivers (igb.ko, igc.ko, and stmmac-pci.ko) allow the eBPF program, via XDP, to leverage the Receive Side Scaling (RSS) queue feature to optimize the ingress traffic load.
In the following example, all received packets will be placed onto queue 1:
SEC("xdp") int process_packet(struct xdp_md *ctx) { ctx->rx_queue_index = 1; ... return XDP_PASS; }
AF_XDP Socket (CONFIG_XDP_SOCKETS)¶
An AF_XDP socket (XSK) is created with the normal socket()
system call. Two rings are associated with each XSK: the RX ring and the TX ring. A socket can receive packets on the RX ring and it can send packets on the TX ring. These rings are registered and sized with the setsockopts
- XDP_RX_RING
and XDP_TX_RING
, respectively. It is mandatory to have at least one of these rings for each socket. An RX or TX descriptor ring points to a data buffer in a memory area called a UMEM. RX and TX can share the same UMEM so that a packet does not have to be copied between RX and TX. Moreover, if a packet needs to be kept for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This avoids copying of data.

The kernel feature CONFIG_XDP_SOCKETS allows the Linux drivers igb.ko, igc.ko, and stmmac-pci.ko to offload packets to an eBPF XDP program that transfers them up to user space through AF_XDP sockets.
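For orientation, the sketch below shows how an AF_XDP socket is typically created with the xsk helper API shipped alongside libbpf for the v5.10/v5.15 kernels referenced here (the same API later moved to libxdp). The interface name, queue, sizes, and error handling are illustrative, and a real application keeps the ring structures around to produce and consume descriptors:

#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>
#include <linux/if_link.h>
#include <linux/if_xdp.h>

#define NUM_FRAMES 4096

static struct xsk_ring_prod fq, tx;   /* fill and TX rings */
static struct xsk_ring_cons cq, rx;   /* completion and RX rings */

int create_xsk(const char *ifname, __u32 queue_id,
               struct xsk_socket **xsk, struct xsk_umem **umem)
{
    struct xsk_socket_config cfg = {
        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .xdp_flags = XDP_FLAGS_DRV_MODE,   /* Native mode */
        .bind_flags = XDP_COPY,            /* or XDP_ZEROCOPY where supported */
    };
    __u64 size = NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
    void *buffer;

    /* UMEM: one page-aligned packet buffer area shared by RX and TX. */
    if (posix_memalign(&buffer, getpagesize(), size))
        return -1;
    if (xsk_umem__create(umem, buffer, size, &fq, &cq, NULL))
        return -1;

    /* Bind the RX/TX descriptor rings to <ifname>:<queue_id>. */
    return xsk_socket__create(xsk, ifname, queue_id, *umem, &rx, &tx, &cfg);
}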
You can install all officially supported Linux eBPF sample programs as Debian packages from the ECI APT repository.
Set up the ECI APT repository, then run either of the following commands to install this component:
- Install from meta-package
$ sudo apt install eci-realtime-benchmarking
Install from individual Deb package

For example, on Debian 11 (Bullseye), run the following command to install the linux-bpf-samples package from the Linux Intel 2020/lts tree:

$ sudo apt install linux-bpf-samples bpftool-5.10

Alternatively, on Ubuntu 22.04 (Jammy), run the following command to install the linux-bpf-samples package from the Linux Intel 2021/lts tree:

$ sudo apt install linux-bpf-samples bpftool-5.15

Note: From Linux v5.15 onward, the generated vmlinux.h header (CONFIG_DEBUG_INFO_BTF=y) is recommended for BPF programs to improve portability when libbpf enables "Compile once, run everywhere" (CO-RE). It contains all the type definitions that the running Linux Intel 2021/lts kernel uses in its own source code.

$ bpftool btf dump file /sys/kernel/btf/vmlinux format c > /tmp/vmlinux.h
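As a minimal illustration of the vmlinux.h workflow (a generic sketch, not one of the packaged samples), a CO-RE-style XDP program needs only the generated header plus the libbpf helper macros, and is compiled with clang -g -O2 -target bpf:

#include "vmlinux.h"            /* kernel types dumped from /sys/kernel/btf/vmlinux */
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    /* No packet inspection: let every frame continue up the stack. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";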
Usage
The Linux BPF sample program xdpsock provides a stable reference to understand and experiment with the AF_XDP
socket API.
$ xdpsock -h

Usage: xdpsock [OPTIONS]
Options:
-r, --rxdrop          Discard all incoming packets (default)
-t, --txonly          Only send packets
-l, --l2fwd           MAC swap L2 forwarding
-i, --interface=n     Run on interface n
-q, --queue=n         Use queue n (default 0)
-p, --poll            Use poll syscall
-S, --xdp-skb=n       Use XDP skb-mod
-N, --xdp-native=n    Enforce XDP native mode
-n, --interval=n      Specify statistics update interval (default 1 sec).
-z, --zero-copy       Force zero-copy mode.
-c, --copy            Force copy mode.
-m, --no-need-wakeup  Turn off use of driver need wakeup flag.
-f, --frame-size=n    Set the frame size (must be a power of two in aligned mode, default is 4096).
-u, --unaligned       Enable unaligned chunk placement
-M, --shared-umem     Enable XDP_SHARED_UMEM
-F, --force           Force loading the XDP prog
-d, --duration=n      Duration in secs to run command. Default: forever.
-b, --batch-size=n    Batch size for sending or receiving packets. Default: 64
-C, --tx-pkt-count=n  Number of packets to send. Default: Continuous packets.
-s, --tx-pkt-size=n   Transmit packet size. (Default: 64 bytes) Min size: 64, Max size 4096.
-P, --tx-pkt-pattern=n Packet fill pattern. Default: 0x12345678
-x, --extra-stats     Display extra statistics.
-Q, --quiet           Do not display any stats.
-a, --app-stats       Display application (syscall) statistics.
-I, --irq-string      Display driver interrupt statistics for interface associated with irq-string.
In the following example, the ingress traffic on queue 1 of the eth3 Ethernet device is handled by the Intel® Ethernet Controller driver in Native mode, exercising mostly the XDP_RX action:

$ xdpsock -i eth3 -q 1 -N

 sock0@eth3:1 rxdrop xdp-drv
                pps         pkts        1.01
rx              0           0
tx              0           0

 sock0@eth3:1 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0
Display the active eBPF xdp_sock_prog
program exposing the AF_XDP socket:
$ bpftool prog show

17: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
18: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
19: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
20: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
21: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
22: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2019-10-25T06:12:41+0000  uid 0
    xlated 64B  not jited  memlock 4096B
83: xdp  name xdp_sock_prog  tag c85daa2f1b3c395f  gpl
    loaded_at 2019-10-28T00:00:38+0000  uid 0
    xlated 176B  not jited  memlock 4096B  map_ids 45,46
Display the active xsk_map
kernel memory space allocated by the eBPF program:
$ bpftool map show

45: array  name qidconf_map  flags 0x0
    key 4B  value 4B  max_entries 1  memlock 4096B
46: xskmap  name xsks_map  flags 0x0
    key 4B  value 4B  max_entries 4  memlock 4096B
47: percpu_array  name rr_map  flags 0x0
    key 4B  value 4B  max_entries 1  memlock 4096B
BPF Compiler Collection (BCC)¶
BPF Compiler Collection (BCC) makes it easy to build and load BPF programs into the kernel directly from Python code. This can be used for XDP packet processing. For more details, refer to the BCC web site.
By using BCC from a container in ECI, you can develop and test BPF programs attached to TSN NICs directly on the target without the need for a separate build system.
The following BCC XDP Redirect example is derived from the kernel self test. It creates two namespaces with two veth
peers, and forwards packets in-between using generic XDP.
Create the veth devices and their peers in their respective namespaces:

$ ip netns add ns1
$ ip netns add ns2
$ ip link add veth1 index 111 type veth peer name veth11 netns ns1
$ ip link add veth2 index 222 type veth peer name veth22 netns ns2
$ ip link set veth1 up
$ ip link set veth2 up
$ ip -n ns1 link set dev veth11 up
$ ip -n ns2 link set dev veth22 up
$ ip -n ns1 addr add 10.1.1.11/24 dev veth11
$ ip -n ns2 addr add 10.1.1.22/24 dev veth22
Make sure that pinging from the veth peer in one namespace to the veth peer in the other namespace does not work in either direction without the XDP redirect:

$ ip netns exec ns1 ping -c 1 10.1.1.22
PING 10.1.1.22 (10.1.1.22): 56 data bytes

--- 10.1.1.22 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

$ ip netns exec ns2 ping -c 1 10.1.1.11
PING 10.1.1.11 (10.1.1.11): 56 data bytes

--- 10.1.1.11 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
In another terminal, run the BCC container:
$ docker run -it --rm \
    --name bcc \
    --privileged \
    --net=host \
    -v /lib/modules/$(uname -r)/build:/lib/modules/host-build:ro \
    bcc
Inside the container, create a new file xdp_redirect.py with the following content:

#!/usr/bin/python
from bcc import BPF
import time
import sys

b = BPF(text = """
#include <uapi/linux/bpf.h>

int xdp_redirect_to_111(struct xdp_md *xdp) {
    return bpf_redirect(111, 0);
}

int xdp_redirect_to_222(struct xdp_md *xdp) {
    return bpf_redirect(222, 0);
}
""", cflags=["-w"])

flags = (1 << 1)  # XDP_FLAGS_SKB_MODE
#flags = (1 << 2) # XDP_FLAGS_DRV_MODE

b.attach_xdp("veth1", b.load_func("xdp_redirect_to_222", BPF.XDP), flags)
b.attach_xdp("veth2", b.load_func("xdp_redirect_to_111", BPF.XDP), flags)

print("BPF programs loaded and redirecting packets, hit CTRL+C to stop")

while 1:
    try:
        time.sleep(1)
    except KeyboardInterrupt:
        print("Removing BPF programs")
        break

b.remove_xdp("veth1", flags)
b.remove_xdp("veth2", flags)
Run xdp_redirect.py to load the eBPF programs:

$ python3 xdp_redirect.py
BPF programs loaded and redirecting packets, hit CTRL+C to stop
In the first terminal, make sure that pinging from the veth peer in one namespace now succeeds:

$ ip netns exec ns1 ping -c 1 10.1.1.22

PING 10.1.1.22 (10.1.1.22): 56 data bytes
64 bytes from 10.1.1.22: seq=0 ttl=64 time=0.067 ms

--- 10.1.1.22 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.067/0.067/0.067 ms

In the second terminal, pinging the veth peer in the other namespace works as well, confirming that forwarding succeeds in both directions thanks to the XDP redirect:

$ ip netns exec ns2 ping -c 1 10.1.1.11

PING 10.1.1.11 (10.1.1.11): 56 data bytes
64 bytes from 10.1.1.11: seq=0 ttl=64 time=0.044 ms

--- 10.1.1.11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.044/0.044/0.044 ms
XDP Sanity Check¶
Sanity Check #1: Load and Execute eBPF Offload Program with “Generic mode” XDP¶
Install both the prebuilt xdpdump example and iPerf, the ultimate speed test tool for TCP, UDP, and SCTP:
$ sudo apt install xdpdump iperf3
Generate UDP traffic between talker and listener:
Set an L4-level UDP listener on the Ethernet device enp1s0 with an IPv4 address, for example 192.168.1.206:

$ ip addr add 192.168.1.206/24 brd 192.168.0.255 dev enp1s0
$ iperf3 -s

-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

Set an L4-level UDP talker with a 1448-byte payload size on another node's Ethernet device enp1s0 with a different IPv4 address, for example 192.168.1.203:

$ ip addr add 192.168.1.203/24 brd 192.168.0.255 dev enp1s0
$ iperf3 -c 192.168.1.206 -t 600 -b 0 -u -l 1448

Connecting to host 192.168.1.206, port 5201
[  5] local 192.168.1.203 port 36974 connected to 192.168.1.206 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec   114 MBytes   957 Mbits/sec  82590
[  5]   1.00-2.00   sec   114 MBytes   956 Mbits/sec  82540
Execute the precompiled BPF XDP program loaded on the device in “Generic mode” XDP (for example, XDP_FLAGS_SKB_MODE) with successful XDP_PASS and XDP_DROP actions:

$ cd /opt/xdp/bpf-samples
$ ./xdpdump -i enp1s0 -S

73717.493449 IP 192.168.1.203:36974 > 192.168.1.206:5201 UDP, length 1448
73717.493453 IP 192.168.1.203:36974 > 192.168.1.206:5201 UDP, length 1448
73717.493455 IP 192.168.1.203:36974 > 192.168.1.206:5201 UDP, length 1448
73717.493458 IP 192.168.1.203:36974 > 192.168.1.206:5201 UDP, length 1448
Check whether the active eBPF program is loaded on the Linux interface enp1s0 in “Generic mode” XDP (xdpgeneric) on the Intel® Ethernet Controller:

$ ip link show dev enp1s0

3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 9c:69:b4:61:82:73 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 18 tag a5fe55ab7ae19273
Press Ctrl + C to unload the XDP program:
^Cunloading xdp program...
Sanity Check #2: Load and Execute eBPF Offload program with “Native mode” XDP¶
Generate UDP traffic between talker and listener similar to sanity check #1.
Execute the precompiled BPF XDP program loaded on the device in “Native mode” XDP with successful XDP_PASS and XDP_DROP actions:

$ cd /opt/xdp/bpf-samples
$ ./xdpdump -i enp1s0 -N

673485.966036 0026 :0 > :0 , length 46
673486.626951 88cc :0 > :0 , length 236
673486.646948 TIMESYNC :0 > :0 , length 52
673487.646943 TIMESYNC :0 > :0 , length 52
...
673519.134258 IP 192.168.1.1:67 > 192.168.1.206:68 UDP, length 300
673519.136024 IP 192.168.1.1:67 > 192.168.1.206:68 UDP, length 300
673519.649768 TIMESYNC 192.168.1.1:0 > 192.168.1.206:0 , length 52
Check whether the active eBPF program is loaded on the Linux interface enp1s0 in “Native mode” XDP (XDP_FLAGS_DRV_MODE) on the Intel® Ethernet Controller:

$ ip link show dev enp1s0

3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 9c:69:b4:61:82:73 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 80 tag a5fe55ab7ae19273
Press Ctrl + C to unload the XDP program:
^Cunloading xdp program...
Sanity Check #3: Load and Execute “Native mode” AF_XDP Socket Default XDP_COPY¶
Set up two or more ECI nodes according to the AF_XDP Socket (CONFIG_XDP_SOCKETS) guidelines.
Execute an XDP_RX action from the precompiled BPF XDP program loaded on an igc.ko, igb.ko, or stmmac.ko Intel® Ethernet Controller Linux interface in Native mode with the default XDP_COPY:

$ xdpsock -i enp0s30f4 -q 0 -N -c

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.01
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0
^C
 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        0.47
rx              0           0
tx              0           0
Display the Ethernet interface's active eBPF program in “Native mode” with XDP_COPY:

$ ip link show dev enp0s30f4

3: enp0s30f4: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 88:ab:cd:11:01:23 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 29 tag 992d9ddc835e5629
Display the eBPF program actively exposing the AF_XDP socket:

$ bpftool prog show

...
29: xdp  tag 992d9ddc835e5629
    loaded_at 2020-03-13T16:14:38+0000  uid 0
    xlated 176B  not jited  memlock 4096B  map_ids 1
Display the active xsk_map kernel memory space allocated by the eBPF program:

$ bpftool map

1: xskmap  name xsks_map  flags 0x0
    key 4B  value 4B  max_entries 6  memlock 4096B
Press Ctrl + C to unmount the AF_XDP socket and unload the eBPF program.
Sanity Check #4: Load and Execute “Native mode” AF_XDP Socket Default XDP_ZEROCOPY¶
Set up two or more ECI nodes according to the AF_XDP Socket (CONFIG_XDP_SOCKETS) guidelines.
Execute an XDP_RX action from the precompiled BPF XDP program loaded on an igc.ko, igb.ko, or stmmac.ko Intel® Ethernet Controller Linux interface in Native mode with XDP_ZEROCOPY enabled:

$ xdpsock -i enp0s30f4 -q 0 -N -z

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.01
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0

 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        1.00
rx              0           0
tx              0           0
^C
 sock0@enp0s30f4:0 rxdrop xdp-drv
                pps         pkts        0.47
rx              0           0
tx              0           0
Display the Ethernet interface's active eBPF program in “Native mode” with XDP_ZEROCOPY:

$ ip link show dev enp0s30f4

3: enp0s30f4: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 88:ab:cd:11:01:23 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 30 tag 992d9ddc835e5629
Display the eBPF program actively exposing the AF_XDP socket:

$ bpftool prog show

30: xdp  tag 992d9ddc835e5629
    loaded_at 2020-03-13T16:04:37+0000  uid 0
    xlated 176B  not jited  memlock 4096B  map_ids 2
Display the active xsk_map kernel memory space allocated by the eBPF program:

$ bpftool map

2: xskmap  name xsks_map  flags 0x0
    key 4B  value 4B  max_entries 6  memlock 4096B
Press Ctrl + C to unmount the AF_XDP socket and unload the eBPF program.