
Linux OS Runtime Optimizations

To achieve real-time performance on a target system, certain runtime configurations and optimizations are recommended. This section establishes a basis for enabling real-time capable workloads.

CPU Isolation

When using the ECI Linux* Intel® LTS PREEMPT_RT kernel, all Linux kernel processes are scheduled to run on CPU 0, and CPUs 1 and 3 are configured to be isolated for real-time usage.

This has the side effect that workloads utilizing CPU 0 will experience degraded performance. Therefore, it is recommended to move all critical processes to a CPU other than CPU 0.

For reference, the following code snippet shows the default kernel boot parameters that affect CPUs:

nmi_watchdog=0
irqaffinity=0
isolcpus=1,3
rcu_nocbs=1,3
nohz_full=1,3
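
To confirm that these parameters are active on a running system, you can inspect the kernel command line; the output below assumes the default ECI configuration shown above:

$ tr ' ' '\n' < /proc/cmdline | grep -e irqaffinity -e isolcpus -e rcu_nocbs -e nohz_full
irqaffinity=0
isolcpus=1,3
rcu_nocbs=1,3
nohz_full=1,3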

Network Interrupts Affinity to CPU

When using the ECI Linux Intel LTS PREEMPT_RT kernel, all Ethernet network device MSI interrupts are scheduled to run on CPU 0.

This has the side effect that CPU 0 handles the interrupts (for example, the top-half and bottom-half handlers) of all Ethernet devices, which can degrade the performance of workloads running on CPU 0.

Therefore, it is recommended to selectively move critical Ethernet device interrupts for prioritized traffic classes onto a CPU other than CPU 0.
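
The helper script described next automates this remapping, but the underlying mechanism is a plain write to procfs. As a minimal sketch (the IRQ number 139 and target CPU are illustrative; the value written is a hexadecimal CPU bit mask, so CPU 1 corresponds to 0x2):

$ echo 2 > /proc/irq/139/smp_affinity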

Install Network-irq-affinity Tool

You can view the default mapping of device interrupts to CPUs via procfs at /proc/interrupts.

root@eci-intel-0474:~# cat /proc/interrupts | grep -e CPU. -e enp.s.
            CPU0       CPU1       CPU2       CPU3
127:          1          0          0          0   PCI-MSI 524288-edge      enp1s0
128:     362502          0          0          0   PCI-MSI 524289-edge      enp1s0-TxRx-0
129:     197962          0          0          0   PCI-MSI 524290-edge      enp1s0-TxRx-1
130:     176611          0          0          0   PCI-MSI 524291-edge      enp1s0-TxRx-2
131:     183731          0          0          0   PCI-MSI 524292-edge      enp1s0-TxRx-3
132:          7          0          0          0   PCI-MSI 2097152-edge      enp4s0
133:     487832          0          0          0   PCI-MSI 2097153-edge      enp4s0-TxRx-0
134:     174277          0          0          0   PCI-MSI 2097154-edge      enp4s0-TxRx-1
135:     164195          0          0          0   PCI-MSI 2097155-edge      enp4s0-TxRx-2
136:     164147          0          0          0   PCI-MSI 2097156-edge      enp4s0-TxRx-3
137:         53          0          0          0   PCI-MSI 2621440-edge      enp5s0
138:     166660          0          0          0   PCI-MSI 2621441-edge      enp5s0-TxRx-0
139:     174901          0          0          0   PCI-MSI 2621442-edge      enp5s0-TxRx-1
140:     164256          0          0          0   PCI-MSI 2621443-edge      enp5s0-TxRx-2
141:     164384          0          0          0   PCI-MSI 2621444-edge      enp5s0-TxRx-3

You can install the network-irq-affinity helper script from the ECI APT repository to remap Ethernet device interrupts to specific CPUs.

Tools                                                     Version        Source
--------------------------------------------------------  -------------  -------------------------------------------------
Ethernet Device MSI interrupt CPU mapping helper script   1.1+patchset   https://github.com/suominen/network-irq-affinity

Set up the ECI APT repository, then run the following command to install this component:

$ sudo apt install network-irq-affinity

The following are the possible arguments of the network-irq-affinity helper script:

   Usage:  network-irq-affinity [-hnv] -i <irqname>/<cpunum> -i <irqname>/<cpunum> ...

   Options:
   -h      Show this usage message.
   -i      tuple ethernet interface irq <irqname> to cpu core <cpunum>
   -n      Do not change anything, just show what would be done.
   -v      Verbose output of changes made.

Note

The order of the -i <irqname>/<cpunum> tuples matters: earlier tuples take precedence when assigning <irqname> affinity to CPU core <cpunum>.

Example Command

$ network-irq-affinity -i enp5s0-TxRx-1/1 -i enp4s0-TxRx-1/1 -i enp4s0-TxRx-*/2 -i enp5s0-TxRx-*/3 -v

In the above example, the -i <irqname>/<cpunum> tuple list first maps the enp5s0-TxRx-1 and enp4s0-TxRx-1 MSI interrupts to CPU 1, before the wildcard entries map the remaining MSI interrupts to CPU 2 for enp4s0 and to CPU 3 for enp5s0.

network-irq-affinity: Assigning enp5s0-TxRx-1 on IRQ 139 to CPU 1
network-irq-affinity: Assigning enp4s0-TxRx-1 on IRQ 134 to CPU 1
network-irq-affinity: Assigning enp4s0-TxRx-0 on IRQ 133 to CPU 2
network-irq-affinity: Assigning enp4s0-TxRx-2 on IRQ 135 to CPU 2
network-irq-affinity: Assigning enp4s0-TxRx-3 on IRQ 136 to CPU 2
network-irq-affinity: Assigning enp5s0-TxRx-0 on IRQ 138 to CPU 3
network-irq-affinity: Assigning enp5s0-TxRx-2 on IRQ 140 to CPU 3
network-irq-affinity: Assigning enp5s0-TxRx-3 on IRQ 141 to CPU 3

Verify that the Ethernet device interrupt remapping is effective with the following command:

root@eci-intel-0474:~# cat /proc/interrupts | grep -e CPU. -e enp.s.
            CPU0       CPU1       CPU2       CPU3
127:          1          0          0          0   PCI-MSI 524288-edge      enp1s0
128:     362759          0          0          0   PCI-MSI 524289-edge      enp1s0-TxRx-0
129:     198104          0          0          0   PCI-MSI 524290-edge      enp1s0-TxRx-1
130:     176736          0          0          0   PCI-MSI 524291-edge      enp1s0-TxRx-2
131:     183855          0          0          0   PCI-MSI 524292-edge      enp1s0-TxRx-3
132:          7          0          0          0   PCI-MSI 2097152-edge      enp4s0
133:     488175          0         27          0   PCI-MSI 2097153-edge      enp4s0-TxRx-0
134:     174399          8          0          0   PCI-MSI 2097154-edge      enp4s0-TxRx-1
135:     164310          0          8          0   PCI-MSI 2097155-edge      enp4s0-TxRx-2
136:     164262          0          8          0   PCI-MSI 2097156-edge      enp4s0-TxRx-3
137:         53          0          0          0   PCI-MSI 2621440-edge      enp5s0
138:     166775          0          6          2   PCI-MSI 2621441-edge      enp5s0-TxRx-0
139:     175020          8          0          0   PCI-MSI 2621442-edge      enp5s0-TxRx-1
140:     164371          0          6          2   PCI-MSI 2621443-edge      enp5s0-TxRx-2
141:     164499          0          6          2   PCI-MSI 2621444-edge      enp5s0-TxRx-3

Best Practices for Achieving Real-time Performance

Eliminate Sources of CPU Contention

To achieve real-time performance, it is imperative to isolate real-time workloads from other tasks. This can be achieved by using a real-time kernel and modifying the kernel boot parameters. ECI provides a Deb package named customizations-grub, which modifies the kernel boot parameters upon installation. The ECI targets core-bullseye and core-jammy have the customizations-grub package installed by default. Refer to Install ECI Deb Packages to learn how to install ECI Deb packages.

The ECI Deb package customizations-grub modifies the kernel boot parameters such that CPUs 1 and 3 are isolated, and CPU 0 is reserved to handle Linux kernel interrupts. This configuration allows the use of CPUs 1 and 3 for real-time workloads.

Important

This has the side effect that any workloads which utilize CPU 0 will experience degraded performance. Therefore, it is recommended to move all critical processes to a CPU other than CPU 0.

For reference, the following snippet shows the default kernel boot parameters which affect CPUs:

nmi_watchdog=0
irqaffinity=0
isolcpus=1,3
rcu_nocbs=1,3
nohz_full=1,3

See also

For a list of ECI kernel boot optimizations, refer to ECI Kernel Boot Optimizations.

For best performance, assign only a single isolated CPU per real-time workload. The following example launches the workload with its affinity set to CPU 3 (where <workload> is replaced with the application to run):

$ taskset -c 3 <workload>

To assign affinity of all the child tasks of a parent workload, run the following command (where <workload> is the name of your workload):

$ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 taskset -pac 3 > /dev/null
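
To verify where the tasks landed, you can list each thread together with the processor it last ran on (the PSR column); filtering by <workload> follows the same pattern as above:

$ ps -eLo tid,psr,comm | grep -i <workload>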

Prioritize Workloads

A simple and effective method to boost the performance of a real-time workload is to increase its runtime priority. Use the following command to run a workload with increased runtime priority (where <workload> is the name of your application):

$ chrt -f 1 <workload>

To assign priority of all the child tasks of a parent workload, run the following (where <workload> is the name of your workload):

$ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 chrt -p -f 1
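
To confirm the scheduling policy and priority of a running task (where <pid> is a task ID, for example from the previous command), a quick check looks like this:

$ chrt -p <pid>
pid <pid>'s current scheduling policy: SCHED_FIFO
pid <pid>'s current scheduling priority: 1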

Use Cache Allocation Technology

Shared last-level caches are common on modern processors. For example, on an Intel® Core™ processor or Intel® Xeon® processor, the cores share an L3 cache, whereas on an Intel Atom® processor, cores 0 and 1 share one L2 cache and cores 2 and 3 share another. As a result, workloads on adjacent cores can cause cache misses for each other: one workload evicts a cache line in use by another. When a cache miss occurs, the workload must wait while the memory is fetched. This introduces undesired jitter into the workload execution time, subsequently impacting determinism. To mitigate this issue, Intel Cache Allocation Technology (CAT) provides a method to partition processor caches and assign these partitions to a Class-of-Service (COS). Associating workloads with different COS can effectively isolate the portion of the cache available to each workload, thus preventing cache contention altogether. See Cache Allocation Technology for more information on using CAT.
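
Before partitioning, you can confirm which allocation capabilities the processor exposes; the exact output varies by platform:

$ pqos -d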

By default, ECI images pin all Linux kernel tasks to core 0 and isolate cores 1 and 3. Under these conditions, it is advantageous to allocate the CPU cache such that the Linux kernel tasks never share cache with any tasks running on the isolated cores. To achieve this result, perform the following steps:

Recommended CAT configuration for Intel Core™ or Xeon® processors

The following example sets the L3 cache mask for core 0 to 0x0f, and the L3 cache mask for cores 1 and 3 to 0xf0.

Attention

This example is best suited for Intel® Core™ or Intel® Xeon® processors, on which the cores share a last-level L3 cache. For Intel Atom® processors, see the subsequent example.

  1. Reset cache allocation to default state.

    pqos -R
    
  2. Define the allocation classes for the last-level cache (LLC). Class 0 is allocated exclusive access to the first half of the LLC. Class 1 is allocated exclusive access to the second half of the LLC.

    pqos -e 'llc:0=0x0f;llc:1=0xf0'
    
  3. Associate core 0 with class 0, and cores 1 and 3 with class 1.

    pqos -a 'llc:0=0;llc:1=1,3'
    

Recommended CAT configuration for Intel Atom® processors

The following example sets the L2 cache mask for cores 0 and 2 to 0x0f, and the L2 cache mask for cores 1 and 3 to 0xf0.

Attention

This example is best suited for Intel Atom® processors, on which the last-level cache shared between cores is the L2 cache. For Intel® Core™ or Intel® Xeon® processors, see the previous example.

$ pqos -R
$ pqos -e 'l2:0=0x0f;l2:1=0xf0'
$ pqos -a 'llc:0=0,2;llc:1=1,3'
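
In either configuration, you can verify the resulting class definitions and core associations; the output format varies by pqos version:

$ pqos -s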

Stop Unnecessary Services

Many services run in the background by default on Linux. Stopping services may reduce spurious interrupts, depending on the workload type. To list the loaded services, run the following command:

$ systemctl -t service
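
To narrow the list to services that are actively running:

$ systemctl list-units --type=service --state=running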

To stop a service, run the following command (where <service> is the name of a service):

Warning

Stopping system services can be detrimental to the stability of the Linux system. Be sure you understand the implications before stopping a service.

$ systemctl stop <service>
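
Stopping a service affects only the current boot. If the service should also not start on subsequent boots, disable it as well:

$ systemctl disable <service>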

Disable Machine Checks

By default, the Linux kernel periodically scans hardware for reported errors. While this feature can be useful for tracking down troublesome bugs, it also presents a source of workload preemption. Disabling this check improves real-time performance of workloads by preventing the Linux kernel from interrupting the running tasks. Run the following command to disable machine checks:

$ echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
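
The check interval is a global setting exposed through per-CPU machinecheck directories, so reading it back from any of them should report 0:

$ cat /sys/devices/system/machinecheck/machinecheck0/check_interval
0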

Increase Thread Runtime Limit

The default values for the real-time throttling mechanism allot 95% of CPU time to real-time tasks. The remaining 5% is devoted to non-realtime tasks (tasks running under SCHED_OTHER and similar scheduling policies). Note that if a single real-time task occupies that 95% CPU time slot, the remaining real-time tasks on that CPU will not run; the remaining 5% of CPU time is used only by non-realtime tasks.
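
These percentages come from two kernel tunables, a period and a runtime budget within that period. The values shown below are the usual defaults (950000 of 1000000 microseconds, that is, 95%):

$ cat /proc/sys/kernel/sched_rt_period_us
1000000
$ cat /proc/sys/kernel/sched_rt_runtime_us
950000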

The impact of the default values is two-fold: on the one hand, rogue real-time tasks cannot lock up the system by starving non-realtime tasks; on the other hand, real-time tasks have at most 95% of CPU time available to them, which may affect their performance.

If a particular workload is known to be stable while consuming 100% of a CPU, the runtime limit can be set to infinity.

Warning

Configuring the runtime limit in this way can potentially lock a system if the workload in question contains unbounded polling loops. Use this configuration with caution.

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
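
This change does not survive a reboot. To make it persistent, the equivalent sysctl entry can be added to the system configuration (shown here appended to /etc/sysctl.conf, the conventional location):

$ echo 'kernel.sched_rt_runtime_us = -1' | sudo tee -a /etc/sysctl.conf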

Typical Workload Optimization Flow

When executing a workload, complete the following steps to increase real-time performance of the workload:

  1. Stop unnecessary services. In this example, we stop services related to wireless communication:

    $ systemctl stop ofono
    $ systemctl stop wpa_supplicant
    $ systemctl stop bluetooth
    
  2. Stop the Docker* daemon (if containers are not used):

    $ systemctl stop docker
    
  3. Disable kernel machine check interrupt:

    $ echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
    
  4. Disable thread runtime limit:

    $ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
    
  5. Set up cache partitioning. See Cache Allocation Technology for information on using CAT. This example sets the L2 cache mask for cores 0 and 2 to 0x0f, and the L2 cache mask for cores 1 and 3 to 0xf0.

    $ pqos -R
    $ pqos -e 'l2:0=0x0f;l2:1=0xf0'
    $ pqos -a 'llc:0=0,2;llc:1=1,3'
    
  6. Assign the affinity of all device interrupts to core 0. The following script iterates through all interrupts and attempts to assign their affinity to core 0, skipping the timer and cascade interrupts:

    #!/bin/bash
    # Route every listed IRQ to core 0 (affinity mask 0x1).
    for i in $(grep '^ *[0-9]*[0-9]:' /proc/interrupts | awk '{print $1}' | sed 's/:$//'); do
        # Skip the timer interrupt (IRQ 0)
        if [ "$i" = "0" ]; then
            continue
        fi
        # Skip the cascade interrupt (IRQ 2)
        if [ "$i" = "2" ]; then
            continue
        fi
        echo "setting IRQ $i affinity to core 0"
        echo 1 > /proc/irq/$i/smp_affinity
    done
    
  7. Offload RCU tasks:

    $ for i in $(pgrep rcu); do taskset -pc 0 $i > /dev/null ; done
    
  8. Start the real-time workload (where <workload> is the name of your workload):

    $ ./<workload>
    
  9. Change affinity of workload tasks (where <workload> is the name of your workload). This example assigns affinity of all workload tasks to core 3.

    $ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 taskset -pac 3 > /dev/null
    
  10. Change priority of workload tasks to be real-time (where <workload> is the name of your workload):

    $ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 chrt -p -f 1
    
  11. Minimize integrated GPU utilization on the processor. Rather than connecting a display monitor to the target system, use SSH to access the system. This minimizes the interrupts generated by the integrated GPU, thus improving system determinism.