Real-Time Linux

The Embodied Intelligence SDK adds real-time capabilities to the Linux kernel through the PREEMPT_RT patch and boot parameters tuned for real-time operation. Together, these aim to increase predictability and reduce scheduler latency.

Installation

  1. Install GRUB customizations

$ sudo apt install -y customizations-grub
  2. Install linux-firmware

$ sudo apt install -y linux-firmware

Note: Linux 6.12 requires specific i915 GuC/DMC/GSC firmware. These firmware files are installed to a separate location, /lib/firmware/i915/experimental/. After the next reboot, confirm that the following boot parameters appear in the output of cat /proc/cmdline:

i915.guc_firmware_path=i915/experimental/mtl_guc_70.bin i915.dmc_firmware_path=i915/experimental/mtl_dmc.bin i915.gsc_firmware_path=i915/experimental/mtl_gsc_1.bin
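The check described in the note can be scripted. The following is a minimal sketch, not part of the SDK: the helper names has_param and check_i915_firmware are hypothetical, and the parameter values are copied from the note above.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole,
# space-separated word in CMDLINE. (Hypothetical helper name.)
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_i915_firmware CMDLINE: report each expected i915 firmware
# parameter and return non-zero if any of them is missing.
check_i915_firmware() {
    missing=0
    for p in \
        i915.guc_firmware_path=i915/experimental/mtl_guc_70.bin \
        i915.dmc_firmware_path=i915/experimental/mtl_dmc.bin \
        i915.gsc_firmware_path=i915/experimental/mtl_gsc_1.bin
    do
        if has_param "$1" "$p"; then
            echo "found:   $p"
        else
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}
```

After the reboot, it could be called as check_i915_firmware "$(cat /proc/cmdline)".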

If the i915 GuC/DMC/GSC firmware is not present in /lib/firmware/i915/experimental/, install the latest linux-firmware package with the following command:

$ sudo apt install -y linux-firmware=20220329.git681281e4-0ubuntu3.36-intel-iotg.eci8

To confirm that the correct linux-firmware version is in use, run:

$ sudo apt-cache policy linux-firmware

The expected result is:

linux-firmware:
  Installed: 20220329.git681281e4-0ubuntu3.36-intel-iotg.eci8
  3. Install the real-time Linux kernel. For more details, refer to LinuxBSP.

$ sudo apt install -y linux-intel-rt-experimental

Note: If you do not need the RT kernel, install the generic kernel with the following command instead:

$ sudo apt install -y linux-intel-experimental
  4. To modify the default boot parameters, edit /etc/grub.d/10_eci_experimental.

Note: For better real-time performance and power consumption, modify eci_cmdline_exp in /etc/grub.d/10_eci_experimental as follows:

# Modify the default cmdline parameters to enable C-states and P-states
$ sudo sed -i 's/intel_pstate=disable intel.max_cstate=0 intel_idle.max_cstate=0 processor.max_cstate=0 processor_idle.max_cstate=0/intel_pstate=enable/g' /etc/grub.d/10_eci_experimental
# Modify the default cmdline parameter to set IRQ affinity to cores 0-9
$ sudo sed -i 's/irqaffinity=0 /irqaffinity=0-9 /g' /etc/grub.d/10_eci_experimental
# Modify the default cmdline parameters to isolate CPU cores 10-13
$ sudo sed -i 's/isolcpus=${isolcpus} rcu_nocbs=${isolcpus} nohz_full=${isolcpus}/isolcpus=10-13 rcu_nocbs=10-13 nohz_full=10-13/g' /etc/grub.d/10_eci_experimental
$ sudo update-grub

The following command line parameters are used for real-time optimization. You can modify them according to your requirements:

  • isolcpus: Isolates specified CPU cores from the generic scheduler, dedicating them to real-time tasks.

  • rcu_nocbs: Prevents specified CPU cores from handling RCU (Read-Copy-Update) callbacks, reducing latency.

  • nohz_full: Enables full dynamic ticks on specified CPU cores, reducing timer interrupts.

  • irqaffinity: Directs all hardware interrupts to the specified CPU cores, keeping the remaining (isolated) cores free for real-time tasks.
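Because irqaffinity should name cores other than the isolated ones, it can be useful to check that the two CPU lists do not overlap before running update-grub. The following is a minimal sketch under stated assumptions: the helper names expand_cpulist and cpulists_overlap are hypothetical, and the lists are assumed to be plain numeric lists such as "0-9" or "0-3,8" (no isolcpus flags).

```shell
#!/bin/sh
# expand_cpulist LIST: print one CPU index per line for a list such as "0-3,8".
# Assumes plain numeric entries only. (Hypothetical helper name.)
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        i=$first
        # A single index like "8" has no upper bound, so fall back to $first.
        while [ "$i" -le "${last:-$first}" ]; do
            echo "$i"
            i=$((i + 1))
        done
    done
}

# cpulists_overlap A B: succeed if any CPU index appears in both lists.
cpulists_overlap() {
    for a in $(expand_cpulist "$1"); do
        for b in $(expand_cpulist "$2"); do
            [ "$a" -eq "$b" ] && return 0
        done
    done
    return 1
}

# Values from the example above: irqaffinity=0-9, isolcpus=10-13.
if cpulists_overlap "0-9" "10-13"; then
    echo "warning: irqaffinity and isolcpus overlap"
else
    echo "ok: irqaffinity and isolcpus are disjoint"
fi
```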

  5. After reboot, make sure to select [Experimental] ECI Ubuntu in the boot menu.

../../../_images/eci_grub.png

Note: Select Advanced Options for [Experimental] ECI Ubuntu to list the available kernels: [Experimental] ECI Ubuntu, with Linux 6.12.8-intel-ese-experimental-lts-rt for the real-time kernel, or [Experimental] ECI Ubuntu, with Linux 6.12.8-intel-ese-experimental-lts for the generic kernel.

../../../_images/kernel_select.png

Real-time Runtime Optimization

To achieve real-time performance on a target system, specific runtime configurations and optimizations are recommended. This section provides a foundation for enabling real-time capable workloads.

../../../_images/arl_rt_setup.png

Use Cache Allocation Technology

Intel® Cache Allocation Technology (CAT) enables partitioning of caches at various levels within the caching hierarchy, providing a straightforward method to enhance temporal isolation between real-time and best-effort workloads.

The configuration shown in this section is an example and should be tailored to your specific use case and processor. To determine the cache topology, including the size and number of ways supported by a processor, use the CPUID leaf “Deterministic Cache Parameters Leaf - 0x4”. Linux utilities like lstopo are also useful for obtaining an overview of a processor’s cache topology.

For more information about CAT, refer to the following resources:

  • Public Intel® Time Coordinated Computing (TCC) User Guide - RDC #[831067]

  • Intel® Resource Director Technology (Intel® RDT) Architecture Specification - RDC #[789566]

  • Intel® 64 and IA-32 Architectures Software Developer’s Manual - RDC #[671200]

Below is an example script that partitions the Last Level Cache (LLC) and L2 cache, assigning an exclusive portion to real-time tasks. Make sure msr-tools is installed, and adapt the script to your configuration before testing it:

(For example, with core 13 as the isolated core.)

#!/bin/sh
# define LLC Core Masks
wrmsr 0xc90 0x3f          # best effort mask
wrmsr 0xc91 0xfc0         # real-time mask

# define E-core L2 Core Mask
wrmsr -p10 0xd10 0xff     # best effort mask
wrmsr -p11 0xd10 0xff     # best effort mask
wrmsr -p12 0xd10 0xff     # best effort mask
wrmsr -p13 0xd11 0xff00   # real-time mask

# assign the masks to the cores
# This has to match with the core selected for the real-time task
wrmsr -p13 0xc8f 0x100000000
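The mask values in the script above are contiguous bit ranges, one bit per cache way, and the value written to 0xc8f (IA32_PQR_ASSOC) carries the class of service in bits 63:32. The following is a minimal sketch for deriving such values when adapting the example to another processor; the helper names way_mask and pqr_assoc are hypothetical, and the way counts are simply those implied by the example masks.

```shell
#!/bin/sh
# way_mask FIRST_WAY COUNT: print a contiguous capacity bitmask in hex,
# covering COUNT cache ways starting at FIRST_WAY. (Hypothetical helper.)
way_mask() {
    printf '0x%x\n' $(( ((1 << $2) - 1) << $1 ))
}

# pqr_assoc COS: print the IA32_PQR_ASSOC value that selects class of
# service COS (held in bits 63:32 of the register). (Hypothetical helper.)
pqr_assoc() {
    printf '0x%x\n' $(( $1 << 32 ))
}

way_mask 0 6    # 0x3f         LLC ways 0-5  (best effort)
way_mask 6 6    # 0xfc0        LLC ways 6-11 (real-time)
way_mask 0 8    # 0xff         L2 ways 0-7   (best effort)
way_mask 8 8    # 0xff00       L2 ways 8-15  (real-time)
pqr_assoc 1     # 0x100000000  class of service 1
```

The helpers only print values; they do not write any MSR.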

Use Dynamic Voltage and Frequency

Dynamic Voltage and Frequency Scaling (DVFS) features, such as Intel® Speed Step, Speed Shift, and Turbo Boost Technology, allow processors to adjust voltage and frequency within P-States to balance power efficiency and performance. Speed Step and Speed Shift manage these adjustments, while Turbo Boost temporarily exceeds the highest P-State for additional performance during demanding tasks.

To enhance single-thread performance, boost the frequency of the real-time core within the turbo frequency range. For real-time requirements, you can lock the core frequency during runtime using HWP MSRs or the intel_pstate driver in Linux. Locking the core frequency of the real-time application to a turbo frequency and limiting the maximum frequency of best-effort (BE) cores to the base frequency, as guided by the TCC User Guide, results in reduced execution time jitter and significantly lower execution time.

For more information on accessing HWP MSRs directly instead of using the sysfs entries of the intel_pstate driver, refer to the [TCC User Guide] and the Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol3, section “Power and Thermal Management - Hardware Controlled Performance States” - RDC #[671200].

Attention

Setting even just a few cores to a higher, fixed frequency does not come without a cost. Due to higher internal frequency, voltages, and subsequent higher temperature and power, such settings will negatively impact the reliability expectations of the CPU and should be used with careful consideration.

Below is an example that boosts the real-time core to 3 GHz, with the Energy Performance Preference (EPP) set to performance to ensure Quality of Service (QoS) in case of power limit throttling:

(For example, with core 13 as the isolated core on Intel® Core™ Ultra Processors 255H.)

  • (Option 1): Using the sysfs entries of the intel_pstate driver

#!/bin/sh
# Run as root. Lock the min and max frequencies of core 13 to a specific turbo frequency (3 GHz).
echo performance >  /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
echo 3000000 >  /sys/devices/system/cpu/cpu13/cpufreq/scaling_max_freq
echo 3000000 >  /sys/devices/system/cpu/cpu13/cpufreq/scaling_min_freq
  • (Option 2): Using msr-tools to modify IA32_HWP_REQUEST (0x774) to set a specific core frequency.

Note: For details on IA32_HWP_REQUEST, refer to the Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol3, section “Power and Thermal Management - Hardware Controlled Performance States” - RDC #[671200].

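#!/bin/sh
# IA32_HWP_REQUEST fields: bits 7:0 minimum performance, bits 15:8 maximum performance,
# bits 23:16 desired performance, bits 31:24 Energy Performance Preference (EPP).
# Best-effort cores 0-12: capped maximum performance, EPP 0x80
wrmsr -p 0 0x774 0x80005201
wrmsr -p 1 0x774 0x80005201
wrmsr -p 2 0x774 0x80005201
wrmsr -p 3 0x774 0x80005201
wrmsr -p 4 0x774 0x80005201
wrmsr -p 5 0x774 0x80005201
wrmsr -p 6 0x774 0x80003e01
wrmsr -p 7 0x774 0x80003e01
wrmsr -p 8 0x774 0x80003e01
wrmsr -p 9 0x774 0x80003e01
wrmsr -p 10 0x774 0x80003e01
wrmsr -p 11 0x774 0x80003e01
wrmsr -p 12 0x774 0x80003e01
# Real-time core 13: minimum = maximum = 0x2a, EPP 0x00 (performance)
wrmsr -p 13 0x774 0x00002a2a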
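The values written to IA32_HWP_REQUEST pack four 8-bit fields: minimum performance (bits 7:0), maximum performance (bits 15:8), desired performance (bits 23:16), and EPP (bits 31:24). The following is a minimal sketch for composing such values when adapting the example; the helper name hwp_request is hypothetical, and the field values are the ones used in the script above.

```shell
#!/bin/sh
# hwp_request MIN MAX DESIRED EPP: print the 32-bit IA32_HWP_REQUEST value
# built from the four 8-bit fields. (Hypothetical helper name.)
hwp_request() {
    printf '0x%08x\n' $(( ($1 & 0xff) | (($2 & 0xff) << 8) | (($3 & 0xff) << 16) | (($4 & 0xff) << 24) ))
}

hwp_request 0x01 0x52 0 0x80   # 0x80005201  cores 0-5 in the example
hwp_request 0x01 0x3e 0 0x80   # 0x80003e01  cores 6-12 in the example
hwp_request 0x2a 0x2a 0 0x00   # 0x00002a2a  real-time core 13
```

The helper only prints the value; writing it still requires wrmsr from msr-tools.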

Per-core C-State Disable

With the BIOS and Linux boot parameter optimizations for real-time performance described in OS Setup, Intel C-states and P-states are enabled. This costs more power but improves GPU AI performance. However, C-states can introduce jitter on isolated cores because the time required to transition between states varies. Disabling C-states per core helps minimize this jitter, providing a more stable environment for real-time tasks.

Use the following script to disable C-states on the isolated cores:

(For example, with core 13 as the isolated core.)

#!/bin/bash
# Disable all C-states except C0 on the isolated CPU cores.
# Note: the (( )) loops below require bash, not plain sh.
# Define the range for CPU indices
cpu_start=13  # Replace with your starting CPU index
cpu_end=13    # Replace with your ending CPU index

# Loop over each CPU index
for (( i=cpu_start; i<=cpu_end; i++ )); do
    # Determine the maximum state index for the current CPU
    max_state_index=$(ls /sys/devices/system/cpu/cpu$i/cpuidle/ | grep -o 'state[0-9]*' | sed 's/state//' | sort -n | tail -1)

    # Loop over each state index (state0 is left enabled)
    for (( j=1; j<=max_state_index; j++ )); do
        # Disable the current state. The write goes through "sudo tee" because
        # a shell redirection after "sudo echo" would not run with root privileges.
        echo 1 | sudo tee /sys/devices/system/cpu/cpu$i/cpuidle/state$j/disable > /dev/null
        echo "Disabled CPU $i state $j"
    done
done
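To confirm the result, the disable flag of each idle state can be read back from sysfs. The following is a minimal sketch: the helper name report_cstates is hypothetical, and the optional second argument (an alternative sysfs root) exists only so the function can be tried against a test directory.

```shell
#!/bin/sh
# report_cstates CPU [SYSFS_ROOT]: print the name and disable flag of each
# cpuidle state of the given CPU. (Hypothetical helper name.)
report_cstates() {
    root=${2:-/sys/devices/system/cpu}
    for d in "$root/cpu$1"/cpuidle/state*; do
        # Skip entries without a readable disable file (or an unmatched glob).
        [ -r "$d/disable" ] || continue
        name=unknown
        [ -r "$d/name" ] && name=$(cat "$d/name")
        echo "cpu$1 ${d##*/} ($name): disable=$(cat "$d/disable")"
    done
}
```

For the example above, report_cstates 13 should show disable=1 for every state except state0.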

Timer Migration Disable

In the Linux kernel, timer migration refers to the process of moving timers from one CPU to another. This is often done to balance the load across CPUs or to optimize power management by consolidating timers on fewer CPUs when others are idle. Timer migration can interfere with other tasks running on the target CPU, potentially affecting real-time performance on isolated CPU cores. By keeping timers on their original CPU, you minimize the risk of such interference.

Disabling timer migration in a real-time kernel helps maintain the consistency and predictability required for real-time applications, ensuring that timers are executed with minimal latency and interference.

Timer migration can be disabled with the following command:

$ echo 0 | sudo tee /proc/sys/kernel/timer_migration

Disable Swap

Accessing anonymous memory that has been swapped to disk results in a major page fault. Handling page faults can further increase latency and unpredictability, which is undesirable in real-time tasks. Swap can be disabled with the following command:

$ sudo swapoff -a
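Whether any swap area is still active can be checked from /proc/swaps, which lists one active swap area per line after a header line. The following is a minimal sketch; the helper names active_swaps and swap_is_off are hypothetical, and the optional file argument exists only so the functions can be tried against a test file.

```shell
#!/bin/sh
# active_swaps [FILE]: print the names of the active swap areas listed in
# FILE (default /proc/swaps), skipping the header line. (Hypothetical helper.)
active_swaps() {
    awk 'NR > 1 && NF > 0 { print $1 }' "${1:-/proc/swaps}"
}

# swap_is_off [FILE]: succeed only if no swap area is active.
swap_is_off() {
    [ -z "$(active_swaps "$1")" ]
}
```

For example, swap_is_off && echo "swap is disabled" prints the message only when no swap area remains active.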

Verify Benchmark Performance

After installing the real-time Linux kernel, benchmark the system to confirm that it is properly configured. Cyclictest is one of the most frequently used tools for benchmarking real-time systems and evaluating their relative performance. It accurately and repeatedly measures the difference between a thread’s intended wake-up time and the time at which it actually wakes up, and reports statistics about the system’s latency. It can measure latency caused by the hardware, the firmware, and the operating system. Use rt-tests v2.6 to collect performance data: it supports pinning measurement threads to specific isolated cores and keeping the main thread off the cores that run the measurement threads.

Follow the steps below to build rt-tests. Afterwards, cyclictest v2.6 is available in the rt-tests-2.6 directory:

$ wget https://web.git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/snapshot/rt-tests-2.6.tar.gz
$ tar zxvf rt-tests-2.6.tar.gz
$ cd rt-tests-2.6
$ make

Note: Make sure libnuma-dev is installed as a dependency before compiling:

$ sudo apt install libnuma-dev

An example command that runs the cyclictest benchmark is shown below:

$ cyclictest -mp 99 -t1 -a 13 -i 1000 --laptop -D 72h  -N --mainaffinity 12

Default parameters are used unless otherwise specified. Run cyclictest --help to list the modifiable arguments.

Option            Explanation
-p                Priority of the highest-priority thread
-t                Number of threads (without a value: one thread per available processor)
-a                Run thread #N on processor #N, or if CPUSET is given, pin threads to that set of processors in round-robin order
-i                Base interval of the thread in us (default: 1000)
-D                Length of the test run
-N                Print results in ns instead of us (default: us)
--mainaffinity    Run the main thread on CPU #N. This only affects the main thread, not the measurement threads
-m                Lock current and future memory allocations
--laptop          Do not set cpu_dma_latency, to save battery. Recommended when per-core C-state disable is in use

On a realtime-enabled system, the result might be similar to the following:

T: 0 ( 3407) P:99 I:1000 C: 100000 Min:      928 Act:   1376 Avg:   1154 Max:      18373

This result indicates an apparent short-term worst-case latency of about 18 us (the values are printed in ns because of the -N option). Pay particular attention to the Max value, as it indicates outliers. Even if the system has decent Avg (average) values, a single outlier as indicated by Max is enough to break or disturb a real-time system.
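Since Max is the value that matters, it can be extracted from the cyclictest summary lines and compared against a latency budget. The following is a minimal sketch: the helper names max_latency and check_latency are hypothetical, and the 20000 ns limit is an arbitrary illustration, not a recommended threshold.

```shell
#!/bin/sh
# max_latency: read cyclictest summary lines on stdin and print the largest
# Max value found. (Hypothetical helper name.)
max_latency() {
    sed -n 's/.*Max:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1
}

# check_latency LIMIT: compare the largest Max value on stdin against LIMIT
# (same unit as the cyclictest output, ns when -N is used).
check_latency() {
    max=$(max_latency)
    if [ -z "$max" ]; then
        echo "no Max value found"
        return 2
    fi
    if [ "$max" -le "$1" ]; then
        echo "PASS: Max $max <= $1"
    else
        echo "FAIL: Max $max > $1"
        return 1
    fi
}

# Sample line taken from the result above; 20000 is a hypothetical limit.
echo 'T: 0 ( 3407) P:99 I:1000 C: 100000 Min:      928 Act:   1376 Avg:   1154 Max:      18373' | check_latency 20000
```

With the sample line, this prints "PASS: Max 18373 <= 20000".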

If the real-time results are not satisfactory with the default installation, refer to OS Setup for BIOS optimization and to Optimize Performance to optimize the Linux OS and application runtime on Intel® Processors.