Real-time Workload Optimizations¶

In order to achieve real-time performance on a target system, certain runtime configurations and optimizations must be observed. The goal of this section is to establish a basis for enabling real-time capable workloads.

Cache Allocation Technology¶

Shared last-level caches are common on modern processors. For example, on an Intel Core™ or Xeon® processor, all cores share an L3 cache, whereas on an Intel Atom® processor, cores 0 and 1 share one L2 cache and cores 2 and 3 share another. As a result, a workload on one core can cause cache misses for a workload on another core that shares the same cache. This occurs when one workload evicts a cache line that is still in use by the other. When a cache miss occurs, the workload must wait while the data is fetched from memory. This introduces undesired jitter into the workload execution time and reduces determinism. To mitigate this issue, Intel Cache Allocation Technology provides a method to partition processor caches and assign these partitions to a Class-of-Service (COS). Associating workloads with different COS isolates the parts of the cache available to each workload, which prevents the workloads from evicting each other's cache lines.

Intel’s Cache Allocation Technology (CAT) helps address shared resource concerns by providing software control of where data is allocated into the last-level cache (LLC), enabling isolation and prioritization of key applications. See this link for a detailed explanation of the features and benefits of CAT: https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-cache-allocation-technology.html

Cache Allocation Technology Terminology¶

CAT - Cache Allocation Technology. The umbrella term for all uses of cache allocation.
pqos - Platform Quality of Service. A Linux tool built into the ECI image for controlling cache assignment.
COS - Class of Service. A class with an associated bitmask that determines which ways of a cache are exposed.
L3 - Level 3 Cache. Typically the last cache on Intel Core™ and Xeon® processors.
L2 - Level 2 Cache. Typically the last cache on Intel Atom® processors.
llc - Last-level-cache. Typically the L3 cache for Intel Core™ and Xeon® processors, and L2 for Atom® processors.

PQOS Usage Examples

CAT features can be accessed through the pqos (Platform Quality of Service) command. pqos allows the user to partition the cache and then associate a Class of Service (COS) with specific cores.

Note

The examples assume a system with 12 cores. Adjust them as necessary for the actual core count of the system.

Cache Monitoring Technology (CMT) and Memory B/W Monitoring (MBM) usage¶

Monitor all events on cores 0 to 11:
```
$ pqos -m all:0-11
$ pqos -m :0-11
```
Monitor LLC on cores 0, 2 and 6:
```
$ pqos -m llc:0,2,6
```
Monitor local memory B/W on cores 0-2 and remote memory B/W on cores 3, 4 and 5:
```
$ pqos -m "mbl:0-2;mbr:3,4,5"
```

Monitor events on groups of cores (aggregate statistics):

$ pqos -m "all:[0-11];llc:[12,13,14];mbl:[15-17,20]"

Reset monitoring. Reclaims in-use RMIDs.
```
$ pqos -r
```

Example CMT/MBM usage scenario¶

A user has a host machine running 3 guest VMs with 3 cores assigned to each guest.

VM0 - cores 0-2
VM1 - cores 3-5
VM2 - cores 6-8

To monitor all events (LLC occupancy, local and remote memory B/W) run:

$ pqos -m "all:[0-2],[3-5],[6-8];"

Console output:

CORE    IPC   MISSES    LLC[KB]  MBL[MB/s]  MBR[MB/s]
 0-2   0.28    7893k      383.2      901.2      430.8
 3-5   0.28      45k       25.3   361282.6       22.4
 6-8   0.26   89468k     6778.8    43904.3        4.3

Cache Allocation Technology (CAT) usage¶

Set COS 1 to the first 4 cache ways and COS 2 to the next 8 cache ways:
```
$ pqos -e "llc:1=0x000f;llc:2=0x0ff0;"
```
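The masks in the command above are plain bitmasks in which each set bit selects one cache way. As a minimal sketch of that arithmetic, the helper `way_mask` below (a hypothetical name, not part of pqos) builds a contiguous mask from a starting way and a way count:

```shell
#!/bin/sh
# way_mask START COUNT: print a hex bitmask with COUNT consecutive bits set,
# starting at bit START (bit 0 = first cache way).
# Hypothetical helper for illustration; it is not part of pqos.
way_mask() {
    printf '0x%x\n' $(( ((1 << $2) - 1) << $1 ))
}

way_mask 0 4   # first 4 ways
way_mask 4 8   # next 8 ways
```

The two calls print 0xf and 0xff0, the same values as 0x000f and 0x0ff0 in the pqos command.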
Set COS 1 on all sockets, COS 2 on sockets 0 and 1, and COS 3 on sockets 2 to 3:
```
$ pqos -e "llc:1=0x000f;llc@0,1:2=0x0ff0;llc@2-3:3=0x3c"
```

Console output for pqos -s to show current configuration:

L3CA COS definitions for Socket 0:
    L3CA COS0 => MASK 0xfffff
    L3CA COS1 => MASK 0xf
    L3CA COS2 => MASK 0xff0
    L3CA COS3 => MASK 0xfffff
    ...
L3CA COS definitions for Socket 1:
    L3CA COS0 => MASK 0xfffff
    L3CA COS1 => MASK 0xf
    L3CA COS2 => MASK 0xff0
    L3CA COS3 => MASK 0xfffff
    ...
L3CA COS definitions for Socket 2:
    L3CA COS0 => MASK 0xfffff
    L3CA COS1 => MASK 0xf
    L3CA COS2 => MASK 0xfffff
    L3CA COS3 => MASK 0x3c
    ...
L3CA COS definitions for Socket 3:
    L3CA COS0 => MASK 0xfffff
    L3CA COS1 => MASK 0xf
    L3CA COS2 => MASK 0xfffff
    L3CA COS3 => MASK 0x3c
    ...

Associate cores 0, 2, and 6 to 10 with COS 1 and core 1 with COS 2:
```
$ pqos -a "llc:1=0,2,6-10;llc:2=1;"
```
Enable, disable L3 CDP:
```
$ pqos -R l3cdp-on
$ pqos -R l3cdp-off
```
Enable, disable L3 CDP (alternative -S syntax):
```
$ pqos -S cdp-on
$ pqos -S cdp-off
```
Use current L3 CDP settings and set COS 1 code and data bitmasks:

$ pqos -S cdp-any -e "llc:1d=0xfff;llc:1c=0xfff00;"

or

$ pqos -e "llc:1d=0xfff;llc:1c=0xfff00;"

Show current CAT settings:
```
$ pqos -s
```
Reset CAT: Sets all COS to default (fill into all ways) and associates all cores with COS 0.
```
$ pqos -R
```

Example CAT usage scenario¶

A user has a host machine running 3 guest VMs. Each guest is assigned 3 cores and a priority.

VM0 - cores 0-2 (P5)
VM1 - cores 3-5 (P2)
VM2 - cores 6-8 (P1)

As VM0 has the highest priority, it is assigned 8 exclusive LLC ways. VM1 and VM2 have relatively low priority, so VM1 is assigned 6 ways and VM2 is assigned 4 ways, 2 of which are shared between VM1 and VM2.

First, set the 3 COS bitmasks for each VM:

$ pqos -e "llc:1=0x00ff;llc:2=0x3f00;llc:3=0xf000;"

Next, associate each COS with the cores where each VM is running:
```
$ pqos -a "llc:1=0-2;llc:2=3-5;llc:3=6-8;"
```

VM0 now has exclusive access to 8 LLC ways, VM1 has exclusive access to 4 ways and shared access to 2 ways, and VM2 has exclusive access to 2 ways and shared access to the same 2 ways as VM1. All other cores have access to all other ways.
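
The way counts in this scenario can be checked with bitwise arithmetic on the three masks. The sketch below is illustrative only: `popcount` is a hypothetical helper (not part of pqos), and the calculation considers only the three classes defined in the example.

```shell
#!/bin/sh
# popcount N: print the number of set bits in N (the number of ways in a mask).
# Hypothetical helper for illustration; it is not part of pqos.
popcount() {
    n=$(( $1 )); c=0
    while [ "$n" -ne 0 ]; do
        c=$(( c + (n & 1) ))
        n=$(( n >> 1 ))
    done
    echo "$c"
}

# COS masks from the example above
VM0=0x00ff; VM1=0x3f00; VM2=0xf000

echo "VM0 exclusive ways:  $(popcount $(( $VM0 & ~$VM1 & ~$VM2 )))"
echo "VM1 exclusive ways:  $(popcount $(( $VM1 & ~$VM0 & ~$VM2 )))"
echo "VM2 exclusive ways:  $(popcount $(( $VM2 & ~$VM0 & ~$VM1 )))"
echo "VM1/VM2 shared ways: $(popcount $(( $VM1 & $VM2 )))"
```

The script reports 8, 4, 2 and 2 ways respectively, matching the description above.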

Example CAT usage on ECI images¶

ECI images pin all Linux kernel tasks to core 0, and isolate cores 1-3 by default. Under these conditions, it’s advantageous to allocate the CPU cache such that the Linux kernel tasks never share cache with any tasks running on the isolated cores. To achieve this result, perform the following steps:

Recommended CAT configuration for Intel Core™ or Xeon® processors

The following example sets the L3 cache mask of core 0 to 0x0f and the L3 cache mask of cores 1-3 to 0xf0.

Attention

This example is best suited for Intel Core™ or Xeon® processors which share last-level L3 cache. For Intel Atom® processors, see subsequent example.

Reset cache allocation to default state.
```
pqos -R
```
Define the allocation classes for the last-level cache (LLC). Class 0 is allocated exclusive access to the first half of the LLC. Class 1 is allocated exclusive access to the second half of the LLC.
```
pqos -e 'llc:0=0x0f;llc:1=0xf0'
```
Associate core 0 with class 0, and cores 1-3 with class 1.
```
pqos -a 'llc:0=0;llc:1=1,2,3'
```
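The masks 0x0f and 0xf0 split an 8-way cache into two halves. The number of ways differs between processors (the pqos -s output shown earlier reports 20-bit masks), so the values may need adjusting. As a rough sketch, the hypothetical helper `split_llc` below derives the two half masks from a way count and prints the matching pqos command without executing it:

```shell
#!/bin/sh
# split_llc WAYS: print a pqos command that gives class 0 the lower half and
# class 1 the upper half of a cache with WAYS ways (WAYS is assumed to be even).
# Dry run only: the command is printed, not executed.
# Hypothetical helper for illustration; it is not part of pqos.
split_llc() {
    half=$(( $1 / 2 ))
    low=$(( (1 << half) - 1 ))
    high=$(( low << half ))
    printf "pqos -e 'llc:0=0x%x;llc:1=0x%x'\n" "$low" "$high"
}

split_llc 8
```

For 8 ways this prints pqos -e 'llc:0=0xf;llc:1=0xf0', which is equivalent to the command used in the step above.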

Recommended CAT configuration for Intel Atom® processors

The following example sets the L2 cache mask of cores 0 and 3 to 0xfe and the L2 cache mask of cores 1 and 2 to 0x1.

Attention

This example is best suited for Intel Atom® processors which share last-level L2 cache. For Intel Core™ or Xeon® processors, see previous example.

$ pqos -R
$ pqos -e 'l2:0=0xfe;l2:1=0x1;l2:2=0x1;l2:3=0xfe'
$ pqos -a 'llc:0=0;llc:1=1;llc:2=2;llc:3=3'

Memory Bandwidth Allocation (MBA) usage¶

Set COS 1 to 50% available and COS 2 to 70% available:
```
$ pqos -e "mba:1=50;mba:2=70;"
```
Set COS 1 on all sockets, COS 2 on sockets 0 and 1, and COS 3 on sockets 2 to 3. Note: MBA rounds the values given to it (in the output below, 64 is applied as 60 and 85 as 90).
```
$ pqos -e "mba:1=80;mba@0,1:2=64;mba@2-3:3=85"
```

Console output for pqos -s to show current configuration:

L3CA/MBA COS definitions for Socket 0:
    MBA COS0 => 100% available
    MBA COS1 => 80%  available
    MBA COS2 => 60%  available
    MBA COS3 => 100% available
    ...
L3CA/MBA COS definitions for Socket 1:
    MBA COS0 => 100% available
    MBA COS1 => 80%  available
    MBA COS2 => 60%  available
    MBA COS3 => 100%  available
    ...
L3CA/MBA COS definitions for Socket 2:
    MBA COS0 => 100% available
    MBA COS1 => 80%  available
    MBA COS2 => 100%  available
    MBA COS3 => 90%  available
    ...
L3CA/MBA COS definitions for Socket 3:
    MBA COS0 => 100% available
    MBA COS1 => 80%  available
    MBA COS2 => 100%  available
    MBA COS3 => 90%  available
    ...
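
The rounding visible in this output (64 becomes 60, 85 becomes 90) is consistent with rounding to the nearest multiple of 10. The sketch below reproduces that behaviour under the assumption of a 10% step; the step size is inferred from the output above, and the actual granularity depends on the hardware. `round_mba` is a hypothetical helper, not a pqos function.

```shell
#!/bin/sh
# round_mba PERCENT [STEP]: round PERCENT to the nearest multiple of STEP.
# STEP defaults to 10, an assumption inferred from the output above; the real
# granularity depends on the hardware and may differ.
round_mba() {
    step=${2:-10}
    echo $(( ($1 + step / 2) / step * step ))
}

for v in 80 64 85; do
    echo "requested ${v}% -> $(round_mba "$v")%"
done
```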

Show current MBA settings:
```
$ pqos -s
```
Reset MBA: Sets all COS to default and associates all cores with COS 0.
```
$ pqos -R
```

Best Practices for Achieving Real-time Performance¶

Eliminate Sources of CPU Contention¶

To achieve real-time performance, it is imperative to isolate real-time workloads from other tasks. ECI-B provides CPU isolation by default. The kernel boot parameters of ECI-B isolate CPUs 1 and 3 and use CPU 0 to handle Linux kernel interrupts. This configuration allows the use of CPUs 1 and 3 for real-time workloads.

Prioritize Workloads¶

A simple and effective method to boost the performance of a real-time workload is to increase its runtime priority. Use the following command to run a workload with increased runtime priority (where <workload> is replaced with the name of your application):

$ chrt -f 1 <workload>

To set the priority of all child tasks of a parent workload, use the following command (where <workload> is replaced with the name of your workload):

$ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 chrt -p -f 1
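
Because the pipeline above selects tasks by name with grep, it may also match unrelated processes whose command line contains the same text. As an alternative sketch, the thread IDs of a single process can be read from its task directory under /proc. The helper name `list_tids` and its optional second argument are hypothetical and used here only for illustration:

```shell
#!/bin/sh
# list_tids PID [PROC_ROOT]: print the thread IDs of process PID, one per line,
# by listing PROC_ROOT/PID/task (PROC_ROOT defaults to /proc).
# Hypothetical helper for illustration.
list_tids() {
    ls "${2:-/proc}/$1/task" 2>/dev/null
}

# Example: show the threads of the current shell
list_tids $$ || echo "no task directory found for PID $$"
```

Assuming the workload runs as a single process, the output of `list_tids "$(pidof -s <workload>)"` could then be passed to `xargs -n 1 chrt -p -f 1` in the same way as above.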

Use Cache Allocation Technology¶

Shared last-level caches are common on modern processors. For example, on an Intel Core™ or Xeon® processor, all cores share an L3 cache, whereas on an Intel Atom® processor, cores 0 and 1 share one L2 cache and cores 2 and 3 share another. As a result, a workload on one core can cause cache misses for a workload on another core that shares the same cache. This occurs when one workload evicts a cache line that is still in use by the other. When a cache miss occurs, the workload must wait while the data is fetched from memory. This introduces undesired jitter into the workload execution time and reduces determinism. To mitigate this issue, Intel Cache Allocation Technology provides a method to partition processor caches and assign these partitions to a Class-of-Service (COS). Associating workloads with different COS isolates the parts of the cache available to each workload, which prevents the workloads from evicting each other's cache lines. See Cache Allocation Technology for more information on using CAT.

ECI images pin all Linux kernel tasks to core 0, and isolate cores 1-3 by default. Under these conditions, it’s advantageous to allocate the CPU cache such that the Linux kernel tasks never share cache with any tasks running on the isolated cores. To achieve this result, perform the following steps:

Recommended CAT configuration for Intel Core™ or Xeon® processors

The following example sets the L3 cache mask of core 0 to 0x0f and the L3 cache mask of cores 1-3 to 0xf0.

Attention

This example is best suited for Intel Core™ or Xeon® processors which share last-level L3 cache. For Intel Atom® processors, see subsequent example.

Reset cache allocation to default state.
```
pqos -R
```
Define the allocation classes for the last-level cache (LLC). Class 0 is allocated exclusive access to the first half of the LLC. Class 1 is allocated exclusive access to the second half of the LLC.
```
pqos -e 'llc:0=0x0f;llc:1=0xf0'
```
Associate core 0 with class 0, and cores 1-3 with class 1.
```
pqos -a 'llc:0=0;llc:1=1,2,3'
```

Recommended CAT configuration for Intel Atom® processors

The following example sets the L2 cache mask of cores 0 and 3 to 0xfe and the L2 cache mask of cores 1 and 2 to 0x1.

Attention

This example is best suited for Intel Atom® processors which share last-level L2 cache. For Intel Core™ or Xeon® processors, see previous example.

$ pqos -R
$ pqos -e 'l2:0=0xfe;l2:1=0x1;l2:2=0x1;l2:3=0xfe'
$ pqos -a 'llc:0=0;llc:1=1;llc:2=2;llc:3=3'

Stop Unnecessary Services¶

Many services run in the background by default on Linux. Stopping services may reduce spurious interrupts depending on the workload type. To list the loaded services, use the following command:

$ systemctl -t service

To stop a service, use the command below (where <service> is replaced with the name of a service).

Warning

Stopping system services can be detrimental to the stability of the Linux system. Be sure you understand the implications before stopping a service.

$ systemctl stop <service>

Disable Machine Checks¶

By default, the Linux kernel periodically scans hardware for reported errors. While this feature can be useful for tracking down troublesome bugs, it also presents a source of workload preemption. Disabling this check improves real-time performance of workloads by preventing the Linux kernel from interrupting the running tasks. Use the following command to disable machine checks:

echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval

Increase Thread Runtime Limit¶

By default, the real-time throttling mechanism allows real-time tasks to use 95% of the CPU time. The remaining 5% is reserved for non-realtime tasks (tasks running under SCHED_OTHER and similar scheduling policies). It is important to note that if a single real-time task occupies the entire 95% CPU time slot, the remaining real-time tasks on that CPU will not run. The remaining 5% of CPU time is used only by non-realtime tasks.

The impact of the default values is two-fold: on the one hand, rogue real-time tasks cannot lock up the system by starving non-realtime tasks; on the other hand, real-time tasks have at most 95% of CPU time available to them, which may affect their performance.

If a particular workload is known to be stable while consuming 100% of a CPU, the runtime limit can be removed by setting it to -1.

Warning

Configuring the runtime limit in this way can potentially lock a system if the workload in question contains unbounded polling loops. Use this configuration with caution.

echo -1 > /proc/sys/kernel/sched_rt_runtime_us
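
The 95% figure follows from the two kernel tunables involved: by default, sched_rt_runtime_us is 950000 and sched_rt_period_us is 1000000. A minimal sketch of that calculation, using a hypothetical helper named `rt_share`, is shown below. It only reads the current values and does not change any setting.

```shell
#!/bin/sh
# rt_share RUNTIME_US PERIOD_US: print the percentage of each period that
# real-time tasks may consume. A runtime of -1 means throttling is disabled.
# Hypothetical helper for illustration.
rt_share() {
    if [ "$1" -eq -1 ]; then
        echo 100
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# Kernel defaults: 950000 us of runtime in every 1000000 us period
echo "default RT share: $(rt_share 950000 1000000)%"

# On a live system, report the current setting if the tunables are readable
if [ -r /proc/sys/kernel/sched_rt_runtime_us ] && [ -r /proc/sys/kernel/sched_rt_period_us ]; then
    rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
    period=$(cat /proc/sys/kernel/sched_rt_period_us)
    echo "current RT share: $(rt_share "$rt" "$period")%"
fi
```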

Typical Workload Optimization Flow¶

When executing a workload, complete the following steps to increase real-time performance of the workload.

Stop unnecessary services. The following are a few of the services you may decide to stop; there are likely others.
```
$ systemctl stop ofono
$ systemctl stop wpa_supplicant
$ systemctl stop bluetooth
```
Stop the Docker daemon (if containers are not used)
```
$ systemctl stop docker
```
Stop the CODESYS daemon (if CODESYS is not used)
```
$ systemctl stop codesyscontrol
```

Disable kernel machine check interrupt

$ echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval

Disable thread runtime limit

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Set up cache partitioning. See Cache Allocation Technology for information on using CAT. This example sets the L2 cache mask of cores 0 and 3 to 0xfe and the L2 cache mask of cores 1 and 2 to 0x1.
```
$ pqos -R
$ pqos -e 'l2:0=0xfe;l2:1=0x1;l2:2=0x1;l2:3=0xfe'
$ pqos -a 'llc:0=0;llc:1=1;llc:2=2;llc:3=3'
```

Assign interrupt affinity to core 0. The script below iterates through all numbered interrupts and attempts to set the affinity of each one to core 0.

#!/bin/bash
# Steer every numbered IRQ to core 0 (CPU bitmask 0x1).
for i in $(grep '^ *[0-9]*[0-9]:' /proc/interrupts | awk '{print $1}' | sed 's/:$//'); do
 # Timer
 if [ "$i" = "0" ]; then
     continue
 fi
 # cascade
 if [ "$i" = "2" ]; then
     continue
 fi
 echo "setting IRQ $i affinity to core 0"
 # The write fails for IRQs whose affinity cannot be changed; the loop continues.
 echo 1 > "/proc/irq/$i/smp_affinity"
done
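
The value written to smp_affinity is a hexadecimal CPU bitmask in which bit n selects core n, which is why writing 1 selects core 0. As a small sketch, the hypothetical helper `cpu_mask` below computes the mask for any set of cores:

```shell
#!/bin/sh
# cpu_mask CORE...: print the hexadecimal affinity bitmask that selects the
# given cores, in the format written to /proc/irq/<n>/smp_affinity.
# Hypothetical helper for illustration.
cpu_mask() {
    mask=0
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0      # core 0 only
cpu_mask 1 3    # cores 1 and 3
```

The first call prints 1, the value used in the script above; the second prints a (binary 1010).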

Offload RCU tasks to core 0

$ for i in `pgrep rcu`; do taskset -pc 0 $i > /dev/null ; done

Start the real-time workload (where <workload> is replaced with the name of your workload)
```
$ ./<workload>
```
Change affinity of workload tasks (where <workload> is replaced with the name of your workload). This example assigns affinity of all workload tasks to core 3.
```
$ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 taskset -pac 3 > /dev/null
```

Change priority of workload tasks to be real-time (where <workload> is replaced with the name of your workload)

$ ps ww -eLo tid,comm,cmd | grep -i <workload> | awk '{print $1}' | xargs -n 1 chrt -p -f 1

Minimize integrated GPU utilization on the processor. Rather than connecting a display monitor to the target system, use SSH to access the system. This minimizes the interrupts generated by the integrated GPU, thus improving system determinism.

For complete examples that incorporate the above steps, see section System Performance Characterization.