Real-Time Scheduling on Linux IA64

Over time, two major approaches have been taken in the open-source software community to bring real-time requirements into Linux:

  • Improve the Linux kernel itself so that it matches real-time requirements, by providing bounded latencies, real-time APIs, etc. This is the approach taken by the mainline Linux kernel and the PREEMPT_RT project.

  • Add a layer below the Linux kernel (e.g. OS Real-time extension) that will handle all the real-time requirements, so that the behavior of Linux doesn’t affect real-time tasks. This is the approach taken by the Xenomai project.

General definitions

Both approaches aim to deliver the lowest thread-scheduling latency under Linux when RT and non-RT software run together on a multi-CPU system.

../_images/def_scheduling_latency.png

Note

Scheduling latency = interrupt latency + handler duration + scheduler latency + scheduler duration

IA64 interrupt definitions

Interrupts can be described as an “immediate response to hardware events”. The execution of this response is typically called an Interrupt Service Routine (ISR). In the process of servicing the ISR, many latencies may occur. These latencies are divided into two components, based on their originating source as follows:

  • Software Interrupt Latency can be predicted from the system interrupt-disable time and the size of the system ISR (Interrupt Service Routine) prologue. The prologue is the code that manually saves registers and performs other operations before the interrupt handler proper starts.

  • Hardware Interrupt Latency reflects the time required for operations such as retiring in-flight instructions, determining the address of the interrupt handler, and storing the CPU registers.

../_images/def_IA64_interrupts_path.png

Various Types of Interrupt sources:

Legacy Interrupts XT-PIC

side-band signals backward compatible with PC/AT peripheral IRQs (i.e. PIRQ/INTR/INTx).

Message-Signaled Interrupts (MSI)

in-band messages that target a memory address and carry data along with the interrupt message. MSI exhibit the following characteristics:

  • MSI messages achieve the lowest latency possible. The CPU begins executing the MSI Interrupt Service Routine (ISR) immediately after it finishes its current instruction.

  • MSI messages appear as Posted Memory Write transactions. As such, a PCI function can request up to 32 MSI messages.

  • MSI messages send data along with the interrupt message but do not receive any hardware acknowledgment.

  • MSI messages write to specific device addresses and send transactions to the Local IO-APIC of the CPU to which the interrupt is assigned.

NMI – Non-Maskable Interrupts

Typically system events (e.g. power-button, watchdog timer, …). NMI exhibit the following characteristics:

  • NMI usually originate from Power-Control Unit (PCU) or IA64 firmware sources.

SCI – System Control Interrupt

Used by hardware to notify the OS through ACPI 5.0, PCAT, or IASOC (Hardware-Reduced ACPI)

SMI – System Management Interrupt

Generated by the power management hardware on the board. SMI exhibit the following characteristics:

  • SMI processing can last for hundreds of microseconds, and SMIs are the highest-priority interrupts (even higher than the NMI).

  • The CPU receives an SMI on events such as thermal sensor trips or chassis open, and jumps to a hard-wired location in a special SMM address space (System Management RAM).

  • The SMI cannot be intercepted by user code, since it has no vector in the CPU. This effectively renders SMI interrupts “invisible” to the operating system.

Linux multi-threading definitions

User-space process

Created when the POSIX fork() function is called; a process comprises:

  • An address space (made up of VMAs), which contains the program code, data, stack, shared libraries, etc.

  • One thread, which starts executing the main() function.

User-thread

Can be created inside an existing process using the POSIX pthread_create() function.

  • User-threads run in the same address space as the initial process thread.

  • User-threads start executing a function passed as argument to pthread_create().

Kernel-thread

Can be created inside a kernel module using the kernel's kthread_create() function (a Linux kernel API, not a POSIX one).

  • Kernel-threads are light-weight processes cloned from process 0 (the swapper); they share its memory map and limits, but hold a copy of its file-descriptor table.

  • Kernel-threads execute entirely in kernel address space and never run user-space code.

General Linux Timer definitions

Isochronous applications aim to complete their tasks at exactly defined times. Unfortunately, the standard Linux timer does not generally meet the required cycle-deadline resolution and/or precision.

For example, a typical timer function in Linux such as the gettimeofday() system call will return clock time with microsecond precision, where nanosecond timer precision is often desirable.

To mitigate this limitation, additional POSIX APIs have been created that provide more precise timing capability. These APIs are described below.

Timer cyclic-task scheduling

Within the PREEMPT_RT scheduling context, a cyclic-task timer can be created with a given clock-domain using the POSIX timer_create() function. This timer exhibits the following characteristics:

  • Delivery of signals at the expiry of POSIX timers cannot be done in the hard-interrupt context of the high-resolution timer interrupt.

  • Signal delivery must instead happen in thread context due to locking constraints, which results in long latencies.

//POSIX timers
int timer_create(clockid_t clockid,
                 struct sigevent *sevp,
                 timer_t *timerid);
Task nanosleep cyclic scheduling wake-up

Within the COBALT task scheduling context, cyclic-task timers can be created with a given clock-domain using the POSIX clock_nanosleep() function. This timer exhibits the following characteristics:

  • clock_nanosleep() does not rely on the signaling mechanism, which is why it does not suffer from the latency problem described above.

  • The task's sleep-state timer expiry is executed in the context of the high-resolution timer interrupt.

  • If an application does not use an asynchronous signal handler, it is better to use clock_nanosleep().

//Clock_nanosleep
int clock_nanosleep(clockid_t clock_id,
                    int flags,
                    const struct timespec *request,
                    struct timespec *remain);

PREEMPT_RT preemptive and priority scheduling on Linux OS runtime

The PREEMPT_RT project is an open-source framework under the GPLv2 license, led by Linux kernel developers.

The goal is to gradually improve the Linux kernel with respect to real-time requirements and to get these improvements merged into the mainline kernel; PREEMPT_RT development therefore works very closely with mainline development.

Many of the improvements designed, developed, and debugged inside PREEMPT_RT over the years are now part of the mainline Linux kernel. The project is a long-term branch of the Linux kernel that should ultimately disappear once everything has been merged.

Setting low-latency Interrupt SW handling

PREEMPT_RT enforces fundamental software design rules to reach fully-preemptive, low-latency scheduling by evangelizing “no non-threaded IRQ nesting” development practices across the kernel and numerous driver/module code bases.

The top-half, started by the CPU as soon as an interrupt is flagged, is supposed to complete as quickly as possible:

  1. The interrupt controller (APIC, MSI, etc) receives an event from hardware that triggers an interrupt.

  2. The processor switches modes, saves registers, disables preemption, and disables IRQs.

  3. Generic Interrupt vector code is called.

  4. At this point, the context of the interrupted activity is saved.

  5. Lastly, the relevant ISR pertaining to the interrupt event is identified and called.

The bottom-half, scheduled by the top-half as soft-IRQs, tasklets, or work-queue tasks, completes the work started by the ISR:

  • For real-time-critical interrupts, bottom halves should be used very carefully: their execution time is non-deterministic, as it depends on the top halves of all other interrupts.

  • For non-real-time interrupts, bottom halves should be threaded to reduce the duration of non-preemptible sections.

../_images/preempt_rt_top_bottom_half.png

Multi-thread scheduling Preemption CAN happen when:

  • High priority task wakes up as a result of an interrupt

  • Time slice expiration

  • System call results in task sleeping

Multi-thread scheduling Preemption CAN NOT happen inside kernel-code critical sections, i.e. when:

  • Interrupts explicitly disabled

  • Preemption explicitly disabled

  • Spinlock critical sections unless using preemptive spinlocks

Setting Preemptive and Priority Scheduling policies

The standard Linux kernel includes different scheduling policies, as described in the manpage for sched. There are three policies relevant for real-time tasks:

  • SCHED_FIFO implements a first-in, first-out scheduling algorithm.

    • When a SCHED_FIFO task starts running, it continues to run until it is preempted by a higher-priority thread, blocks on an I/O request, or calls the yield function.

    • All other tasks of lower priority will not be scheduled until the SCHED_FIFO task releases the CPU.

    • Two SCHED_FIFO tasks with same priority cannot preempt each other.

  • SCHED_RR is identical to SCHED_FIFO except in the way it handles processes with the same priority.

    • The scheduler assigns each SCHED_RR task a time slice; when the process exhausts its time slice, the scheduler moves it to the end of the list of processes at its priority.

    • In this manner, SCHED_RR tasks of a given priority are scheduled round-robin among themselves.

    • If there is only one process at a given priority, the RR scheduling is identical to the FIFO scheduling.

  • SCHED_DEADLINE is implemented using the Earliest Deadline First (EDF) scheduling algorithm, in conjunction with a Constant Bandwidth Server (CBS).

    • The SCHED_DEADLINE policy uses three parameters to schedule tasks: Runtime, Deadline, and Period.

    • A SCHED_DEADLINE task gets “runtime” nanoseconds of CPU time every “period” nanoseconds, and the “runtime” nanoseconds must be available within “deadline” nanoseconds from the beginning of the period.

    • Tasks are scheduled using EDF based on their scheduling deadlines (recalculated every time a task wakes up).

    • The task with the earliest deadline is executed first.

    • SCHED_DEADLINE threads are the highest priority (user controllable) threads in the system.

    • If any SCHED_DEADLINE thread is runnable, it will preempt any thread scheduled under one of the other policies.

Priority Inheritance means that the holder of a lock (e.g. spin_lock, mutex, …) inherits the priority of the highest-priority thread waiting for that lock.

CONFIG_PREEMPT_RT_FULL provides priority-inheritance capabilities to the rtmutex, spin_lock, and mutex code:

Without priority inheritance, a process with a low priority might hold a lock needed by a higher-priority process, effectively reducing the priority of that higher-priority process.

chrt runtime Processes Linux scheduling policies

On Linux, the chrt command can be used to set the real-time attributes of a process, such as policy and priority.

  • Syntax to set the scheduling policy to SCHED_FIFO; priority values can be between 1 and 99:

    chrt --fifo --pid <priority> <pid>
    

    The example below sets the scheduling attributes to SCHED_FIFO with priority 99 for the process with pid 1823:

    root@intel-corei7-64:~# chrt --fifo --pid 99 1823
    
  • Syntax to set the scheduling policy to SCHED_RR (round-robin); priority values can be between 1 and 99:

    chrt --rr --pid <priority> <pid>
    

    The example below sets the scheduling attributes to SCHED_RR with priority 99 for the process with pid 1823:

    root@intel-corei7-64:~# chrt --rr --pid 99 1823
    
  • Syntax to set the scheduling policy to SCHED_DEADLINE; the priority value must be 0, and the constraint runtime <= deadline <= period must hold:

    chrt --deadline --sched-runtime <nanoseconds> \
                    --sched-period <nanoseconds> \
                    --sched-deadline <nanoseconds> \
                    --pid <priority> <pid>
    

    The example below sets the scheduling attributes to SCHED_DEADLINE for the process with pid 472; the runtime, deadline, and period are given in nanoseconds. First, list the current policies:

    root@intel-corei7-64:~# ps f -g 0 -o pid,policy,rtprio,cmd
    PID POL RTPRIO CMD
       1 TS       - /sbin/init nosoftlockup noht 3
     185 TS       - /lib/systemd/systemd-journald
     209 TS       - /lib/systemd/systemd-udevd
     472 RR      99 /usr/sbin/acpid
     476 TS       - /usr/sbin/thermald --no-daemon --dbus-enable
     486 TS       - /usr/sbin/jhid -d
    

    Execute the command below to change the policy to SCHED_DEADLINE (reported by ps as policy #6):

    root@intel-corei7-64:~# chrt --deadline --sched-runtime 10000 \
                                                --sched-deadline 100000 \
                                                --sched-period 1000000  \
                                                --pid 0 472
    

    Execute the command below to see the change in the policy of the task:

    root@intel-corei7-64:~# ps f -g 0 -o pid,policy,rtprio,cmd
    PID POL RTPRIO CMD
       1 TS       - /sbin/init nosoftlockup noht 3
     185 TS       - /lib/systemd/systemd-journald
     209 TS       - /lib/systemd/systemd-udevd
     472 #6       0 /usr/sbin/acpid
     476 TS       - /usr/sbin/thermald --no-daemon --dbus-enable
     486 TS       - /usr/sbin/jhid -d
    

    Condensing this information in a table:

    Prio  Names
    ----  -----
    99    posixcputmr, migration
    50    All IRQ handlers except 39-s-mmc0 and 42-s-mmc1 (e.g. 367-enp2s0 deals with one of the network interfaces)
    49    IRQ handlers 39-s-mmc0 and 42-s-mmc1
    1     i915/signal, ktimersoftd, rcu_preempt, rcu_sched, rcub, rcuc
    0     The rest of the tasks currently running

    The highest-priority real-time tasks in this system are the timers and migration threads, with prio 99. The lowest-priority real-time tasks, at prio 1, are kernel housekeeping threads such as i915/signal, ktimersoftd, and the RCU threads.

sched_setscheduler() and sched_setattr() runtime Linux scheduling policies for processes

The sched_setscheduler() function can be used to change the active scheduling policy. The following values can be used to set real-time scheduling policies:

  • SCHED_FIFO

  • SCHED_RR

Note

Non-real-time scheduling policies such as SCHED_OTHER, SCHED_BATCH, and SCHED_IDLE are also available. There is no support for the deadline scheduling policy in the sched_setscheduler() function.

  • The sched_setscheduler() function sets the SCHED_FIFO or SCHED_RR scheduling policy and priority for a real-time thread:

    int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param);
    

    The example below configures the running process to use SCHED_RR scheduling with priority 99:

    #include <sched.h>    /* sched_setscheduler, SCHED_RR, struct sched_param */
    #include <stdio.h>    /* perror */
    #include <string.h>   /* memset */
    #include <unistd.h>   /* getpid */

    struct sched_param param_rr;
    memset(&param_rr, 0, sizeof(param_rr));
    param_rr.sched_priority = 99;
    pid_t pid = getpid();
    if (sched_setscheduler(pid, SCHED_RR, &param_rr))
      perror("sched_setscheduler");
    

    The example below configures the running process to use SCHED_FIFO scheduling with priority 99:

    struct sched_param param_fifo;
    memset(&param_fifo, 0, sizeof(param_fifo));
    param_fifo.sched_priority = 99;
    pid_t pid = getpid();
    if (sched_setscheduler(pid, SCHED_FIFO, &param_fifo))
      perror("sched_setscheduler");
    
  • The sched_setattr() function sets the SCHED_DEADLINE scheduling policy (available since kernel version 3.14):

    int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags);
    

    In the example below, the running process is assigned the SCHED_DEADLINE policy. The process gets a runtime of 2 milliseconds in every 9-millisecond period, and that runtime must be available within a 5-millisecond deadline from the beginning of each period.

     #define _GNU_SOURCE
     #include <stdint.h>
     #include <stdio.h>
     #include <unistd.h>
     #include <sys/syscall.h>
     #include <sched.h>
     #include <string.h>
     #include <linux/sched.h>
     #include <sys/types.h>
    
     struct sched_attr {
      uint32_t size;
      uint32_t sched_policy;
      uint64_t sched_flags;
      int32_t sched_nice;
      uint32_t sched_priority;
      uint64_t sched_runtime;
      uint64_t sched_deadline;
      uint64_t sched_period;
    };
    
    int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags) {
       return syscall(__NR_sched_setattr, pid, attr, flags);
    }
    
    int main() {
             unsigned int flags = 0;
             int status = -1;
             struct sched_attr attr_deadline;
             memset(&attr_deadline, 0, sizeof(attr_deadline));
             pid_t pid =  getpid();
             attr_deadline.sched_policy = SCHED_DEADLINE;
             attr_deadline.sched_runtime = 2*1000*1000;
             attr_deadline.sched_deadline = 5*1000*1000;
             attr_deadline.sched_period = 9*1000*1000;
             attr_deadline.size = sizeof(attr_deadline);
             attr_deadline.sched_flags = 0;
             attr_deadline.sched_nice = 0;
             attr_deadline.sched_priority = 0;
             status = sched_setattr(pid, &attr_deadline, flags);
             if (status)
                  perror("sched_setattr");
             return 0;
    }
    

pthread POSIX Linux runtime scheduling APIs

The scheduling policy for threads can be set using the pthread functions pthread_attr_setschedpolicy(), pthread_attr_setschedparam(), and pthread_attr_setinheritsched().

Creating a real-time thread with the FIFO scheduling policy using POSIX pthread functions can be broken down into simple steps:

  1. Initialize the pthread_attr_t (thread attribute) object using the pthread_attr_init() function.

    pthread_attr_t attr_fifo;
    pthread_attr_init(&attr_fifo);
    
  2. After initialization, set the thread attributes object referred to by attr_fifo to SCHED_FIFO (FIFO scheduling policy) using pthread_attr_setschedpolicy.

    pthread_attr_setschedpolicy(&attr_fifo, SCHED_FIFO);
    
  3. Set the priority of the thread (values between 1 and 99 for FIFO scheduling) in a sched_param object, and copy the parameter values into the thread attributes using pthread_attr_setschedparam().

    struct sched_param param_fifo;
    param_fifo.sched_priority = 92;
    pthread_attr_setschedparam(&attr_fifo, &param_fifo);
    
  4. Set the inherit-scheduler attribute of the thread attributes. The inherit-scheduler attribute determines whether the new thread takes its scheduling attributes from the calling thread or from attr. To use the scheduling attributes set in attr, call pthread_attr_setinheritsched() with PTHREAD_EXPLICIT_SCHED.

    pthread_attr_setinheritsched(&attr_fifo, PTHREAD_EXPLICIT_SCHED);
    
  5. Finally, create the thread by calling the pthread_create() function:

    pthread_t thread_fifo;
    pthread_create(&thread_fifo, &attr_fifo, thread_function_fifo, NULL);
    

    Putting it all together, the simplest preemptible multi-threading application under the FIFO scheduling policy looks as follows:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    
    void *thread_function_fifo(void *data) {
     printf("Inside Thread\n");
     return NULL;
    }
    
     int main(int argc, char* argv[]) {
             struct sched_param param_fifo;
             pthread_attr_t attr_fifo;
             pthread_t thread_fifo;
             int status = -1;
             memset(&param_fifo, 0, sizeof(param_fifo));
             status = pthread_attr_init(&attr_fifo);
             if (status) {
                     printf("pthread_attr_init failed\n");
                     return status;
             }
             status = pthread_attr_setschedpolicy(&attr_fifo, SCHED_FIFO);
             if (status) {
                     printf("pthread_attr_setschedpolicy failed\n");
                     return status;
             }
             param_fifo.sched_priority = 92;
             status = pthread_attr_setschedparam(&attr_fifo, &param_fifo);
             if (status) {
                     printf("pthread_attr_setschedparam failed\n");
                     return status;
             }
             status = pthread_attr_setinheritsched(&attr_fifo, PTHREAD_EXPLICIT_SCHED);
             if (status) {
                     printf("pthread_attr_setinheritsched failed\n");
                     return status;
             }
             status = pthread_create(&thread_fifo, &attr_fifo, thread_function_fifo, NULL);
             if (status) {
                     printf("pthread_create failed\n");
                     return status;
             }
             pthread_join(thread_fifo, NULL);
             return status;
     }
    

Since Glibc 2.25, the POSIX pthread condition variables (pthread_cond*) interact with priority inheritance as follows:

  • An rt_mutex cannot be in a state with waiters and no owner.

  • The pthread_cond* _wait() and _signal() operations are implemented with futex operations that are not PI-aware when putting the calling waiter to sleep.

../_images/pthread_cond_example.png

Note

references https://wiki.linuxfoundation.org/realtime/events/rt-summit2016/pthread-condvars

kthread Read-Copy Update (RCU)

Read-Copy Update (RCU) APIs are used heavily in the Linux code to synchronize kernel threads without locks:

  • Excellent for read-mostly data where staleness and inconsistency are acceptable

  • Good for read-mostly data where consistency is required

  • Can be OK for read-write data where consistency is required

  • Might not be best for update-mostly, consistency-required data

  • Provides existence guarantees that are useful for scalable updates.

Tuning RCU is part of any deterministic and synchronized data-segmentation strategy:

  • CONFIG_PREEMPT_RCU - Real-Time Preemption and RCU: readers manipulate CPU-local counters to limit blocking within RCU read-side critical sections (https://lwn.net/Articles/128228/)

  • CONFIG_RCU_NOCB_CPU - RCU callback offloading, directed to the CPUs of your choice

  • CONFIG_RCU_BOOST - RCU priority boosting: tasks blocking the current grace period for more than half a second are boosted to real-time priority level 1.

  • CONFIG_RCU_KTHREAD_PRIO and CONFIG_RCU_BOOST_DELAY provide additional control of RCU priority boosting

A pointer to an RCU-protected object is guaranteed to exist throughout the RCU read-side critical section, using very lightweight primitives:

../_images/rcu_statemachine_apis.png

All RCU writers must wait for an RCU grace period to elapse between making something inaccessible to readers and freeing (reclaiming) it:

spin_lock(&updater_lock);
q = cptr;
rcu_assign_pointer(cptr, new_p);
spin_unlock(&updater_lock);
synchronize_rcu(); /* Wait for grace period. */
kfree(q);

An RCU grace period allows all pre-existing readers to complete their RCU read-side critical sections: the grace period begins after the synchronize_rcu() call and ends after all CPUs have executed a context switch.

../_images/rcu_example.png

Setting POSIX thread virtual memory allocation (vma)

Linux process memory management is one of the most important and critical aspects of a PREEMPT_RT Linux runtime compared to a standard Linux runtime. From the kernel scheduling point of view there is no difference, since processes and threads are each represented by a task_struct kernel structure. From a scheduling-latency standpoint, however, a process context switch is significantly longer than a user-thread context switch within the same process, since switching processes requires a TLB flush.

../_images/def_process_thread.png

There are different memory-management algorithms designed to optimize runnable processes and improve system performance. For instance, whether a process needs the full memory page that a kernel-allocated mmap() returns, or only part of a page, memory management works together with the scheduler to utilize resources optimally.

Let’s explore three main areas of memory management:

Memory Locking

Memory locking is an essential part of program initialization: most real-time processes lock their memory for the whole of their execution. The mlock and mlockall functions can be used by applications to lock memory pages (virtual address space) into main memory, while munlock and munlockall unlock them.

  • mlock(const void *addr, size_t len) - locks a selected region (len bytes starting at addr) of the calling process's address space into memory.

  • mlockall(int flags) - locks all of the process's address space; MCL_CURRENT, MCL_FUTURE and MCL_ONFAULT are the available flags.

  • munlock(const void *addr, size_t len) - unlocks a specified region of the process's address space.

  • munlockall(void) - unlocks all of the process's address space.

Locking memory ensures that the application's pages are not evicted from main memory under memory pressure. It also ensures that page faults do not occur during RT-critical operations, which is very important.

Stack Memory

Each thread within an application has its own stack. The size of the stack can be specified using the pthread function pthread_attr_setstacksize().

Syntax of pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize):

  • attr - thread attribute structure.

  • stacksize - in bytes; must not be less than PTHREAD_STACK_MIN (16384 bytes). The default stack size on Linux follows the stack resource limit and is typically several megabytes.

If the stack size is not set explicitly, the default stack size is allocated. If the application uses a large number of RT threads, it is advisable to use a smaller stack size than the default.

Dynamic Memory Allocation

Dynamic memory allocation is not recommended for RT threads while execution is in the RT-critical path, as it increases the chance of page faults. It is better to allocate the required memory before RT execution starts and lock it using the mlock/mlockall functions. In the example below, the thread function dynamically allocates a buffer and accesses data at random locations within it:

#include <stdlib.h>   /* calloc, rand, free */

#define BUFFER_SIZE 1048576
void *thread_function_fifo(void *data) {
    double sum = 0.0;
    /* Dynamic allocation inside the RT path: may page-fault on first touch */
    double *tempArray = (double *)calloc(BUFFER_SIZE, sizeof(double));
    size_t randomIndex;
    int i = 50000;
    while (i--) {
        randomIndex = rand() % BUFFER_SIZE;
        sum += tempArray[randomIndex];
    }
    free(tempArray);
    return NULL;
}

Setting NoHz (Tickless) Kernel

The Linux kernel used to send a scheduling-clock interrupt (tick) to each CPU every jiffy, in order to periodically shift the CPU's attention between multiple tasks, where a jiffy is a very short period of time determined by the value of the kernel HZ constant.

This periodic tick is wasteful in cases such as:

  • Power-constrained devices such as mobile devices, where triggering the clock interrupt can drain the power source very quickly even when the system is idle.

  • Virtualization, where each of multiple OS instances might find that half of its CPU time is consumed by unnecessary scheduling-clock interrupts.

  • … many more

A tickless kernel inherently reduces the number of scheduling-clock interrupts, which helps improve energy efficiency and reduces Linux runtime scheduling jitter.

Below are the three contexts to consider when configuring scheduling-clock interrupts for energy efficiency:

CPU with Heavy Workload

CONFIG_HZ_PERIODIC=y (for older kernels, CONFIG_NO_HZ=n). Some CPUs carry heavy workloads with many tasks that each use the CPU for a very short time, producing frequent but very short idle periods (on the order of tens or hundreds of microseconds). For such workloads, reducing scheduling-clock ticks has the reverse effect: it increases the overhead of switching to and from idle and of transitioning between user and kernel execution.

CPU in idle

CONFIG_NO_HZ_IDLE=y. The primary purpose of a scheduling-clock interrupt is to force a busy CPU to shift its attention among multiple tasks. An idle CPU has no tasks to shift its attention to, so sending it scheduling-clock interrupts is of no use. Instead, configure the tickless kernel to avoid sending scheduling-clock interrupts to idle CPUs, thereby improving the energy efficiency of the system. This mode can be disabled from the boot command line by specifying nohz=off; by default, the kernel boots with nohz=on.

CPU with Single Task

CONFIG_NO_HZ_FULL=y. Where a CPU is dedicated to a single task, there is no point in sending it scheduling-clock interrupts to switch tasks. To avoid sending the tick to such CPUs, this setting in the Kconfig of the Linux kernel is useful.
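As a hedged illustration (the CPU numbers are arbitrary), a kernel boot command line that keeps the tick off dedicated CPUs might combine these options:

```
isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3
```

Here isolcpus keeps the scheduler from placing other tasks on CPUs 2-3, nohz_full stops the tick on them while a single task runs, and rcu_nocbs offloads their RCU callbacks (cf. CONFIG_RCU_NOCB_CPU above).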

Setting High Resolution Timers thread

Timer resolution has been progressively improved by the Linux community to offer a more precise way of waking up the system and processing data at more accurate intervals:

  • Initially, Unix/Linux systems used timers with a frequency of 100Hz (i.e. 100 timer events per second/one event every 10ms).

  • With Linux 2.4, i386 systems started using timers with a frequency of 1000Hz (i.e. 1000 timer events per second/one event every 1ms). The 1ms timer event improves minimum latency and interactivity, but also incurs higher timer overhead.

  • With Linux kernel version 2.6, the timer frequency was reduced to 250Hz (i.e. 250 timer events per second/one event every 4ms) to reduce timer overhead.

  • Finally, the Linux kernel gained nanosecond-precision high-resolution timers via the CONFIG_HIGH_RES_TIMERS=y built-in kernel option.

One can also examine the per-CPU timer lists in /proc/timer_list, as below:

  • If the .resolution value is 1 nanosecond, the clock supports high resolution.

  • If event_handler is set to hrtimer_interrupt, the high-resolution timer feature is active.

  • If .hres_active has a value of 1, the high-resolution timer feature is active.

Note

A resolution of 1 ns is not realistic; it only indicates that the system uses high-resolution timers. The usual resolution of HRTs on modern systems is in the microseconds.

 root@intel-corei7-64:~#cat /proc/timer_list | grep 'cpu:\|resolution\|hres_active\|clock\|event_handler'
 cpu: 0
  clock 0:
         .resolution: 1 nsecs
         #2: <ffffc9000214ba00>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, rpcbind/507
         #3: <ffffc900026d7d80>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, cleanupd/585
         #4: <ffffc9000269fd80>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, smbd-notifyd/584
         #8: <ffffc9000261fd80>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, smbd/583
         #9: <ffffc9000212bd80>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, syslog-ng/494
         #10: <ffffc900026dfd80>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, lpqd/587
  clock 1:
        .resolution: 1 nsecs
  clock 2:
        .resolution: 1 nsecs
  clock 3:
       .resolution: 1 nsecs
       .get_time:   ktime_get_clocktai
       .hres_active    : 1
cpu: 2
  clock 0:
      .resolution: 1 nsecs
       #2: <ffffc90002313a00>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, thermald/548
       #3: <ffffc900023fb8c0>, hrtimer_wakeup, S:01, schedule_hrtimeout_range_clock.part.28, wpa_supplicant/562
  clock 1:
       .resolution: 1 nsecs
  clock 2:
       .resolution: 1 nsecs
  clock 3:
       .resolution: 1 nsecs
       .get_time:   ktime_get_clocktai
       .hres_active    : 1
  event_handler:  tick_handle_oneshot_broadcast
  event_handler:  hrtimer_interrupt
  event_handler:  hrtimer_interrupt
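
The per-clock fields above can be summarized with a short shell snippet; this is a sketch that greps the same file the listing came from:

```shell
# Condensed check of the listing above. /proc/timer_list is often readable
# by root only, hence the guard. A nonzero count means high-resolution
# timers are active on at least one clock.
if [ -r /proc/timer_list ]; then
    grep -c '\.hres_active *: 1' /proc/timer_list
else
    echo "cannot read /proc/timer_list (run as root)"
fi
```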

pthread_…() POSIX Linux isochronous scheduling

An isochronous application is one that repeats at a fixed period of time:

  • The execution time of this application should always be less than its period.

  • An isochronous application should always run as a real-time thread so that its timing behavior can be measured meaningfully.

Below is a step-by-step breakdown of a simple isochronous real-time-thread sanity-check test:

  1. step - Define a structure that holds the time-period information along with the current clock time. This structure will be used to pass data between tasks.

    /*Data format to be passed between tasks*/
    struct time_period_info {
            struct timespec next_period;
            long period_ns;
    };
    
  2. step - Set the time period of the cyclic thread to 1 ms and get the current system time.

    /*Initialize the periodic task with 1ms time period*/
    static void initialize_periodic_task(struct time_period_info *tinfo)
    {
            /* keep time period for 1ms */
            tinfo->period_ns = 1000000;
            clock_gettime(CLOCK_MONOTONIC, &(tinfo->next_period));
    }
    
  3. step - Increment the timer so that the subsequent nanosleep completes the time period of the real-time thread.

    /*Increment the timer until the time period elapses and the Real time task will execute*/
    static void inc_period(struct time_period_info *tinfo)
    {
          tinfo->next_period.tv_nsec += tinfo->period_ns;
          while(tinfo->next_period.tv_nsec >= 1000000000){
            tinfo->next_period.tv_sec++;
            tinfo->next_period.tv_nsec -=1000000000;
          }
    }
    
  4. step - Use a loop to wait for the period to complete. The assumption here is that the thread execution time is shorter than the period.

    /*Assumption: the real-time task completes in less time than the period length, so wait until the period completes*/
    static void wait_for_period_complete(struct time_period_info *tinfo)
    {
            inc_period(tinfo);
            /* Ignore the possibility of signal wakeups */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &tinfo->next_period, NULL);
    }
    
  5. step - Define the real-time task. For simplicity, it only contains a print statement.

    static void *real_time_task(void)
    {
            printf("Real-Time Task executing\n");
            return NULL;
    }
    
  6. step - Initialize and trigger the cyclic execution of the real-time task, waiting for each period to complete. This function runs in a POSIX thread created from the main thread.

    void *realtime_isochronous_task(void *data)
    {
            struct time_period_info tpinfo;
            initialize_periodic_task(&tpinfo);
            while (1) {
                    real_time_task();
                    wait_for_period_complete(&tpinfo);
            }
            return NULL;
    }
    

    Note

    A non-real-time main thread spawns the real-time isochronous application thread here and sets its preemptive scheduling policy and priority.

  7. step - Implement the POSIX main thread, which creates and initializes the real-time thread with the desired attributes.

    int main(int argc, char* argv[]) {
            struct sched_param param_fifo;
            pthread_attr_t attr_fifo;
            pthread_t thread_fifo;
            int status = -1;
            memset(&param_fifo, 0, sizeof(param_fifo));
            status = pthread_attr_init(&attr_fifo);
            if (status) {
                    printf("pthread_attr_init failed\n");
                    return status;
            }
    

    Next, set the FIFO scheduling policy for the real-time thread.

    status = pthread_attr_setschedpolicy(&attr_fifo, SCHED_FIFO);
    if (status) {
      printf("pthread_attr_setschedpolicy failed\n");
      return status;
    }
    

    The real-time task priority is set to 92 here; a priority between 1 and 99 can be chosen.

    param_fifo.sched_priority = 92;
    status = pthread_attr_setschedparam(&attr_fifo, &param_fifo);
    if (status) {
            printf("pthread_attr_setschedparam failed\n");
            return status;
    }
    

    Set the inherit-scheduler attribute of the thread attribute object. The inherit-scheduler attribute determines whether the new thread takes its scheduling attributes from the calling thread or from the attribute object.

    status = pthread_attr_setinheritsched(&attr_fifo, PTHREAD_EXPLICIT_SCHED);
    if (status) {
            printf("pthread_attr_setinheritsched failed\n");
            return status;
    }
    

    The real-time isochronous application thread is created here.

    status = pthread_create(&thread_fifo, &attr_fifo, realtime_isochronous_task, NULL);
    if (status) {
            printf("pthread_create failed\n");
            return status;
    }
    

    Wait for the real-time task to complete:

            pthread_join(thread_fifo, NULL);
        return status;
    }
    

    Find the complete code example below.

    /*Header Files*/
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    
    /*Data format to be passed between tasks*/
    struct time_period_info {
            struct timespec next_period;
            long period_ns;
    };
    
    /*Initialize the periodic task with 1ms time period*/
    static void initialize_periodic_task(struct time_period_info *tinfo){
            /*Keep time period for 1ms*/
            tinfo->period_ns = 1000000;
            clock_gettime(CLOCK_MONOTONIC, &(tinfo->next_period));
    }
    
    /*Increment the timer until the time period elapses*/
    static void inc_period(struct time_period_info *tinfo){
            tinfo->next_period.tv_nsec += tinfo->period_ns;
            while(tinfo->next_period.tv_nsec >= 1000000000){
                    tinfo->next_period.tv_sec++;
                    tinfo->next_period.tv_nsec -=1000000000;
            }
    }
    
    /*Real time task requires less time to complete task as compared to period length, so wait till period completes*/
    static void wait_for_period_complete(struct time_period_info *tinfo){
            inc_period(tinfo);
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &tinfo->next_period, NULL);
    }
    
    /*Real Time Task*/
    static void *real_time_task(void){
            printf("Real-Time Task executing\n");
            return NULL;
    }
    
    /*Main module for an isochronous application task with Real Time priority and scheduling call as SCHED_FIFO */
    void *realtime_isochronous_task(void *data){
    
            struct time_period_info tinfo;
            initialize_periodic_task(&tinfo);
    
            while(1){
                    real_time_task();
                    wait_for_period_complete(&tinfo);
            }
            return NULL;
    }
    
    /*Non Real Time master thread that will spawn a Real Time isochronous application thread*/
    int main(int argc, char* argv[]) {
    
            struct sched_param param_fifo;
            pthread_attr_t attr_fifo;
            pthread_t thread_fifo;
            int status = -1;
            memset(&param_fifo, 0, sizeof(param_fifo));
    
            status = pthread_attr_init(&attr_fifo);
            if (status) {
                    printf("pthread_attr_init failed\n");
                    return status;
            }
    
            status = pthread_attr_setschedpolicy(&attr_fifo, SCHED_FIFO);
            if (status) {
                    printf("pthread_attr_setschedpolicy failed\n");
                    return status;
            }
    
            param_fifo.sched_priority = 92;
            status = pthread_attr_setschedparam(&attr_fifo, &param_fifo);
            if (status) {
                    printf("pthread_attr_setschedparam failed\n");
                    return status;
            }
    
            status = pthread_attr_setinheritsched(&attr_fifo, PTHREAD_EXPLICIT_SCHED);
            if (status) {
                    printf("pthread_attr_setinheritsched failed\n");
                    return status;
            }
    
            status = pthread_create(&thread_fifo, &attr_fifo, realtime_isochronous_task, NULL);
            if (status) {
                    printf("pthread_create failed\n");
                    return status;
            }
    
            pthread_join(thread_fifo, NULL);
            return status;
    }
    

Setting thread temporal-isolation via kernel boot parameters

Assume the following best-known configuration to implement CPU core temporal isolation:

  • cpu2 (Critical Core): will run our real-time applications

  • cpu0: will run everything else

See the table below for a non-exhaustive list of kernel cmdline options that act upon thread/process core affinity at boot time:

isolcpus (parameter: list of critical cores)

    The kernel scheduler will not migrate tasks from other cores onto them.

irqaffinity (parameter: list of non-critical cores)

    Directs IRQs to the listed cores, protecting the critical cores from IRQs.

rcu_nocbs (parameter: list of critical cores)

    Prevents RCU callbacks from being invoked on the listed cores.

nohz_full (parameter: list of critical cores)

    If a listed core is idle or has a single running task, it will not receive scheduling clock ticks. Use together with nohz=off so dynamic ticks do not impact latencies.

In our case, the resulting parameters will look like:

isolcpus=2 irqaffinity=0 rcu_nocbs=2 nohz=off nohz_full=2
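
Once the system has been rebooted with these parameters, the isolation can be confirmed from userspace. Both paths below are standard on recent kernels; the exact core list depends on your configuration:

```shell
# The live command line should contain the isolcpus/nohz_full settings.
cat /proc/cmdline
# sysfs reports which cores the kernel actually isolated (may be empty or
# absent on kernels booted without isolcpus).
cat /sys/devices/system/cpu/isolated 2>/dev/null
```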

To add custom boot parameters, edit the systemd-boot (systemd-bootx64.efi) OS loader entry file:

root@intel-corei7-64:~# vi /boot/EFI/loader/entries/boot.conf

title boot
linux /vmlinuz
initrd /initrd
options LABEL=boot isolcpus=2 irqaffinity=0 rcu_nocbs=2 nohz=off nohz_full=2 i915.enable_rc6=0 i915.enable_dc=0 i915.disable_power_well=0 i915.enable_execlists=0 i915.powersave=0 processor.max_cstate=0 intel.max_cstate=0 processor_idle.max_cstate=0 intel_idle.max_cstate=0 clocksource=tsc tsc=reliable nmi_watchdog=0 nosoftlockup intel_pstate=disable noht nosmap mce=ignore_mce nohalt acpi_irq_nobalance noirqbalance vt.handoff=7

Xenomai3/i-pipe Cobalt preemptive & priority scheduling Linux OS runtime

The Xenomai project (https://xenomai.org) is an open-source RTOS-to-Linux portability framework under the Creative Commons BY-SA 3.0 and GPLv2 licenses, which comes in two flavors:

  • As a co-kernel/real-time extension (RTE) for a patched Linux kernel, codenamed Cobalt

  • As libraries for native Linux (incl. PREEMPT-RT), codenamed Mercury. In the 3.0 branch, Xenomai aims at working both as a co-kernel and on top of PREEMPT_RT.

The Xenomai project merged a real-time core (the Cobalt core) into the Linux kernel, where it co-exists in kernel space. Interrupts and threads in the Cobalt core have higher priority than interrupts and threads in the Linux kernel. Because the Cobalt core executes fewer instructions than the Linux kernel, unnecessary delay and legacy overhead in the calling path are reduced.

../_images/xenomai_dual-kernel.png

Using this real-time-targeted design, a Xenomai-patched Linux kernel can achieve good real-time multi-threading performance.

Setting 2-stage interrupt pipeline [Head] and [Root] stages

The 2-stage interrupt pipeline is the underlying mechanism enabling the Xenomai real-time framework.

../_images/xenomai_ipipe-x86.png

The Xenomai-patched Linux kernel takes control of hardware interrupts that originally belong to the Linux kernel. Xenomai first handles the interrupts it is interested in, then routes the remaining interrupts to the Linux kernel. The former path is named the head stage and the latter the root stage:

  • The [Head] stage corresponds to the Cobalt core (real-time domain or out-of-band context)

  • The [Root] stage corresponds to the Linux kernel (non-real-time domain or in-band context)

The head stage has higher priority than the root stage and offers the shortest response time that the hardware and software can deliver.

../_images/xenomai_interrupt_latency.png

The Xenomai patches that implement these mechanics are named I-pipe patches; the x86 version patches are hosted at: https://xenomai.org/gitlab/ipipe-x86

Setting POSIX thread context migration between [Head] and [Root] stages

On Linux, the taskset command allows a user to change a process’s CPU affinity. It is typically used in conjunction with the CPU isolation determined by the kernel cmdline. The example script below demonstrates changing the real-time process affinity to Core 1:

#!/bin/bash

cpu="1"
cycle="250"
time_total="345600"

taskset -c $cpu /usr/bin/latency -g ./latency.histo.log -s -p $cycle -c $cpu -P 99 -T $time_total 2>&1 | tee ./latency.log

Whether workloads are actually running isolated on their CPUs can be determined by monitoring the state of currently running tasks. One utility that enables monitoring of task CPU affinity is htop.
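
Besides htop, a task's effective affinity can be read directly from procfs. The sketch below uses PID 1 only because it always exists; substitute the PID of the real-time workload:

```shell
# Cpus_allowed_list shows the CPU ranges the task may run on.
grep Cpus_allowed_list /proc/1/status
```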

Cobalt core threads are not entirely isolated from the Linux kernel’s threads. Instead, Cobalt reuses ordinary kthreads and adds special capabilities: a kthread can jump between Cobalt’s real-time (out-of-band) context and the common Linux kernel (in-band) context. The advantage is that, while in the in-band context, the thread can use the Linux kernel’s infrastructure. In a typical scenario, a Cobalt thread starts up as a normal kthread, calls Linux kernel APIs for preparation work, then switches to the out-of-band context and behaves as a Cobalt thread to perform real-time work. The disadvantage is that, while in the out-of-band context, the Cobalt thread is easily migrated back to the in-band context by a mistakenly called Linux kernel API. Such a migration is quite difficult to discover: developers mistakenly assume their thread is running under the Cobalt core and do not notice the issue until they check the ftrace output or the task misses its deadline.

../_images/xenomai_scheduling2.gif

A Xenomai/Cobalt POSIX-based userspace application can shadow the same thread between the preemptive and the common time-sharing (SCHED_OTHER) scheduling policies:

  • Secondary mode: the mode in which Linux GPOS services and Linux [ROOT]-domain device drivers are accessible (i.e. the ps -x or top Linux commands can be used).

  • Primary mode: the mode in which all Xenomai RTOS services and RTDM [HEAD]-domain device drivers are accessible.

root@intel-corei7-64:~# cat /proc/xenomai/sched/stat
CPU  PID    MSW        CSW        XSC        PF    STAT       %CPU  NAME
  0  0      0          5321688352 0          0     00018000   96.8  [ROOT/0]
  1  0      0          1067292    0          0     00018000  100.0  [ROOT/1]
  1  852    1          1          5          0     000680c0    0.0  latency
  1  854    532167     1064334    532171     0     00068042    0.0  display-852
  0  855    2          5321669632 5322202288 0     0004c042    2.4  sampling-852
  1  0      0          13288313   0          0     00000000    0.0  [IRQ4355: [timer]]

pthread POSIX skin runtime scheduling APIs

When linked with libcobalt.so, the Linux pthread_create() and pthread_setschedparam() POSIX scheduling system calls with the SCHED_FIFO and SCHED_RR policies are trampolined to Xenomai/Cobalt tasks by a mechanism called shadowing.
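
When building such an application on a Xenomai 3 system, the xeno-config helper reports the compile and link flags that pull in libcobalt. This is a sketch; it assumes a standard Xenomai 3 installation and is guarded so it degrades gracefully elsewhere:

```shell
# Print the flags needed to build a POSIX-skin (libcobalt) application.
if command -v xeno-config >/dev/null 2>&1; then
    xeno-config --skin=posix --cflags
    xeno-config --skin=posix --ldflags
else
    echo "xeno-config not installed"
fi
```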

On top of those, Xenomai/Cobalt also provides specific scheduling policies:

  • SCHED_TP implements the temporal-partitioning scheduling policy for groups of threads (a group can be one or more threads)

  • SCHED_SPORADIC implements a task-server scheduler used to run sporadic activities under a quota, to avoid perturbing periodic tasks (running under SCHED_RR or SCHED_FIFO)

  • SCHED_QUOTA implements a budget-based scheduling policy: a group of threads is suspended once its budget is exceeded, and the budget is refilled every quota interval

../_images/xenomai_scheduling_policy.png

Setting High Resolution Timers thread in Xenomai

Xenomai3/Cobalt High Resolution Timers (CONFIG_XENO_OPT_TIMER_RBTREE=y) allow the available x86 hardware timers to be used to create time-interval-based high-priority [HEAD] interrupts:

  • [host-timer/x] and [watchdog] are multiplexed into a number of software-programmable timers exposed to the Cobalt core

  • timerfd_handler serves POSIX timer API calls from userspace

  • clock_nanosleep() provides accurate thread wakeup via high-resolution timer hardware offload.

The Linux userspace filesystem interface allows the user to report Cobalt timer information:

# cat /proc/xenomai/timer/coreclk
CPU  SCHED/SHOT            TIMEOUT     INTERVAL    NAME
0    79845094/32427139     419us       -           [host-timer/0]
0    673759/673758         164ms579us  1s          [watchdog]
1    41484498/15201506     419us       -           [host-timer/1]
1    673759/673758         164ms583us  1s          [watchdog]
0    2440455925/2440400928  94us       100us       timerfd_handler

Further reading reference

This document will not cover all the technical details of the Xenomai framework. Please refer to the official documentation for further reading.

RT-Scheduling Sanity-Check Testing

The following section is applicable to:

../_images/target8.png

In the following test procedure, the admin user logs in as the root user.

Sanity-Check #1: User Monitor Thread CPU core Affinity

  1. Step - run the ps command to report a tree of all processes executing on the computer, listed together with their process ID, scheduling policy, real-time priority, and command line:

    root@intel-corei7-64:~# ps f -g 0 -o pid,policy,rtprio,cmd
     PID POL RTPRIO CMD
        2 TS       - [kthreadd]
        3 TS       -  \_ [ksoftirqd/0]
        4 FF       1  \_ [ktimersoftd/0]
        6 TS       -  \_ [kworker/0:0H]
        8 FF       1  \_ [rcu_preempt]
        9 FF       1  \_ [rcu_sched]
       10 FF       1  \_ [rcub/0]
       11 FF       1  \_ [rcuc/0]
       12 TS       -  \_ [kswork]
       13 FF      99  \_ [posixcputmr/0]
       14 FF      99  \_ [migration/0]
       15 TS       -  \_ [cpuhp/0]
       16 TS       -  \_ [cpuhp/2]
       17 FF      99  \_ [migration/2]
       18 FF       1  \_ [rcuc/2]
       19 FF       1  \_ [ktimersoftd/2]
       20 TS       -  \_ [ksoftirqd/2]
       22 TS       -  \_ [kworker/2:0H]
       23 FF      99  \_ [posixcputmr/2]
       24 TS       -  \_ [kdevtmpfs]
       25 TS       -  \_ [netns]
       27 TS       -  \_ [oom_reaper]
       28 TS       -  \_ [writeback]
       29 TS       -  \_ [kcompactd0]
       30 TS       -  \_ [crypto]
       31 TS       -  \_ [bioset]
       32 TS       -  \_ [kblockd]
       33 FF      50  \_ [irq/9-acpi]
       34 TS       -  \_ [md]
       35 TS       -  \_ [watchdogd]
       36 TS       -  \_ [rpciod]
       37 TS       -  \_ [xprtiod]
       39 TS       -  \_ [kswapd0]
       40 TS       -  \_ [vmstat]
       41 TS       -  \_ [nfsiod]
       63 TS       -  \_ [kthrotld]
       66 TS       -  \_ [bioset]
       67 TS       -  \_ [bioset]
       68 TS       -  \_ [bioset]
       69 TS       -  \_ [bioset]
       70 TS       -  \_ [bioset]
       71 TS       -  \_ [bioset]
       72 TS       -  \_ [bioset]
       73 TS       -  \_ [bioset]
       74 TS       -  \_ [bioset]
       75 TS       -  \_ [bioset]
       76 TS       -  \_ [bioset]
       77 TS       -  \_ [bioset]
       78 TS       -  \_ [bioset]
       79 TS       -  \_ [bioset]
       80 TS       -  \_ [bioset]
       81 TS       -  \_ [bioset]
       83 TS       -  \_ [bioset]
       84 TS       -  \_ [bioset]
       85 TS       -  \_ [bioset]
       86 TS       -  \_ [bioset]
       87 TS       -  \_ [bioset]
       88 TS       -  \_ [bioset]
       89 TS       -  \_ [bioset]
       90 TS       -  \_ [bioset]
       92 FF      50  \_ [irq/27-idma64.0]
       93 FF      50  \_ [irq/27-i2c_desi]
       94 FF      50  \_ [irq/28-idma64.1]
       95 FF      50  \_ [irq/28-i2c_desi]
       96 FF      50  \_ [irq/29-idma64.2]
       98 FF      50  \_ [irq/29-i2c_desi]
      100 FF      50  \_ [irq/30-idma64.3]
      101 FF      50  \_ [irq/30-i2c_desi]
      102 FF      50  \_ [irq/31-idma64.4]
      103 FF      50  \_ [irq/31-i2c_desi]
      104 FF      50  \_ [irq/32-idma64.5]
      106 FF      50  \_ [irq/32-i2c_desi]
      107 FF      50  \_ [irq/33-idma64.6]
      108 FF      50  \_ [irq/33-i2c_desi]
      109 FF      50  \_ [irq/34-idma64.7]
      110 FF      50  \_ [irq/34-i2c_desi]
      111 FF      50  \_ [irq/4-idma64.8]
      112 FF      50  \_ [irq/5-idma64.9]
      113 FF      50  \_ [irq/35-idma64.1]
      115 FF      50  \_ [irq/37-idma64.1]
      117 TS       -  \_ [nvme]
      118 FF      50  \_ [irq/365-xhci_hc]
      121 TS       -  \_ [scsi_eh_0]
      122 TS       -  \_ [scsi_tmf_0]
      123 TS       -  \_ [usb-storage]
      124 TS       -  \_ [dm_bufio_cache]
      125 FF      50  \_ [irq/39-mmc0]
      126 FF      49  \_ [irq/39-s-mmc0]
      127 FF      50  \_ [irq/42-mmc1]
      128 FF      49  \_ [irq/42-s-mmc1]
      129 TS       -  \_ [ipv6_addrconf]
      144 TS       -  \_ [bioset]
      175 TS       -  \_ [bioset]
      177 TS       -  \_ [mmcqd/0]
      182 TS       -  \_ [bioset]
      187 TS       -  \_ [mmcqd/0boot0]
      189 TS       -  \_ [bioset]
      191 TS       -  \_ [mmcqd/0boot1]
      193 TS       -  \_ [bioset]
      195 TS       -  \_ [mmcqd/0rpmb]
      332 TS       -  \_ [kworker/2:1H]
      342 TS       -  \_ [kworker/0:1H]
      407 TS       -  \_ [jbd2/mmcblk0p2-]
      409 TS       -  \_ [ext4-rsv-conver]
      429 TS       -  \_ [bioset]
      463 TS       -  \_ [loop0]
      466 TS       -  \_ [jbd2/loop0-8]
      467 TS       -  \_ [ext4-rsv-conver]
      558 FF      50  \_ [irq/366-mei_me]
      559 FF      50  \_ [irq/8-rtc0]
      560 FF      50  \_ [irq/35-pxa2xx-s]
      561 TS       -  \_ [spi1]
      563 FF      50  \_ [irq/37-pxa2xx-s]
      564 TS       -  \_ [spi3]
      572 FF      50  \_ [irq/369-i915]
      798 FF       1  \_ [i915/signal:0]
      800 FF       1  \_ [i915/signal:1]
      801 FF       1  \_ [i915/signal:2]
      802 FF       1  \_ [i915/signal:4]
      832 FF      50  \_ [irq/367-enp2s0]
      835 FF      50  \_ [irq/368-enp3s0]
      844 FF      50  \_ [irq/4-serial]
      846 FF      50  \_ [irq/5-serial]
     4167 TS       -  \_ [kworker/0:1]
     4194 TS       -  \_ [kworker/u8:1]
     4234 TS       -  \_ [kworker/2:0]
     4242 TS       -  \_ [kworker/0:0]
     4288 TS       -  \_ [kworker/u8:0]
     4313 TS       -  \_ [kworker/2:1]
     4318 TS       -  \_ [kworker/0:2]
        1 TS       - /sbin/init initrd=\initrd LABEL=boot processor.max_cstate=0 intel_idle.max_cstate=0 clocksource=tsc tsc=reliable nmi_watchdog=0 nosoftlockup intel_pstate=disable i915.disable_power_well=0 i915.enable_rc6=0 noht 3 snd_hda_intel.power_save=1 snd_hda_intel.power_save_controller=y scsi_mod.scan=async console=ttyS2,115200 rootwait console=ttyS0,115200 console=tty0
      491 TS       - /lib/systemd/systemd-journald
      525 TS       - /lib/systemd/systemd-udevd
      551 TS       - /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
      571 TS       - /usr/sbin/jhid -d
      625 TS       - /usr/sbin/connmand -n
      629 TS       - /lib/systemd/systemd-logind
      631 TS       - /usr/sbin/thermald --no-daemon --dbus-enable
      658 TS       - /usr/sbin/ofonod -n
      712 TS       - /usr/sbin/acpid
      828 TS       - /sbin/agetty --noclear tty1 linux
      829 TS       - /sbin/agetty -8 -L ttyS0 115200 xterm
      830 TS       - /sbin/agetty -8 -L ttyS1 115200 xterm
      834 TS       - /usr/sbin/wpa_supplicant -u
      836 TS       - /usr/sbin/nmbd
      849 TS       - /usr/sbin/smbd
      850 TS       -  \_ /usr/sbin/smbd
      851 TS       -  \_ /usr/sbin/smbd
      853 TS       -  \_ /usr/sbin/smbd
      855 TS       - /usr/sbin/dropbear -i -r /etc/dropbear/dropbear_rsa_host_key -B
      856 TS       -  \_ -sh
     3804 TS       - /usr/sbin/dropbear -i -r /etc/dropbear/dropbear_rsa_host_key -B
     3805 TS       -  \_ -sh
     4394 TS       -  \_ ps f -g 0 -o pid,policy,rtprio,cmd
     4393 TS       - /sbin/agetty -8 -L ttyS2 115200 xterm
    

    Expected outcomes:

    • The processes between brackets belong to the kernel.

    • Processes marked `TS` use the regular best-effort time-sharing scheduling policy.

    • Processes marked `FF` use a real-time policy (FIFO).

  2. Step - run htop, an interactive system monitor, process viewer, and process manager, with the following command:

    $ htop
    

    Expected outcomes:

    • htop shows a continuously updated listing of the processes running on the computer, normally ordered by the amount of CPU usage.

    • Color-coding provides visual information about processor, swap and memory status.

    • Additional columns can be added to the output of htop. To configure htop to filter tasks by CPU affinity, follow the steps below:

      1. Press the F2 key to open the menu system

      2. Press the down arrow key until “Columns” is selected from the “Setup” category

      3. Press the right arrow key until the selector moves to the “Available Columns” section

      4. Press the down arrow key until “PROCESSOR” is selected from the “Available Columns” section

      5. Press the Enter key to add “PROCESSOR” to the “Active Columns” section

      6. Press the F10 key to complete the addition

      7. Press the F6 key to open the “Sort by” menu

      8. Use the →↑↓ arrow keys to move the selection to “PROCESSOR”

      9. Press the Enter key to filter tasks by processor affinity

    Note

    Workloads may consist of many individual child tasks. These child tasks are part of the workload, and it is acceptable and preferred that they run on the same CPU.

Sanity-Check #2: User Monitor Kernel Interrupts

  1. Step - identify unwanted interrupt sources by monitoring the number of interrupts that occur on each CPU with the following command:

    $ watch -n 0.1 cat /proc/interrupts
    

    Expected outcome This command polls the processor every 100 milliseconds and displays interrupt counts. Ideally, isolated CPUs 1, 2, and 3 should not show any interrupt counts incrementing. In practice, the Linux “Local timer interrupt” may still be occurring, but properly prioritized workloads should not be impacted.
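
An alternative to watch(1) that makes slow increments easier to spot is to snapshot /proc/interrupts twice and diff the counts (the /tmp paths are arbitrary):

```shell
# Two snapshots one second apart; lines that differ carry the IRQs that
# fired during the interval.
cat /proc/interrupts > /tmp/irq.before
sleep 1
cat /proc/interrupts > /tmp/irq.after
diff /tmp/irq.before /tmp/irq.after | grep '^[<>]' || echo "no change in interrupt counts"
```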

Sanity-Check #3: Determine CPU LLC Cache Allocation preset

Note

If target system’s CPU supports CAT, it can greatly help to reduce the worst case jitter. Please see section Cache Allocation Technology for description and usage.

  1. Step - Admin user verifies the cache-partitioning configuration (per CPU or per CPU module) used to mitigate LLC cache misses, which cause thread execution overhead, page-fault scheduling penalties, etc.:

    $ pqos -s
    

    Expected outcome output from a target system utilizing CAT is shown below:

    ../_images/pqos-s.png

Sanity-Check #4: Check IA UEFI firmware setting

UEFI firmware BIOS menu settings should be modified to improve target system’s real-time performance.

  1. Step - Admin user verifies that the CPU’s SpeedStep/Speed Shift is turned off, the CPU frequency is fixed, and the CPU always remains in the C0 state;

    Expected outcome please refer to Recommended ECI-B/X BIOS Optimizations

  2. Step - Admin user verifies that Hyper-Threading is disabled;

    Expected outcome please refer to Recommended ECI-B/X BIOS Optimizations

  3. Step - Admin user verifies the north-complex power-management policy for graphics state/frequency and bus fabric (e.g. Gersville/GV, …);

    Expected outcome please refer to Recommended ECI-B/X BIOS Optimizations

  4. Step - Check north-complex IP power management: PCIe ASPM, USB PM, … (varies greatly by OEM and SKU).

    Expected outcome please refer to power-management policy Recommended ECI-B/X BIOS Optimizations

Remember that the Linux kernel can override BIOS settings if the related hardware registers are exposed to kernel space.

Note

BIOS menu items vary among board vendors. Some useful configuration items may be hidden by the OEM, leaving no opportunity to modify them.

Sanity-Check #5: Check Linux kernel cmdline parameters

Certain kernel boot parameters should be added for tuning the real-time performance.

  1. Step - Admin user reviews the kernel boot command line, fixed as documented under ECI Kernel Boot Optimizations, which isolates CPUs 1, 2, and 3:

    • i915.* graphics power management on/off

    • processor.* and intel.* power-saving and clock-gating features

    • isolcpus CPU isolation

    • irqaffinity CPU interrupt affinity

    Expected outcome: an example with CPU 1 isolated (reserved for real-time processes) and IRQ affinity bound to CPU 0:

    i915.enable_rc6=0 i915.enable_dc=0 i915.disable_power_well=0 i915.enable_execlists=0 i915.powersave=0 processor.max_cstate=0 intel.max_cstate=0 processor_idle.max_cstate=0 intel_idle.max_cstate=0 clocksource=tsc tsc=reliable nmi_watchdog=0 nosoftlockup intel_pstate=disable noht nosmap mce=ignore_mce nohalt acpi_irq_nobalance noirqbalance vt.handoff=7 rcu_nocbs=1 rcu_nocb_poll nohz_full=1 isolcpus=1 irqaffinity=0 vt.handoff=1
    
  2. Step - Admin user adds the kernel cmdline to GRUB_CMDLINE_LINUX="" in /etc/default/grub; do not forget to run update-grub before rebooting.

Note

The thermal condition of the target system must be monitored after changing the BIOS and kernel cmdline settings. Lower the CPU frequency and add cooling if the CPU runs too hot.

Sanity-Check #6: User reports thread KPIs as latency histograms

Minimally invasive Linux tracing events are commonly used to report the latency histogram of a particular thread over a long runtime, for example:

  • Thread wakeup + scheduling + execution time KPIs overhead

  • Thread semaphore acquire/release performance

  • Thread WCET jitter

  1. Step - Admin user checks whether the kernel configuration enables /sys/kernel/debug/tracing, in order to establish comparable KPI measurements across various Linux kernel runtimes, i.e. PREEMPT_RT and Cobalt:

    if [ -d /sys/kernel/debug/tracing ] ; then echo PASS; else echo FAIL; fi
    

    Expected outcome: PASS

  2. Step - Admin user checks that the trace-event RAM buffer records the multi-threaded scheduling timeline with high time precision (i.e. nanosecond TSC-clock epoch time):

    echo nop > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
    echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
    echo 1 > /sys/kernel/debug/tracing/tracing_on
    sleep 5
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
    echo 0 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
    

    Export all records to the filesystem:

    cat /sys/kernel/debug/tracing/trace > ~/ftrace_buffer_dump.txt
    

    Expected outcome:

           Task2-1891  ( 1026) [000] d..h2..  6528.499461: hrtimer_cancel: hrtimer=ffffb09ec138be58
           Task2-1891  ( 1026) [000] d..h1..  6528.499461: hrtimer_expire_entry: hrtimer=ffffb09ec138be58 function=hrtimer_wakeup now=6528499005755
    ts0--> Task2-1891  ( 1026) [000] d..h2..  6528.499461: sched_waking: comm=Task pid=1890 prio=33 target_cpu=000
           Task2-1891  ( 1026) [000] d..h3..  6528.499462: sched_wakeup: comm=Task pid=1890 prio=33 target_cpu=000
           Task2-1891  ( 1026) [000] d..h1..  6528.499462: hrtimer_expire_exit: hrtimer=ffffb09ec138be58
           Task2-1891  ( 1026) [000] d..h1..  6528.499462: write_msr: 6e0, value 136d628c2e73e7
           Task2-1891  ( 1026) [000] d..h1..  6528.499463: local_timer_exit: vector=239
           Task2-1891  ( 1026) [000] d...2..  6528.499463: sched_waking: comm=ktimersoftd/0 pid=8 prio=98 target_cpu=000
           Task2-1891  ( 1026) [000] d...3..  6528.499463: sched_wakeup: comm=ktimersoftd/0 pid=8 prio=98 target_cpu=000
           Task2-1891  ( 1026) [000] .......  6528.499464: sys_exit: NR 202 = 1
           Task2-1891  ( 1026) [000] ....1..  6528.499464: sys_futex -> 0x1
           Task2-1891  ( 1026) [000] .......  6528.499472: sys_enter: NR 230 (1, 1, 7ff75266fdb0, 0, 2, 7fff10bda080)
           Task2-1891  ( 1026) [000] ....1..  6528.499472: sys_clock_nanosleep(which_clock: 1, flags: 1, rqtp: 7ff75266fdb0, rmtp: 0)
           Task2-1891  ( 1026) [000] .......  6528.499472: hrtimer_init: hrtimer=ffffb09ec13bbe58 clockid=CLOCK_MONOTONIC mode=ABS
           Task2-1891  ( 1026) [000] d...1..  6528.499473: hrtimer_start: hrtimer=ffffb09ec13bbe58 function=hrtimer_wakeup expires=6528499988112 softexpires=6528499988112 mode=ABS
           Task2-1891  ( 1026) [000] d...1..  6528.499473: write_msr: 6e0, value 136d628c2e1be1
           Task2-1891  ( 1026) [000] d...1..  6528.499474: rcu_utilization: Start context switch
           Task2-1891  ( 1026) [000] d...1..  6528.499474: rcu_utilization: End context switch
           Task2-1891  ( 1026) [000] d...2..  6528.499475: sched_switch: prev_comm=Task2 prev_pid=1891 prev_prio=33 prev_state=D ==> next_comm=Task next_pid=1890 next_prio=33
           Task2-1891  ( 1026) [000] d...2..  6528.499475: x86_fpu_regs_deactivated: x86/fpu: ffff96a6199156c0 initialized: 1 xfeatures: 3 xcomp_bv: 800000000000001f
           Task2-1891  ( 1026) [000] d...2..  6528.499475: write_msr: c0000100, value 7ff75292c700
           Task2-1891  ( 1026) [000] d...2..  6528.499476: x86_fpu_regs_activated: x86/fpu: ffff96a619913880 initialized: 1 xfeatures: 3 xcomp_bv: 800000000000001f
            Task-1890  ( 1026) [000] .......  6528.499476: sys_exit: NR 230 = 0
            Task-1890  ( 1026) [000] ....1..  6528.499476: sys_clock_nanosleep -> 0x0
            Task-1890  ( 1026) [000] .......  6528.499482: sys_enter: NR 230 (1, 1, 7ff75292bdb0, 0, 2, 7fff10bda080)
            Task-1890  ( 1026) [000] ....1..  6528.499483: sys_clock_nanosleep(which_clock: 1, flags: 1, rqtp: 7ff75292bdb0, rmtp: 0)
            Task-1890  ( 1026) [000] .......  6528.499483: hrtimer_init: hrtimer=ffffb09ec138be58 clockid=CLOCK_MONOTONIC mode=ABS
            Task-1890  ( 1026) [000] d...1..  6528.499483: hrtimer_start: hrtimer=ffffb09ec138be58 function=hrtimer_wakeup expires=6528500005368 softexpires=6528500005368 mode=ABS
            Task-1890  ( 1026) [000] d...1..  6528.499483: rcu_utilization: Start context switch
            Task-1890  ( 1026) [000] d...1..  6528.499484: rcu_utilization: End context switch
    ts1-->  Task-1890  ( 1026) [000] d...2..  6528.499484: sched_switch: prev_comm=Task prev_pid=1890 prev_prio=33 prev_state=D ==> next_comm=rcuc/0 next_pid=12 next_prio=98
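
    The latency this excerpt illustrates is simply ts1 - ts0. As a rough
    sanity check, a hypothetical helper (a plain awk one-liner, not part of
    tracefs) can extract both timestamps from a saved dump such as
    ~/ftrace_buffer_dump.txt; here it is fed the two marked sample lines
    directly:

```shell
# Hypothetical post-processing sketch: compute the ts0 -> ts1 scheduling
# latency (in microseconds) from the sched_waking and final sched_switch
# timestamps of an ftrace dump. Field 6 is the "seconds.usecs:" timestamp.
latency=$(awk '
  /sched_waking: comm=Task /      { sub(/:$/, "", $6); ts0 = $6 }
  /sched_switch: prev_comm=Task / { sub(/:$/, "", $6); ts1 = $6 }
  END { printf "%.0f us", (ts1 - ts0) * 1000000 }
' <<'EOF'
Task2-1891  ( 1026) [000] d..h2..  6528.499461: sched_waking: comm=Task pid=1890 prio=33 target_cpu=000
 Task-1890  ( 1026) [000] d...2..  6528.499484: sched_switch: prev_comm=Task prev_pid=1890 prev_prio=33 prev_state=D ==> next_comm=rcuc/0 next_pid=12 next_prio=98
EOF
)
echo "scheduling latency: $latency"
```

    On the two sample lines this reports 23 us (6528.499484 - 6528.499461).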
    
  3. Step - Admin User reviews trace-event specific print format

    cat /sys/kernel/debug/tracing/events/cobalt_core/sched_switch/format
    cat /sys/kernel/debug/tracing/events/cobalt_core/cobalt_switch_context/format
    

    expected outcomes

    name: cobalt_switch_context
    ID: 451
    format:
            field:unsigned short common_type;       offset:0;       size:2; signed:0;
            field:unsigned char common_flags;       offset:2;       size:1; signed:0;
            field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
            field:int common_pid;   offset:4;       size:4; signed:1;
    
            field:struct xnthread * prev;   offset:8;       size:8; signed:0;
            field:struct xnthread * next;   offset:16;      size:8; signed:0;
            field:__data_loc char[] prev_name;      offset:24;      size:4; signed:1;
            field:__data_loc char[] next_name;      offset:28;      size:4; signed:1;
    
    print fmt: "prev=%p(%s) next=%p(%s)", REC->prev, __get_str(prev_name), REC->next, __get_str(next_name)
    
  4. Step - Admin User enables a primary trace-event as a hist:keys trigger start condition

    • hist:keys trigger = add event data to a histogram instead of logging it to the trace buffer

    • if () event filters narrow down the number of triggering events

    • vals= variables evaluate and save quantities across multiple events

    $ echo 'hist:keys=common_pid:vals=common_timestamp.usecs,pid:ts0=common_timestamp.usecs if ( comm == "IEC_mainTask" )' >>  \
            /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
    

    expected outcomes

    $ cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
    # event histogram
    #
    # trigger info: hist:keys=common_pid:vals=hitcount,common_timestamp.usecs,pid:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if ( comm == "IEC_mainTask" ) [active]
    #
    
        { common_pid: 1890 } hitcount:        146  common_timestamp: 8563360302
    
        Totals:
            Hits: 146
            Entries: 1
            Dropped: 0
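
    As a quick consistency check on a saved copy of this histogram (a
    hypothetical post-processing step, not a tracefs feature), the per-key
    hitcounts should add up to the reported Hits total:

```shell
# Sum every "hitcount:" field of a saved hist dump; the single sample
# bucket above carries hitcount 146, matching "Hits: 146".
hist_hits=$(awk '
  { for (i = 1; i <= NF; i++) if ($i == "hitcount:") sum += $(i + 1) }
  END { print sum }
' <<'EOF'
    { common_pid: 1890 } hitcount:        146  common_timestamp: 8563360302
EOF
)
echo "summed hitcount: $hist_hits"
```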
    
  5. Step - Admin User enables synthetic_events as a means to create user-defined trace-events.

    $ echo 'iectask_wcet u64 lat; pid_t pid' > \
            /sys/kernel/debug/tracing/synthetic_events
    
    $ cat /sys/kernel/debug/tracing/events/synthetic/iectask_wcet/format
    

    expected outcomes N/A

  6. Step - Admin User enables a secondary trace-event hist:keys trigger as stop condition, with actions.

    Trigger actions inject computed quantities back into the trace-event subsystem:

    • onmatch(event).synthetic_event(...) = generate a synthetic event when a matching key is found

    • onmax(var).save(...) = save maximum latency values together with arbitrary context

    • snapshot() = capture a snapshot of the ftrace buffer

    $ echo 'hist:keys=common_pid:latency=common_timestamp.usecs-$ts0:\
        onmatch(sched.sched_wakeup).iectask_wcet($latency,pid) \
        if ( prev_comm == "IEC_mainTask" )' >> \
        /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
    

    expected outcomes N/A

  7. Step - Admin User reports synthetic_events as a histogram sorted from min to max.

    $ echo 'hist:keys=pid,lat:sort=pid,lat' \
        >> /sys/kernel/debug/tracing/events/synthetic/iectask_wcet/trigger
    
    $ cat /sys/kernel/debug/tracing/events/synthetic/iectask_wcet/hist
    

    expected outcomes

    # event histogram
    #
    # trigger info: hist:keys=pid,lat:vals=hitcount:sort=pid,lat:size=2048 [active]
    #
    
    { pid:        854, lat:          6 } hitcount:          2
    { pid:        854, lat:          7 } hitcount:        109
    { pid:        854, lat:          8 } hitcount:         55
    { pid:        854, lat:          9 } hitcount:          6
    { pid:        854, lat:         10 } hitcount:          2
    
    Totals:
        Hits: 174
        Entries: 5
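
    Since the histogram is sorted by lat, the last bucket is the observed
    worst case. A hypothetical post-processing sketch (plain awk over a
    saved copy of the dump, assuming the bucket layout shown above) derives
    the WCET and the total sample count:

```shell
# Parse "{ pid: 854, lat: 6 } hitcount: 2"-style bucket lines:
# lat is field 5, the hitcount value is field 8. Track the maximum
# lat (worst-case execution latency, in usecs) and sum all hitcounts.
summary=$(awk '
  /hitcount:/ {
    lat  = $5 + 0
    hits = $8 + 0
    total += hits
    if (lat > wcet) wcet = lat
  }
  END { printf "wcet=%d us, hits=%d", wcet, total }
' <<'EOF'
{ pid:        854, lat:          6 } hitcount:          2
{ pid:        854, lat:          7 } hitcount:        109
{ pid:        854, lat:          8 } hitcount:         55
{ pid:        854, lat:          9 } hitcount:          6
{ pid:        854, lat:         10 } hitcount:          2
EOF
)
echo "$summary"
```

    On the five buckets above this reports wcet=10 us, hits=174, matching
    the Totals block.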
    

Note

Some tips and tricks

  1. <event>/trigger syntax errors are reported under <event>/hist (e.g. ERROR: Variable already defined: ts2)

  2. Systematically ERASE an existing <event>/trigger using the '!' character BEFORE issuing another command into the same <event>/trigger, e.g. echo '!hist:keys=thread:…' >> <event>/trigger

  3. ONLY one hist trigger per <event> can exist at a time