Explanation of the values in /proc/PID/sched
Introduction
The Linux scheduler handles task scheduling, one of the most important jobs of the kernel: deciding which thread may run and for how long, in such a way that every thread gets enough time on the CPU. To make sure important processes are favored, the scheduler uses a runqueue and assigns a priority to each task.
Tracking statistics about tasks
Another part of the Linux scheduler's job is tracking processes and providing statistics in the form of sched files in the /proc directory. These are useful for developers and system administrators when troubleshooting. For each process ID (PID) there is a file with the specifics for that PID, so for PID 1 the related file is /proc/1/sched. But what do these statistics mean? Although the source code of the Linux kernel is available, understanding the details is definitely not easy for the average system administrator. This article tries to sched light (pun intended) on the file structure and its contents.
Showing details and enabling more statistics
To see the scheduler information for a process ID, simply look in the related file.
# cat /proc/1/sched
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 46944911.963339
se.vruntime : 155.103393
se.sum_exec_runtime : 566.369810
se.nr_migrations : 56
nr_switches : 2685
nr_voluntary_switches : 2334
nr_involuntary_switches : 351
se.load.weight : 1048576
se.avg.load_sum : 108
se.avg.runnable_sum : 110592
se.avg.util_sum : 110592
se.avg.load_avg : 0
se.avg.runnable_avg : 0
se.avg.util_avg : 0
se.avg.last_update_time : 46944911963136
se.avg.util_est.ewma : 9
se.avg.util_est.enqueued : 0
policy : 0
prio : 120
clock-delta : 36
mm->numa_scan_seq : 0
numa_pages_migrated : 0
numa_preferred_nid : -1
total_numa_faults : 0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0
The list already contains a good amount of information. To enable even more detailed statistics, we can activate them with sysctl by setting kernel.sched_schedstats=1.
# sysctl kernel.sched_schedstats=1
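Whether the extra statistics are currently enabled can also be checked from a script. The sketch below assumes that the sysctl is exposed as the file /proc/sys/kernel/sched_schedstats, which is how procfs normally mirrors kernel.* sysctls. The function names are our own and not part of any tool.

```python
from pathlib import Path


def flag_is_set(text):
    """Return True when the content of the sysctl file is '1'."""
    return text.strip() == "1"


def schedstats_enabled(path="/proc/sys/kernel/sched_schedstats"):
    """Return True or False for the sysctl state, or None if it cannot be read.

    The default path is an assumption about where procfs exposes the sysctl.
    """
    try:
        return flag_is_set(Path(path).read_text())
    except OSError:
        # File missing (non-Linux system, older kernel) or not readable.
        return None


print(schedstats_enabled())
```

A result of None means the file could not be read, which is different from the statistics being disabled.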
The number of available fields has now increased considerably.
# cat /proc/1/sched
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 46944911.963339
se.vruntime : 155.103393
se.sum_exec_runtime : 566.369810
se.nr_migrations : 56
sum_sleep_runtime : 0.000000
sum_block_runtime : 0.000000
wait_start : 0.000000
sleep_start : 46944911.963339
block_start : 0.000000
sleep_max : 0.000000
block_max : 0.000000
exec_max : 0.110262
slice_max : 0.000000
wait_max : 0.000000
wait_sum : 0.000000
wait_count : 1
iowait_sum : 0.000000
iowait_count : 0
nr_migrations_cold : 0
nr_failed_migrations_affine : 0
nr_failed_migrations_running : 0
nr_failed_migrations_hot : 0
nr_forced_migrations : 0
nr_wakeups : 1
nr_wakeups_sync : 0
nr_wakeups_migrate : 0
nr_wakeups_local : 0
nr_wakeups_remote : 1
nr_wakeups_affine : 0
nr_wakeups_affine_attempts : 1
nr_wakeups_passive : 0
nr_wakeups_idle : 0
avg_atom : 0.210938
avg_per_cpu : 10.113746
nr_switches : 2685
nr_voluntary_switches : 2334
nr_involuntary_switches : 351
se.load.weight : 1048576
se.avg.load_sum : 108
se.avg.runnable_sum : 110592
se.avg.util_sum : 110592
se.avg.load_avg : 0
se.avg.runnable_avg : 0
se.avg.util_avg : 0
se.avg.last_update_time : 46944911963136
se.avg.util_est.ewma : 9
se.avg.util_est.enqueued : 0
policy : 0
prio : 120
clock-delta : 27
mm->numa_scan_seq : 0
numa_pages_migrated : 0
numa_preferred_nid : -1
total_numa_faults : 0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0
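Since almost every statistic is printed as a "name : value" pair, the file is easy to read from a script. The following is a minimal sketch, not an official interface: parse_sched is a hypothetical helper, and it simply skips the header, the separator line, and the NUMA lines that do not follow the "name : value" layout. The sample text is a shortened copy of the output above.

```python
def parse_sched(text):
    """Turn the 'name : value' lines of a sched file into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separator line or NUMA summary lines without a colon
        try:
            number = int(value)
        except ValueError:
            try:
                number = float(value)
            except ValueError:
                continue  # header line such as "systemd (1, #threads: 1)"
        fields[key.strip()] = number
    return fields


# Shortened copy of the example output shown in this article.
SAMPLE = """\
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 46944911.963339
se.vruntime : 155.103393
nr_switches : 2685
policy : 0
prio : 120
numa_preferred_nid : -1
current_node=0, numa_group_id=0
"""

fields = parse_sched(SAMPLE)
print(fields["prio"], fields["nr_switches"], fields["se.vruntime"])
```

On a live system the same function can be fed with the contents of a real file, for example the text read from /proc/self/sched.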
With that many fields available, let’s have a look at the most important ones and their values.
Fields and values
clock-delta
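The clock-delta value is always displayed, whether or not the extra statistics are enabled (it appears in both outputs above). It is the difference, in nanoseconds, between two back-to-back reads of the CPU clock, so it shows how long a single clock read takes. The delta is calculated at the moment the file is read.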
File: kernel/sched/debug.c
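#undef P_SCHEDSTAT

	{
		unsigned int this_cpu = raw_smp_processor_id();
		u64 t0, t1;

		/* Read the CPU clock twice in a row and print the difference. */
		t0 = cpu_clock(this_cpu);
		t1 = cpu_clock(this_cpu);
		__PS("clock-delta", t1-t0);
	}

	sched_show_numa(p, m);
}	/* excerpt: closes the enclosing function */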
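The same idea can be tried out in user space. This is only an analogy: it uses Python's time.monotonic_ns() instead of the kernel's cpu_clock(), so the numbers are not comparable with the clock-delta field.

```python
import time


def clock_delta_ns():
    """Read a monotonic clock twice in a row and return the difference in ns."""
    t0 = time.monotonic_ns()
    t1 = time.monotonic_ns()
    return t1 - t0


print("clock-delta:", clock_delta_ns())
```

As with the kernel field, the result differs from run to run, which matches the two different clock-delta values in the outputs above.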
iowait_count
Each time a process has to wait for a block device, such as a disk, to become available, the iowait_count counter is increased. Together with iowait_sum, this provides insight into how often and how long a process is waiting.
iowait_sum
When a process is waiting on a block device, such as a disk, the time is recorded. A high iowait_sum value means that the process is waiting often (or long) for IO to become available. The kernel tracks the value in nanoseconds; the file prints it in milliseconds with six decimal places.
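Dividing the two values gives the average duration of a single IO wait. A small sketch, assuming the two numbers were taken from the file by hand or by a script. The name average_iowait and the input values are made up for illustration, and the result is in whatever unit iowait_sum is printed in.

```python
def average_iowait(iowait_sum, iowait_count):
    """Average time per IO wait; 0.0 when the task never waited for IO."""
    if iowait_count == 0:
        return 0.0
    return iowait_sum / iowait_count


# Hypothetical values, not taken from a real system.
print(average_iowait(12.5, 5))
# The values from the example output above (no IO waits recorded).
print(average_iowait(0.0, 0))
```

A high average points to slow individual waits, while a high count with a low average points to many short ones.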
mm->numa_scan_seq
The field mm->numa_scan_seq is related to memory management. This value stores a completed scan sequence that is related to NUMA balancing.
numa_preferred_nid
The value of numa_preferred_nid refers to the preferred NUMA node ID of the task. A value of -1, as in the example output, means that no preferred node is set. The available NUMA nodes can be shown using numactl with the --hardware option.
# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 63970 MB
node 0 free: 1696 MB
node distances:
node 0
0: 10
An alternative command is numastat, which also shows the available nodes, including a few statistics.
# numastat
node0
numa_hit 973528
numa_miss 0
numa_foreign 0
interleave_hit 784
local_node 973528
other_node 0
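The node IDs can also be collected without extra tools. The sketch below assumes the usual sysfs layout, in which every NUMA node shows up as a directory named nodeN under /sys/devices/system/node. Both helper names are hypothetical.

```python
import os
import re


def node_ids_from_names(names):
    """Pick the numeric IDs out of directory names such as 'node0'."""
    ids = []
    for name in names:
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def numa_nodes(base="/sys/devices/system/node"):
    """Return the NUMA node IDs found under base (assumed sysfs location)."""
    try:
        return node_ids_from_names(os.listdir(base))
    except OSError:
        # Directory missing, for example on a system without this sysfs tree.
        return []


print(numa_nodes())
```

The returned IDs are the values that can show up in numa_preferred_nid and current_node.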
policy
The policy field refers to the scheduling policy. Typically the value 0 will be seen, which is the SCHED_NORMAL policy, also referred to as SCHED_OTHER.
Number | Policy |
---|---|
0 | SCHED_NORMAL |
1 | SCHED_FIFO |
2 | SCHED_RR |
3 | SCHED_BATCH |
4 | Reserved |
5 | SCHED_IDLE |
6 | SCHED_DEADLINE |
7 | SCHED_EXT |
Policy with number 4 is reserved for SCHED_ISO.
Related file: include/uapi/linux/sched.h
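When processing the file in a script, the number can be translated back to a name with a simple lookup. The mapping below is copied from the table above. The names POLICY_NAMES and policy_name are our own.

```python
# Policy numbers as listed in the table above.
POLICY_NAMES = {
    0: "SCHED_NORMAL",
    1: "SCHED_FIFO",
    2: "SCHED_RR",
    3: "SCHED_BATCH",
    4: "reserved (SCHED_ISO)",
    5: "SCHED_IDLE",
    6: "SCHED_DEADLINE",
    7: "SCHED_EXT",
}


def policy_name(number):
    """Translate the value of the policy field into a readable name."""
    return POLICY_NAMES.get(number, f"unknown ({number})")


print(policy_name(0))
print(policy_name(6))
```

Unknown numbers are reported as such instead of raising an error, so the lookup keeps working if a newer kernel adds a policy.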
prio
The prio field shows the kernel priority of a process. Usually it shows the value 120, unless the priority has been changed with a command such as renice. Setting the nice value to 10 raises the value of prio to 130. The lower the number, the higher the priority of the task within the runqueue that hands out CPU time.
Scheduler policy | Return value | Kernel priority | User priority (nice value) |
---|---|---|---|
SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE | 0 … 39 | 100 … 139 | -20 … 19 (default 0) |
SCHED_FIFO, SCHED_RR | -2 … -100 | 98 … 0 | 1 … 99 |
SCHED_DEADLINE | -101 | -1 | 0 |
For most processes the policy SCHED_NORMAL (also known as SCHED_OTHER) is used. With a nice value of 0, the task sits in the middle of the range 100-139, which is 120.
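The relation between the user-facing priority and the prio field follows directly from the table. A sketch with a hypothetical helper kernel_prio, using only the ranges listed above:

```python
def kernel_prio(policy, nice=0, rt_priority=0):
    """Derive the expected prio value from the ranges in the table above."""
    if policy in ("SCHED_NORMAL", "SCHED_BATCH", "SCHED_IDLE"):
        if not -20 <= nice <= 19:
            raise ValueError("nice must be between -20 and 19")
        return 120 + nice  # maps -20..19 onto 100..139
    if policy in ("SCHED_FIFO", "SCHED_RR"):
        if not 1 <= rt_priority <= 99:
            raise ValueError("rt_priority must be between 1 and 99")
        return 99 - rt_priority  # maps 1..99 onto 98..0
    if policy == "SCHED_DEADLINE":
        return -1
    raise ValueError(f"unsupported policy: {policy}")


print(kernel_prio("SCHED_NORMAL"))           # 120
print(kernel_prio("SCHED_NORMAL", nice=10))  # 130
print(kernel_prio("SCHED_FIFO", rt_priority=99))  # 0
```

This is only a restatement of the table. It does not account for temporary changes such as priority boosts.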
se.sum_exec_runtime
The value of se.sum_exec_runtime is the total time spent on the CPU. The kernel tracks it in nanoseconds; the file prints it in milliseconds with six decimal places.
se.vruntime
The se.vruntime value represents the weighted time a task has run on the CPU (tracked in nanoseconds, printed in milliseconds). Virtual runtime is used to decide which task deserves to run next and is calculated using the task priority, nice value, applicable cgroups, etc. The higher the weight (higher priority), the slower vruntime grows for the same amount of real runtime. The process with the lowest vruntime is the next candidate for more CPU time.
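How the weight influences vruntime can be illustrated with a little arithmetic. The sketch below is a simplification of what the fair scheduler does: it assumes that real runtime is scaled by 1024 (the weight of a nice 0 task) divided by the weight of the task, and it uses a handful of weights as they appear in the kernel's nice-to-weight table. The function name and the sample runtime are made up for the example, and cgroups are ignored.

```python
NICE_0_WEIGHT = 1024

# A few entries of the kernel's nice-to-weight table (nice value -> weight).
WEIGHTS = {-20: 88761, -10: 9548, 0: 1024, 10: 110, 19: 15}


def vruntime_delta(delta_exec_ns, nice):
    """Virtual runtime added for delta_exec_ns of real runtime (simplified)."""
    return delta_exec_ns * NICE_0_WEIGHT // WEIGHTS[nice]


# One millisecond of real runtime at different nice values.
for nice in sorted(WEIGHTS):
    print(nice, vruntime_delta(1_000_000, nice))
```

A task with nice 0 advances its vruntime at the same pace as real time, a task with a negative nice value advances more slowly, and a task with a positive nice value advances faster. On 64-bit kernels the weight is stored scaled up for extra precision, which is why the example output shows se.load.weight as 1048576 (1024 × 1024) for a nice 0 task.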
wait_sum
The time spent waiting on a runqueue (tracked in nanoseconds, printed in milliseconds).
wait_count
The number of times the task has waited on a runqueue.
For comparison, se.sum_exec_runtime is the total time the process ran on the CPU, measured in real time and tracked in nanoseconds.
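Put together, wait_sum and se.sum_exec_runtime tell what share of its runnable time a task spent queued instead of running. A rough sketch: wait_share is a hypothetical helper, and both inputs must use the same unit.

```python
def wait_share(wait_sum, sum_exec_runtime):
    """Fraction of runnable time spent waiting on a runqueue (0.0 to 1.0)."""
    total = wait_sum + sum_exec_runtime
    if total == 0:
        return 0.0
    return wait_sum / total


# Values from the example output above: no recorded wait time.
print(wait_share(0.0, 566.369810))
# Hypothetical values: a quarter of the time was spent waiting.
print(wait_share(25.0, 75.0))
```

A share close to 1.0 suggests the task is competing heavily for CPU time, while a share close to 0.0 means it usually runs as soon as it becomes runnable.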