
Explanation of the values in /proc/PID/sched

This article is under development, so might change a few times in a short period. Got feedback or ideas to extend it? Let's improve it together!


Introduction

The Linux scheduler handles task scheduling, one of the most important jobs of the kernel: deciding which thread may run and for how long. This should be done in such a way that each task gets enough time on the CPU. To ensure important processes get more CPU time, the scheduler maintains a runqueue and assigns each task a priority.

Tracking statistics about tasks

The Linux scheduler also tracks processes and provides statistics in the form of sched files in the /proc directory. This is useful for developers and system administrators when troubleshooting issues. For each process ID (PID) there is a file available with the specifics for that PID, so for the first PID the related file would be /proc/1/sched. But what do these statistics mean? Although the Linux kernel has its source code available, understanding the details is definitely not easy for the average system administrator. This article tries to shed (pun intended: sched) some light on the file structure and its contents.

Showing details and enabling more statistics

To see the scheduler information for a process ID, simply look in the related file.

# cat /proc/1/sched
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :      46944911.963339
se.vruntime                                  :           155.103393
se.sum_exec_runtime                          :           566.369810
se.nr_migrations                             :                   56
nr_switches                                  :                 2685
nr_voluntary_switches                        :                 2334
nr_involuntary_switches                      :                  351
se.load.weight                               :              1048576
se.avg.load_sum                              :                  108
se.avg.runnable_sum                          :               110592
se.avg.util_sum                              :               110592
se.avg.load_avg                              :                    0
se.avg.runnable_avg                          :                    0
se.avg.util_avg                              :                    0
se.avg.last_update_time                      :       46944911963136
se.avg.util_est.ewma                         :                    9
se.avg.util_est.enqueued                     :                    0
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   36
mm->numa_scan_seq                            :                    0
numa_pages_migrated                          :                    0
numa_preferred_nid                           :                   -1
total_numa_faults                            :                    0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0

The list already contains a good amount of information. To enable even more detailed statistics, activate schedstats using sysctl by setting kernel.sched_schedstats=1.

sysctl kernel.sched_schedstats=1
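This change does not survive a reboot. A minimal sketch to make it persistent, assuming a distribution that reads drop-in files from /etc/sysctl.d (the file name is arbitrary):

# enable extended scheduler statistics at boot
echo "kernel.sched_schedstats=1" > /etc/sysctl.d/90-schedstats.conf
# apply all sysctl settings now, without rebooting
sysctl --system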

With this enabled, the number of available fields increases significantly.

# cat /proc/1/sched
systemd (1, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :      46944911.963339
se.vruntime                                  :           155.103393
se.sum_exec_runtime                          :           566.369810
se.nr_migrations                             :                   56
sum_sleep_runtime                            :             0.000000
sum_block_runtime                            :             0.000000
wait_start                                   :             0.000000
sleep_start                                  :      46944911.963339
block_start                                  :             0.000000
sleep_max                                    :             0.000000
block_max                                    :             0.000000
exec_max                                     :             0.110262
slice_max                                    :             0.000000
wait_max                                     :             0.000000
wait_sum                                     :             0.000000
wait_count                                   :                    1
iowait_sum                                   :             0.000000
iowait_count                                 :                    0
nr_migrations_cold                           :                    0
nr_failed_migrations_affine                  :                    0
nr_failed_migrations_running                 :                    0
nr_failed_migrations_hot                     :                    0
nr_forced_migrations                         :                    0
nr_wakeups                                   :                    1
nr_wakeups_sync                              :                    0
nr_wakeups_migrate                           :                    0
nr_wakeups_local                             :                    0
nr_wakeups_remote                            :                    1
nr_wakeups_affine                            :                    0
nr_wakeups_affine_attempts                   :                    1
nr_wakeups_passive                           :                    0
nr_wakeups_idle                              :                    0
avg_atom                                     :             0.210938
avg_per_cpu                                  :            10.113746
nr_switches                                  :                 2685
nr_voluntary_switches                        :                 2334
nr_involuntary_switches                      :                  351
se.load.weight                               :              1048576
se.avg.load_sum                              :                  108
se.avg.runnable_sum                          :               110592
se.avg.util_sum                              :               110592
se.avg.load_avg                              :                    0
se.avg.runnable_avg                          :                    0
se.avg.util_avg                              :                    0
se.avg.last_update_time                      :       46944911963136
se.avg.util_est.ewma                         :                    9
se.avg.util_est.enqueued                     :                    0
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   27
mm->numa_scan_seq                            :                    0
numa_pages_migrated                          :                    0
numa_preferred_nid                           :                   -1
total_numa_faults                            :                    0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0

With these additional fields available, let's have a look at some of the most important fields and their values.

Fields and values

clock-delta

The clock-delta value is displayed when statistics are enabled. It is the difference between two consecutive reads of the CPU clock: the kernel retrieves the clock value twice in a row, stores the delta, and displays it. This gives an indication of the cost of querying the clock itself.

File: kernel/sched/debug.c

#undef P_SCHEDSTAT

	{
		unsigned int this_cpu = raw_smp_processor_id();
		u64 t0, t1;

		/* read the clock of the current CPU twice in a row... */
		t0 = cpu_clock(this_cpu);
		t1 = cpu_clock(this_cpu);
		/* ...and print the difference between the two reads */
		__PS("clock-delta", t1-t0);
	}

	sched_show_numa(p, m);
}

iowait_count

Each time a process waits for a block device, such as a disk, to become available, the iowait_count counter is increased. Together with iowait_sum this provides insight into how often and how long a process is waiting.

iowait_sum

When a process is waiting on a block device, such as a disk, the time is recorded. A process with a high iowait_sum value is waiting often (or long) for I/O to complete. The value is expressed in nanoseconds.
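
To quickly check just these two fields for a given process, filter the file; a small sketch (PID 1 is used as an example, and non-zero values require schedstats to be enabled):

# show only the I/O wait statistics for PID 1
grep -E '^(iowait_sum|iowait_count)' /proc/1/sched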

mm->numa_scan_seq

The field mm->numa_scan_seq is related to memory management. It counts the completed scan sequences over the memory of the process that are performed for NUMA balancing.

numa_preferred_nid

The value of numa_preferred_nid refers to the preferred NUMA node ID for the task; a value of -1 means no preference has been established. The available NUMA nodes can be shown using numactl with the --hardware option.

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 63970 MB
node 0 free: 1696 MB
node distances:
node   0 
  0:  10 

An alternative command is numastat, which also shows the available nodes, including a few statistics.

# numastat
                           node0
numa_hit                  973528
numa_miss                      0
numa_foreign                   0
interleave_hit               784
local_node                973528
other_node                     0

policy

The policy field refers to the scheduling policy. Typically the value 0 will be seen, which is the SCHED_NORMAL policy, also referred to as SCHED_OTHER. The full mapping is shown below, followed by an example of querying the policy with chrt.

Number   Policy
0        SCHED_NORMAL
1        SCHED_FIFO
2        SCHED_RR
3        SCHED_BATCH
4        Reserved
5        SCHED_IDLE
6        SCHED_DEADLINE
7        SCHED_EXT

Policy with number 4 is reserved for SCHED_ISO.

Related file: include/uapi/linux/sched.h
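
Instead of mapping the number manually, the chrt utility from util-linux can translate the policy of a running process. For PID 1 the output looks similar to this (exact wording may differ between versions):

# chrt -p 1
pid 1's current scheduling policy: SCHED_OTHER
pid 1's current scheduling priority: 0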

prio

The prio field shows the kernel priority of a process. Usually it shows the value 120, unless the priority has been changed with a command like renice. Setting the nice value to 10 means the value of prio goes up to 130. The lower the number, the higher the priority of the task within the runqueue that gives CPU time to tasks.

Scheduler policy                         Return value   Kernel priority   User priority (nice value)
SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE    0 … 39         100 … 139         -20 … 19 (default 0)
SCHED_FIFO, SCHED_RR                     -2 … -100      98 … 0            1 … 99
SCHED_DEADLINE                           -101           -1                0

For most processes the policy SCHED_NORMAL (also known as SCHED_OTHER) is used. With a nice value of 0, the task sits in the middle of the range 100-139, which is 120.
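
The effect is easy to verify with renice; a short sketch, assuming PID 1234 is an existing process running under SCHED_NORMAL:

# renice -n 10 -p 1234
1234 (process ID) old priority 0, new priority 10
# grep '^prio' /proc/1234/sched
prio                                         :                  130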

se.sum_exec_runtime

The value of se.sum_exec_runtime is the total time the task has spent running on a CPU. The kernel tracks this in nanoseconds; in the file it is displayed in milliseconds with decimals.

se.vruntime

The se.vruntime value presents the weighted time a task has run on the CPU. This virtual runtime is used to decide which task deserves to run next and is calculated using the task priority, nice value, applicable cgroups, etc. The higher the weight (higher priority), the slower vruntime increases. The process with the lowest vruntime is the next candidate for CPU time.
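
To see the value evolve, sample the field twice for a process; a quick sketch using PID 1 (for a mostly idle task the value barely moves):

# sample se.vruntime twice, one second apart
grep '^se.vruntime' /proc/1/sched; sleep 1; grep '^se.vruntime' /proc/1/sched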

wait_sum

The time the task spent waiting on a runqueue before being given CPU time (tracked in nanoseconds).

wait_count

The number of times the task waited on a runqueue.
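
Dividing wait_sum by wait_count gives the average time the task waits per scheduling round; a small awk sketch (again with PID 1, and only meaningful when schedstats is enabled and wait_count is non-zero):

# average runqueue wait per enqueue, in the same unit as wait_sum
awk -F: '/^wait_sum/ {sum=$2} /^wait_count/ {count=$2} END {if (count > 0) print sum/count, "average wait"}' /proc/1/sched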


Feedback


This article has been written by our Linux security expert Michael Boelen. With a focus on creating high-quality articles and relevant examples, he wants to improve the field of Linux security. No more web full of copy-pasted blog posts.

Discovered outdated information or have a question? Share your thoughts. Thanks for your contribution!

