Systemd syscall filtering
Introduction
Systemd uses seccomp to filter system calls (syscalls). Syscalls are the interface through which a process requests services from the kernel. Programs usually do not invoke them directly, but go through wrapper functions provided by the C library, typically glibc. With seccomp support in the kernel, individual syscalls can be blocked. Systemd exposes this through the SystemCallFilter setting, which allows or denies specific syscalls for a service.
Besides allowing or denying individual syscalls, systemd also provides predefined filter sets. A set groups similar or related syscalls under one name, which can then be allowed or denied as a whole.
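As a minimal sketch of how a deny-list could look in practice, the script below writes a drop-in file for a hypothetical example.service. The unit name, the drop-in file name, and the chosen sets are assumptions for illustration; SystemCallFilter and SystemCallErrorNumber are the settings documented in systemd.exec(5). A leading ~ inverts the list, so the listed sets are denied and everything else stays allowed.

```shell
#!/bin/sh
# Sketch: write a deny-list drop-in for a hypothetical unit (example.service).
# The target directory is a parameter, so the file can be reviewed before it is
# copied to /etc/systemd/system/example.service.d/ on a real system.
write_syscall_dropin() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/syscall-filter.conf" <<'EOF'
[Service]
# Deny-list: the leading ~ blocks the listed sets, other syscalls stay allowed
SystemCallFilter=~@clock @module
# Return EPERM to the caller instead of terminating the process
SystemCallErrorNumber=EPERM
EOF
}

# Without an argument, write to a temporary directory and show the result
out_dir="${1:-$(mktemp -d)}"
write_syscall_dropin "$out_dir"
cat "$out_dir/syscall-filter.conf"
```

After installing such a drop-in, the unit has to be reloaded (systemctl daemon-reload) and the service restarted before the filter applies. Without the ~, the same line would act as an allow-list and permit only the listed sets.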
What syscalls does a process use?
A web server running nginx should obviously be allowed to listen for network traffic on a port like 80 or 443. Most likely it does not need to be able to change the system clock, while an NTP daemon does. But how do you know what syscalls are used in the first place?
Dynamic analysis
Want to discover which syscalls a running process uses? You can attach strace to it, although this decreases its performance and may in rare cases crash it. So when possible, do this only on systems that are not in production.
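One way to reduce a trace to a plain list of syscall names is sketched below. It assumes the trace was written with the -o option of strace (optionally with -f to follow child processes) and without timestamp options; the function name and the file name in the comments are made up for illustration.

```shell
#!/bin/sh
# Usage idea (on a test system, not in production):
#   strace -f -o trace.log -p <PID>    # attach to a running process, stop with Ctrl+C
#   list_traced_syscalls < trace.log
#
# Reads strace output on stdin and prints every syscall name once.
# An optional "[pid N]" or "N" prefix (added when following children) is dropped.
# Lines that do not start with a call, such as signals, the exit status,
# or "<... resumed>" continuations, are skipped.
list_traced_syscalls() {
    sed -n -E 's/^(\[pid +[0-9]+\] +|[0-9]+ +)?([a-z_][a-z0-9_]*)\(.*/\2/p' | sort -u
}
```

If only a rough overview is needed, strace -c prints a summary with a count per syscall instead of a full trace.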
Binary analysis
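Another way is to look at the binary and see which functions it references. This gives only a rough indication, as it lists function names that appear as strings in the binary, not the syscalls that are actually made.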
strings /usr/sbin/nginx | grep -E '^[a-z0-9_]{4,32}\(\)' | awk '{print $1}' | sort | uniq
Filter sets
Systemd uses filter sets to allow or deny functionality per group. To see the content of a set (e.g. @clock):
systemd-analyze syscall-filter @clock
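To compare a set with what a service actually uses, the listing can be reduced to bare syscall names. The sketch below assumes the output consists of the set name on the first line, followed by indented entries, with the description on a line starting with #; verify this against your systemd version. Both function names and the file names in the comments are hypothetical.

```shell
#!/bin/sh
# Print the syscall names of a filter set listing read from stdin.
# Skips blank lines, the "# description" line, and @set names
# (the header itself and any nested sets, which are not expanded here).
set_members() {
    awk 'NF && $1 !~ /^[@#]/ { print $1 }' | sort -u
}

# Print the names from file $1 (one per line) that are absent from file $2.
missing_from_set() {
    grep -vxF -f "$2" "$1"
}

# Usage idea:
#   systemd-analyze syscall-filter @clock | set_members > allowed.txt
#   missing_from_set used.txt allowed.txt
```

Here used.txt stands for a list of syscalls observed for the service, one name per line. Any name that is printed is used by the service but not covered by the set.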
To simplify looking up the contents of each set, they are collected in the overview below.
@default
Description: These system calls are always permitted
Syscall | Purpose |
---|---|
arch_prctl | |
brk | Change the location of program break, specifically the end of the process's data segment |
cacheflush | Flushes contents of cache(s) for user addresses in specified range |
clock_getres | Retrieve the resolution (precision) of a specified clock |
clock_getres_time64 | |
clock_gettime | Retrieve time from specified clock |
clock_gettime64 | 64-bit version of clock_gettime() |
clock_nanosleep | |
clock_nanosleep_time64 | |
execve | Executes the program referred to by specified pathname |
exit | Terminates the calling process, parent process will receive a SIGCHLD signal |
exit_group | |
futex | Provides a method for waiting until certain condition becomes true |
futex_time64 | |
futex_waitv | |
get_robust_list | |
get_thread_area | |
getegid | Returns effective group ID of the calling process |
getegid32 | |
geteuid | Retrieve effective user ID of calling process |
geteuid32 | 32-bit version of geteuid() |
getgid | Returns real group ID of the calling process |
getgid32 | 32-bit version of getgid() |
getgroups | Returns supplementary group IDs of calling process |
getgroups32 | 32-bit version of getgroups() |
getpgid | Retrieve process group ID (PGID) |
getpgrp | Retrieves process group ID (PGID) of the calling process |
getpid | Returns process ID (PID) of calling process |
getppid | Returns process ID (PID) of parent of the calling process |
getrandom | Obtain random bytes |
getresgid | |
getresgid32 | |
getresuid | |
getresuid32 | |
getrlimit | Get resource limits |
getsid | Retrieve session ID of the specified process |
gettid | Returns thread ID (TID) of caller. Same as process ID (PID) for single-threaded applications, otherwise different |
gettimeofday | Get time or timezone |
getuid | Returns real user ID of the calling process |
getuid32 | 32-bit version of getuid() |
membarrier | |
mmap | Create new mapping in the virtual address space of the calling process |
mmap2 | |
mprotect | Sets protection on a defined region of memory |
munmap | Deletes the mappings for specified address range and marks range to generate invalid memory references |
nanosleep | Suspends the execution of the thread calling the request |
pause | Puts the calling thread or process to sleep until a signal is received |
prlimit64 | |
restart_syscall | |
riscv_flush_icache | |
riscv_hwprobe | |
rseq | |
rt_sigreturn | |
sched_getaffinity | Retrieves the CPU affinity mask, which defines on which CPUs the thread can run |
sched_yield | Causes the calling thread to give up the CPU and move to the end of the run queue, so the next thread can run |
set_robust_list | |
set_thread_area | |
set_tid_address | |
set_tls | |
sigreturn | |
time | Return time; as number of seconds since the Epoch (1970-01-01 00:00:00 +0000 (UTC)) |
ugetrlimit | |
@aio
Description: Asynchronous IO
Syscall | Purpose |
---|---|
io_cancel | Attempts to cancel asynchronous I/O operation that was submitted by io_submit() |
io_destroy | |
io_getevents | |
io_pgetevents | |
io_pgetevents_time64 | |
io_setup | |
io_submit | Submit asynchronous I/O blocks for processing, can be cancelled with io_cancel() |
io_uring_enter | |
io_uring_register | |
io_uring_setup | |
@basic-io
Description: Basic IO
Almost all software requires this set to open a file, read from it, or write to it.
Syscall | Purpose |
---|---|
_llseek | |
close | Close file descriptor |
close_range | Close the file descriptors of the selected range |
dup | Duplicate file descriptor; more specifically it allocates a new file descriptor that also refers to open file description oldfd |
dup2 | Same as dup(), duplicate file descriptor; difference is that it uses file descriptor number specified in newfd |
dup3 | Same as dup2(); difference is that caller can force close-on-exec flag (O_CLOEXEC) to be set |
lseek | Reposition file offset for read/write |
pread64 | Similar to pread() but with an offset of type off64_t that allows changing file positions in files larger than two gigabytes |
preadv | Reads data into multiple buffers as readv(), with option to set an offset where in the file the read operation is to be performed. |
preadv2 | Reads data into multiple buffers like preadv(), with additional flags |
pwrite64 | Similar to pwrite() but with an offset of type off64_t that allows changing file positions in files larger than two gigabytes |
pwritev | Writes data from multiple buffers like writev(), with option to set an offset where in the file the write operation is to be performed |
pwritev2 | Writes data from multiple buffers like pwritev(), with additional flags |
read | Read from file descriptor |
readv | Read buffers from file |
write | Write to file descriptor |
writev | Writes buffers to file |
@chown
Description: Ability to change ownership of files and directories
Syscall | Purpose |
---|---|
chown | Changes ownership of file specified by pathname, dereferenced if file is a symbolic link |
chown32 | |
fchown | Changes ownership of file, referred to by open file descriptor (fd) |
fchown32 | |
fchownat | Similar to chown(), but deals differently with relative paths |
lchown | Like chown(), does not dereference symbolic links |
lchown32 | |
@clock
Description: Ability to change system time
Note: this is rarely needed for normal services.
Syscall | Purpose |
---|---|
adjtimex | Reads and optionally sets adjustment parameters for clock adjustment algorithm used on Linux (RFC 5905) |
clock_adjtime | Reads and optionally sets adjustment parameters for clock adjustment algorithm used on Linux (RFC 5905). It behaves like adjtimex(), but takes an additional clk_id argument to define the clock |
clock_adjtime64 | |
clock_settime | Set time of specified clock |
clock_settime64 | |
settimeofday | Set time or timezone |
@cpu-emulation
Description: Ability to do CPU emulation
Syscall | Purpose |
---|---|
modify_ldt | |
subpage_prot | |
switch_endian | |
vm86 | |
vm86old | |
@debug
Description: Debugging, performance monitoring, tracing functionality
Note: this is normally only used by tools like strace and perf.
Syscall | Purpose |
---|---|
lookup_dcookie | |
perf_event_open | |
pidfd_getfd | |
ptrace | Process tracing; usually for breakpoint debugging and system call tracing |
rtas | |
s390_runtime_instr | |
sys_debug_setcontext | |
@file-system
Description: File system operations
Note: normally all processes need this to be able to read a directory or open a file.
Syscall | Purpose |
---|---|
access | Checks whether the calling process can access the pathname, dereferenced when it is a symbolic link |
chdir | Change work directory |
chmod | Change mode of the file, dereferenced for symbolic links |
close | Close file descriptor |
creat | Like open(), but sets flags O_CREAT|O_WRONLY|O_TRUNC |
faccessat | Similar to access(), works slightly differently when pathname is relative |
faccessat2 | Closely similar to faccessat() but implements flags argument to correct incorrect implementation in faccessat() |
fallocate | Allows the caller to manipulate allocated disk space for a file |
fchdir | Similar to chdir, but uses open file descriptor |
fchmod | Same as chmod(), but the file is specified by open file descriptor fd |
fchmodat | Similar to chmod(), works slightly differently when pathname is relative |
fchmodat2 | |
fcntl | Performs an action on file defined by a file descriptor, such as setting flags |
fcntl64 | 64-bit version of fcntl() |
fgetxattr | Retrieves xattr (extended attributes) by using a file descriptor returned by open() |
flistxattr | Like listxattr() it retrieves the list of extended attributes (xattr) but associated with an open file descriptor |
fremovexattr | |
fsetxattr | |
fstat | Similar to stat(), but uses file descriptor fd |
fstat64 | |
fstatat64 | |
fstatfs | Returns information about the mounted file system that contains the open file referred to by file descriptor fd |
fstatfs64 | |
ftruncate | Truncate a file open for writing to specified number of bytes, which may extend it with null bytes (\0) or decrease its size and lose data |
ftruncate64 | 64-bit version of ftruncate() |
futimesat | |
getcwd | Copies the absolute pathname of current working directory to a buffer |
getdents | Retrieve entries from a directory |
getdents64 | 64-bit version of getdents() |
getxattr | Retrieves xattr (extended attributes) from a given file defined by its path and name |
inotify_add_watch | Adds a new watch or change an existing watch for a file |
inotify_init | Initializes new inotify instance and returns file descriptor that is associated with a new inotify event queue |
inotify_init1 | Like inotify_init() it initializes a new inotify instance and returns file descriptor that is associated with a new inotify event queue, with additional flags |
inotify_rm_watch | Removes a watch with a watch descriptor (wd) from an inotify instance specified by its file descriptor (fd) |
lgetxattr | Retrieves xattr (extended attributes) of a symbolic link itself, not of the destination it links to |
link | Create new link (hard link) to existing file |
linkat | Similar to link(), but deals differently with relative paths |
listxattr | Retrieves the list of extended attributes (xattr) associated with the path |
llistxattr | Like listxattr, it retrieves the list of extended attributes (xattr) associated with the path, but if it is a symbolic link it will give information about the link, not the destination target |
lremovexattr | |
lsetxattr | |
lstat | Similar to stat(), but if pathname is symbolic link, return information about link and not the file that symbolic link points to |
lstat64 | |
mkdir | Create directory |
mkdirat | Similar to mkdir() but deals differently with relative paths |
mknod | Create filesystem node (file, device special file, or named pipe) named pathname |
mknodat | Similar to mknod(), works slightly differently when pathname is relative |
newfstatat | |
oldfstat | |
oldlstat | |
oldstat | |
open | Opens file specified by pathname to allow reading or writing data |
openat | Similar to open(), but uses dirfd argument and deals differently with path |
openat2 | |
readlink | Places a copy of the symbolic link referenced by its path into a buffer |
readlinkat | Like readlink() it places a copy of the symbolic link referenced by its path into a buffer, but can use relative path |
removexattr | |
rename | Rename a file, move it between directories if required |
renameat | Similar to rename(), but deals differently with relative paths |
renameat2 | Similar to renameat() when no flags are provided, otherwise it has additional options |
rmdir | Delete directory |
setxattr | |
stat | Get information about file |
stat64 | 64-bit version of stat() |
statfs | |
statfs64 | |
statx | |
symlink | Create symbolic link |
symlinkat | Similar to symlink() but deals differently with relative paths |
truncate | Truncate a writable file to specified number of bytes, which may extend it with null bytes (\0) or decrease its size and lose data |
truncate64 | 64-bit version of truncate() |
unlink | Delete name from filesystem |
unlinkat | Similar to unlink() but deals differently with relative paths |
utime | Change access and modification times of inode |
utimensat | |
utimensat_time64 | |
utimes | Similar to utime(), but uses array instead of a structure |
@io-event
Description: Event loop system calls
Syscall | Purpose |
---|---|
_newselect | |
epoll_create | |
epoll_create1 | |
epoll_ctl | Manage (add, modify, remove) entries in an epoll instance, which is used to monitor whether I/O is possible on the defined set of file descriptors. Similar to poll(), with additional benefits. |
epoll_ctl_old | |
epoll_pwait | |
epoll_pwait2 | |
epoll_wait | Waits for events on an epoll instance which is defined by a file descriptor (epfd) |
epoll_wait_old | |
eventfd | |
eventfd2 | |
poll | Similar task to select(2), which is waiting for a set of file descriptors to become available for I/O. |
ppoll | Let an application wait until file descriptor is available or signal is caught |
ppoll_time64 | |
pselect6 | |
pselect6_time64 | |
select | Let a program monitor multiple file descriptors until one or more become available for I/O actions. This system call has limitations and typically poll or epoll is used. |
@ipc
Description: SysV IPC, POSIX Message Queues or other Inter-Process Communication (IPC)
Syscall | Purpose |
---|---|
ipc | |
memfd_create | |
mq_getsetattr | |
mq_notify | |
mq_open | |
mq_timedreceive | |
mq_timedreceive_time64 | |
mq_timedsend | |
mq_timedsend_time64 | |
mq_unlink | |
msgctl | |
msgget | |
msgrcv | |
msgsnd | |
pipe | Create a pipe that allows unidirectional communication between processes |
pipe2 | Similar to pipe(), to create a channel between two processes. With flag O_DIRECT it will use packet-style communication instead of a stream |
process_madvise | |
process_vm_readv | |
process_vm_writev | |
semctl | |
semget | |
semop | |
semtimedop | |
semtimedop_time64 | |
shmat | |
shmctl | |
shmdt | |
shmget | |
@keyring
Description: Kernel keyring access
Syscall | Purpose |
---|---|
add_key | Create or update a key for kernel key management facility |
keyctl | Allow user-space programs to take actions on keys, such as updating, revocation, ownership |
request_key | Request a key from kernel key management facility |
@memlock
Description: Memory locking control
Syscall | Purpose |
---|---|
mlock | Lock pages in a specified address range, so they are guaranteed to stay in memory instead of being swapped to disk |
mlock2 | Same as mlock() if flags is 0. With flag MLOCK_ONFAULT it locks the currently resident pages, then marks the range so currently nonresident pages are locked later when they are used (page fault) |
mlockall | Similar to mlock, but tries to lock all the memory pages of the calling process to prevent swapping |
munlock | Opposite of mlock() to release lock on memory area, so it can be swapped to disk if needed |
munlockall | Unlocks all memory pages of calling process so it can be swapped to disk again by the kernel |
@module
Description: Ability to load or unload kernel modules
Syscall | Purpose |
---|---|
delete_module | Tries to remove an unused loadable module entry, which relates to a currently loaded Linux kernel module (LKM) |
finit_module | Similar to init_module(); loads image (ELF) but reads it from a file descriptor |
init_module | Load image (ELF) into the kernel space including the required steps to initialize it, including triggering the init() function of the module |
@mount
Description: Ability to mount or unmount a file system
Note: Most services will not need to use mount/umount
Syscall | Purpose |
---|---|
chroot | The root directory (which is normally /) of the calling process will be changed to the one specified in the path to sandbox a process |
fsconfig | |
fsmount | |
fsopen | |
fspick | |
mount | |
mount_setattr | |
move_mount | |
open_tree | |
pivot_root | |
umount | |
umount2 | |
@network-io
Description: Network or Unix socket actions, like opening a network port to listen
When to use: This filter set is only required for services that actually use sockets, for example to listen on a network port, connect to another host, or talk over a Unix socket.
Syscall | Purpose |
---|---|
accept | Accept a connection on a socket |
accept4 | |
bind | Assigns address to a socket that was created with socket() |
connect | Initiate connection on a defined socket |
getpeername | Retrieve address of the peer connected to a socket |
getsockname | Retrieve current address of defined socket |
getsockopt | Get options for socket |
listen | Marks socket as passive, so it can accept incoming connections with accept() |
recv | Like read(), but normally only used on a socket and has additional flags that can be set |
recvfrom | Receives a message on a socket, close to recv(), but can also return the source address of the message |
recvmmsg | |
recvmmsg_time64 | |
recvmsg | Receives a message on a socket with a predefined structure to minimize the number of arguments |
send | |
sendmmsg | |
sendmsg | |
sendto | |
setsockopt | Set options on socket |
shutdown | |
socket | Create endpoint for communication and return file descriptor |
socketcall | |
socketpair | Create a pair of connected sockets, for example for communication between parent and child process |
@obsolete
Description: Unusual, obsolete or unimplemented system calls, with some unknown to the underlying seccomp library
Syscall | Purpose |
---|---|
_sysctl | |
afs_syscall | |
bdflush | |
break | |
create_module | |
ftime | |
get_kernel_syms | |
getpmsg | |
gtty | |
idle | |
lock | |
mpx | |
prof | |
profil | |
putpmsg | |
query_module | |
security | |
sgetmask | |
ssetmask | |
stime | |
stty | |
sysfs | |
tuxcall | |
ulimit | |
uselib | |
ustat | |
vserver | |
@pkey
Description: Set of calls for memory protection keys
Syscall | Purpose |
---|---|
pkey_alloc | |
pkey_free | |
pkey_mprotect | |
@privileged
Description: System calls which typically need super-user capabilities. It also includes other filter sets, which systemd-analyze syscall-filter @privileged lists next to the syscalls below.
Syscall | Purpose |
---|---|
_sysctl | |
acct | |
bpf | |
capset | |
chroot | The root directory (which is normally /) of the calling process will be changed to the one specified in the path to sandbox a process |
fanotify_init | |
fanotify_mark | |
nfsservctl | |
open_by_handle_at | |
pivot_root | |
quotactl | |
quotactl_fd | |
setdomainname | Sets the NIS domain name to the defined value |
setfsuid | |
setfsuid32 | |
setgroups | |
setgroups32 | |
sethostname | Sets the hostname to the defined value |
setresuid | |
setresuid32 | |
setreuid | |
setreuid32 | |
setuid | |
setuid32 | |
vhangup | |
@process
Description: Process control, execution, namespacing operations
Syscall | Purpose |
---|---|
capget | Retrieve thread capabilities |
clone | Similar to fork(), creates a child process with more fine-grained options to define what is shared between calling process and child. By specifying a flag, this system call can also make the new process part of a newly created namespace. |
clone3 | Provides superset of the functionality of the older clone() interface to create child process |
execveat | |
fork | Create a new child process by duplicating the calling process, with caller becoming the parent process |
getrusage | |
kill | |
pidfd_open | Obtain a file descriptor referring to a process |
pidfd_send_signal | |
prctl | Perform operations on a process or thread, such as changing its capabilities, setting the name of the calling thread, setting the secure computing mode (seccomp), and more. |
rt_sigqueueinfo | |
rt_tgsigqueueinfo | |
setns | Allows calling thread to switch to a different namespace |
swapcontext | |
tgkill | |
times | Get process and child process times, including CPU time in userspace and by the system for the calling process, and similar for the child processes |
tkill | |
unshare | Allows a process to unshare parts of its execution context, such as the mount namespace, from other processes. Parts of the execution context are automatically shared with other processes when fork(2), vfork(2) or clone(2) are used. With this syscall no new process has to be created. |
vfork | |
wait4 | |
waitid | |
waitpid | Suspend the execution of the calling thread until one of the specified child processes (by PID) changes state, by default until it terminates. By using specific options, other state changes such as a stopped or resumed child can be waited for as well. The functionality of this system call is similar to wait(), yet with more control over the children and states. |
@raw-io
Description: Raw I/O port access
Syscall | Purpose |
---|---|
ioperm | |
iopl | |
pciconfig_iobase | |
pciconfig_read | |
pciconfig_write | |
s390_pci_mmio_read | |
s390_pci_mmio_write | |
@reboot
Description: Ability to reboot, or to prepare a reboot using kexec, which loads a kernel for later execution
Note: normal services do not need this set of syscalls.
Syscall | Purpose |
---|---|
kexec_file_load | Similar to kexec_load(), but uses file descriptor for kernel and initrd (initial ram disk) |
kexec_load | Load new kernel for later execution |
reboot | Reboots the system, or enables/disables reboot keystroke (default: Ctrl+Alt+Delete; changed using loadkeys(1)) |
@resources
Description: Ability to alter resource settings, such as process priority
Syscall | Purpose |
---|---|
ioprio_set | |
mbind | |
migrate_pages | |
move_pages | |
nice | Change process priority, ranging from +19 (lowest priority) to -20 (highest priority) |
sched_setaffinity | Sets the CPU affinity mask, which defines on which CPUs the thread can run |
sched_setattr | |
sched_setparam | |
sched_setscheduler | |
set_mempolicy | |
set_mempolicy_home_node | |
setpriority | |
setrlimit | Set resource limits |
@sandbox
Description: Sandbox functionality, such as support for Landlock and seccomp
Syscall | Purpose |
---|---|
landlock_add_rule | |
landlock_create_ruleset | |
landlock_restrict_self | |
seccomp | |
@setuid
Description: Operations for changing user/group credentials (setuid/setgid)
Syscall | Purpose |
---|---|
setgid | Set effective group ID of calling process, with CAP_SETGID capability it also sets real GID and saved set-group-ID |
setgid32 | |
setgroups | |
setgroups32 | |
setregid | |
setregid32 | |
setresgid | |
setresgid32 | |
setresuid | |
setresuid32 | |
setreuid | |
setreuid32 | |
setuid | Set effective user ID of calling process, with CAP_SETUID capability it also sets real UID and saved set-user-ID |
setuid32 | |
@signal
Description: Signal handling for processes
Syscall | Purpose |
---|---|
rt_sigaction | |
rt_sigpending | |
rt_sigprocmask | |
rt_sigsuspend | |
rt_sigtimedwait | |
rt_sigtimedwait_time64 | |
sigaction | |
sigaltstack | |
signal | |
signalfd | |
signalfd4 | |
sigpending | |
sigprocmask | |
sigsuspend | |
@swap
Description: Ability to enable or disable swap devices
Note: not required for normal services.
Syscall | Purpose |
---|---|
swapoff | |
swapon | |
@sync
Description: Synchronize files and memory to storage
Syscall | Purpose |
---|---|
fdatasync | |
fsync | |
msync | |
sync | |
sync_file_range | |
sync_file_range2 | |
syncfs | |
@system-service
Description: General system service operations
Besides the syscalls below, it also includes the following filter sets:
- @aio
- @basic-io
- @chown
- @default
- @file-system
- @io-event
- @ipc
- @keyring
- @memlock
- @network-io
- @process
- @resources
- @setuid
- @signal
- @sync
- @timer
Syscall | Purpose |
---|---|
arm_fadvise64_64 | |
capget | Retrieve thread capabilities |
capset | Set thread capabilities |
copy_file_range | |
fadvise64 | |
fadvise64_64 | |
flock | Apply or remove advisory lock on file |
get_mempolicy | |
getcpu | |
getpriority | |
ioctl | |
ioprio_get | |
kcmp | |
madvise | |
mremap | |
name_to_handle_at | |
oldolduname | |
olduname | |
personality | |
readahead | |
readdir | Read a directory |
remap_file_pages | |
sched_get_priority_max | |
sched_get_priority_min | |
sched_getattr | |
sched_getparam | |
sched_getscheduler | |
sched_rr_get_interval | |
sched_rr_get_interval_time64 | |
sched_yield | Causes the calling thread to give up the CPU and move to the end of the run queue, so the next thread can run |
sendfile | Copies data between one file descriptor and another |
sendfile64 | |
setfsgid | Sets the group identity used for performing file system checks |
setfsgid32 | |
setfsuid | |
setfsuid32 | |
setpgid | Sets the process group ID (PGID) of the process |
setsid | |
splice | |
sysinfo | |
tee | Duplicate pipe content, does not consume the data |
umask | Set file mode creation mask |
uname | Retrieve name and information about the current kernel |
userfaultfd | |
vmsplice | |
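Because @system-service bundles most of what an ordinary service needs, it can serve as the base of an allow-list from which unneeded groups are removed again. The sketch below writes such a drop-in for a hypothetical example.service; the unit name, the file name, and the sets that are removed are assumptions for illustration and should be validated per service.

```shell
#!/bin/sh
# Sketch: allow-list drop-in for a hypothetical unit (example.service).
write_allowlist_dropin() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/syscall-allowlist.conf" <<'EOF'
[Service]
# The first line has no ~, which makes this an allow-list
SystemCallFilter=@system-service
# Later lines with ~ remove entries from the allow-list again
SystemCallFilter=~@privileged @resources
# Only permit syscalls of the native ABI of the system
SystemCallArchitectures=native
EOF
}

# Without an argument, write to a temporary directory and show the result
out_dir="${1:-$(mktemp -d)}"
write_allowlist_dropin "$out_dir"
cat "$out_dir/syscall-allowlist.conf"
```

The order matters: the first SystemCallFilter line decides whether the list allows or denies, and following lines add to or remove from it. After installing the drop-in and restarting the service, systemd-analyze security example.service can be used to review the resulting settings.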
@timer
Description: Timers, to schedule operations by time
Syscall | Purpose |
---|---|
alarm | Schedule an alarm; it lets the system generate a SIGALRM signal for the process after a specified time |
getitimer | |
setitimer | |
timer_create | |
timer_delete | |
timer_getoverrun | |
timer_gettime | |
timer_gettime64 | |
timer_settime | |
timer_settime64 | |
timerfd_create | |
timerfd_gettime | |
timerfd_gettime64 | |
timerfd_settime | |
timerfd_settime64 | |
times | Get process and child process times, including CPU time in userspace and by the system for the calling process, and similar for the child processes |
@known
Description: Includes all syscalls that are known to the Linux kernel, plus the ones in @obsolete