Systemd syscall filtering
Introduction
Systemd uses seccomp to filter system calls (syscalls). Syscalls are the functions that the kernel provides to processes; programs usually invoke them through wrapper functions in a C library such as glibc. This is a generic library full of functions that allow communication between a process and the kernel. With seccomp support, these syscalls can be blocked. Systemd uses this to allow or deny specific system functions with the SystemCallFilter= setting.
Besides allowing or denying specific syscalls, systemd also provides predefined sets. These sets group similar or related functionality into a filter set that can then be allowed or denied.
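As a minimal sketch of how the SystemCallFilter= setting and the predefined sets fit together, the function below writes a drop-in file for a unit. The function name, the file name syscall-filter.conf and the nginx.service unit in the usage note are assumptions for illustration only; the chosen sets are an example, not a recommendation for any particular service.

```shell
# Write a drop-in with a syscall filter into the given directory.
#   $1: drop-in directory (hypothetical example:
#       /etc/systemd/system/nginx.service.d)
write_syscall_filter_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/syscall-filter.conf" <<'EOF'
[Service]
# Allow list: start from the predefined @system-service set
SystemCallFilter=@system-service
# A leading "~" inverts the list: remove these sets from the allow list
SystemCallFilter=~@privileged @resources
# Return EPERM for a blocked syscall instead of terminating the process
SystemCallErrorNumber=EPERM
EOF
}

# Usage (as root), followed by a reload and a restart of the unit:
#   write_syscall_filter_dropin /etc/systemd/system/nginx.service.d
#   systemctl daemon-reload && systemctl restart nginx.service
```

Starting from an allow list and then removing sets keeps the configuration short. Whether a service still works with these sets removed has to be tested per service.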
What syscalls does a process use?
A web server running nginx should obviously be allowed to listen for network traffic on a port like 80 or 443. Most likely it does not need to be able to change the system clock, while an NTP daemon should. But how do you know what syscalls are used in the first place?
Dynamic analysis
Want to discover which syscalls a running process uses? You can attach strace to it, although this may crash the process or decrease its performance. So when possible, do this only on systems that are not in production.
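The strace approach can be sketched as follows: record a trace to a file, then reduce it to a list of unique syscall names. The helper name syscalls_from_strace and the log path are hypothetical, and the parsing assumes the usual strace line format (an optional PID prefix, followed by the syscall name and an opening parenthesis).

```shell
# Print the unique syscall names found in a strace log file.
#   $1: path to a log written by strace (e.g. with -o)
syscalls_from_strace() {
    awk '{
        line = $0
        sub(/^\[pid +[0-9]+\] +/, "", line)   # prefix used on the terminal with -f
        sub(/^[0-9]+ +/, "", line)            # prefix used in -o files with -f
        # A syscall line starts with "name(", signals and exit lines do not
        if (match(line, /^[a-z_][a-z0-9_]*\(/))
            print substr(line, 1, RLENGTH - 1)
    }' "$1" | sort -u
}

# Usage (non-production system, as root); stop strace with Ctrl+C:
#   strace -f -o /tmp/nginx.trace -p "$(pidof -s nginx)"
#   syscalls_from_strace /tmp/nginx.trace
```

A trace only shows the syscalls that were used while it was recording, so rarely used code paths (log rotation, reloads, error handling) may be missing from the result.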
Binary analysis
Another way is to look at the binary and see what functions are used.
strings /usr/sbin/nginx | grep -E '^[a-z0-9_]{4,32}\(\)' | awk '{print $1}' | sort | uniq
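The strings in a binary also contain function names that are not syscalls. A possible refinement, sketched here under assumptions, is to keep only the candidates that appear in a list of known syscall names. The helper name known_syscalls_only and the file names are hypothetical; the list of known names could, for example, be extracted from the output of systemd-analyze syscall-filter.

```shell
# Keep only the candidates that are known syscall names.
#   $1: file with candidates, one per line, such as "epoll_wait()"
#   $2: file with known syscall names, one per line
known_syscalls_only() {
    # Strip the trailing "()" and keep exact, whole-line matches only
    sed 's/()$//' "$1" | grep -Fx -f "$2" | sort -u
}

# Usage, assuming candidates.txt holds the output of the strings pipeline
# and known.txt holds one syscall name per line:
#   known_syscalls_only candidates.txt known.txt
```

The result is still a heuristic: a name appearing in the binary does not prove the syscall is ever executed, and syscalls made by linked libraries will not show up.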
Filter sets
Systemd uses filter sets to allow or deny functionality per group. To see the content of a set (e.g. @clock):
systemd-analyze syscall-filter @clock
To simplify looking up this information, the filter sets and their syscalls are collected in this overview.
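The reverse lookup, finding which sets list a given syscall, can be sketched with a small awk filter over the output of systemd-analyze syscall-filter. The helper name sets_for_syscall is hypothetical, and the parsing assumes that each set name starts at the beginning of a line while its members are indented below it.

```shell
# Read "systemd-analyze syscall-filter" output on stdin and print the
# names of the sets that directly list the given syscall.
#   $1: syscall name, e.g. clock_settime
sets_for_syscall() {
    awk -v want="$1" '
        /^@/ { current = $1; next }     # unindented "@name" starts a new set
        $1 == want { print current }    # indented member equal to the syscall
    '
}

# Usage:
#   systemd-analyze syscall-filter | sets_for_syscall clock_settime
```

Only direct members are reported. A set that pulls in the syscall through another included set is not listed by this sketch.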
@default
Description: These system calls are always permitted
Syscall | Purpose |
---|---|
arch_prctl | |
brk | Change the location of program break, specifically the end of the process's data segment |
cacheflush | Flushes contents of cache(s) for user addresses in specified range |
clock_getres | Retrieve the resolution (precision) of a specified clock |
clock_getres_time64 | |
clock_gettime | Retrieve time from specified clock |
clock_gettime64 | 64-bit version of clock_gettime() |
clock_nanosleep | |
clock_nanosleep_time64 | |
execve | Executes the program referred to by specified pathname |
exit | Terminates the calling process, parent process will receive a SIGCHLD signal |
exit_group | |
futex | Provides a method for waiting until certain condition becomes true |
futex_time64 | |
futex_waitv | |
get_robust_list | |
get_thread_area | |
getegid | Returns effective group ID of the calling process |
getegid32 | |
geteuid | Retrieve effective user ID of calling process |
geteuid32 | 32-bit version of geteuid() |
getgid | Returns real group ID of the calling process |
getgid32 | 32-bit version of getgid() |
getgroups | Returns supplementary group IDs of calling process |
getgroups32 | 32-bit version of getgroups() |
getpgid | Retrieve process group ID (PGID) |
getpgrp | |
getpid | Returns process ID (PID) of calling process |
getppid | Returns process ID (PID) of parent of the calling process |
getrandom | Receive random bytes |
getresgid | |
getresgid32 | |
getresuid | |
getresuid32 | |
getrlimit | Get resource limits |
getsid | Receive session ID of a defined process |
gettid | Returns thread ID (TID) of caller. Same as process ID (PID) for single-threaded applications, otherwise different |
gettimeofday | Get time or timezone |
getuid | Returns real user ID of the calling process |
getuid32 | 32-bit version of getuid() |
membarrier | |
mmap | Create new mapping in the virtual address space of the calling process |
mmap2 | |
mprotect | |
munmap | Deletes the mappings for specified address range and marks range to generate invalid memory references |
nanosleep | |
pause | |
prlimit64 | |
restart_syscall | |
riscv_flush_icache | |
riscv_hwprobe | |
rseq | |
rt_sigreturn | |
sched_getaffinity | |
sched_yield | |
set_robust_list | |
set_thread_area | |
set_tid_address | |
set_tls | |
sigreturn | |
time | Return time; as number of seconds since the Epoch (1970-01-01 00:00:00 +0000 (UTC)) |
ugetrlimit | |
@aio
Description: Asynchronous IO
Syscall | Purpose |
---|---|
io_cancel | Attempts to cancel asynchronous I/O operation that was submitted by io_submit() |
io_destroy | |
io_getevents | |
io_pgetevents | |
io_pgetevents_time64 | |
io_setup | |
io_submit | Submit asynchronous I/O blocks for processing, can be cancelled with io_cancel() |
io_uring_enter | |
io_uring_register | |
io_uring_setup | |
@basic-io
Description: Basic IO
Almost all software requires this set to open a file, read from it, or write to it.
Syscall | Purpose |
---|---|
_llseek | |
close | Close file descriptor |
close_range | |
dup | Duplicate file descriptor; more specifically it allocates a new file descriptor that refers to the same open file description as oldfd |
dup2 | Same as dup(), duplicate file descriptor; difference is that it uses file descriptor number specified in newfd |
dup3 | Same as dup2(); difference is that caller can force close-on-exec flag (O_CLOEXEC) to be set |
lseek | Reposition file offset for read/write |
pread64 | |
preadv | |
preadv2 | |
pwrite64 | |
pwritev | |
pwritev2 | |
read | Read from file descriptor |
readv | Read buffers from file |
write | Write to file descriptor |
writev | Writes buffers to file |
@chown
Description: Ability to change ownership of files and directories
Syscall | Purpose |
---|---|
chown | Changes ownership of file specified by pathname, dereferenced if file is a symbolic link |
chown32 | 32-bit version of chown() |
fchown | Changes ownership of file, referred to by open file descriptor (fd) |
fchown32 | 32-bit version of fchown() |
fchownat | Similar to chown(), but deals differently with relative paths |
lchown | Like chown(), but does not dereference symbolic links |
lchown32 | 32-bit version of lchown() |
@clock
Description: Ability to change system time
Note: this is rarely needed for normal services.
Syscall | Purpose |
---|---|
adjtimex | Reads and optionally sets adjustment parameters for clock adjustment algorithm used on Linux (RFC 5905) |
clock_adjtime | Reads and optionally sets adjustment parameters for clock adjustment algorithm used on Linux (RFC 5905). It behaves like adjtimex(), but takes an additional clk_id argument to define the clock |
clock_adjtime64 | 64-bit version of clock_adjtime() |
clock_settime | Set time of specified clock |
clock_settime64 | 64-bit version of clock_settime() |
settimeofday | Set time or timezone |
@cpu-emulation
Description: Ability to do CPU emulation
Syscall | Purpose |
---|---|
modify_ldt | |
subpage_prot | |
switch_endian | |
vm86 | |
vm86old | |
@debug
Description: Debugging, performance monitoring, tracing functionality
Note: this is normally only used by tools like strace and perf.
Syscall | Purpose |
---|---|
lookup_dcookie | |
perf_event_open | |
pidfd_getfd | |
ptrace | Process tracing; usually for breakpoint debugging and system call tracing |
rtas | |
s390_runtime_instr | |
sys_debug_setcontext | |
@file-system
Description: File system operations
Note: normally all processes need this to be able to read a directory or open a file.
Syscall | Purpose |
---|---|
access | Checks whether the calling process can access the pathname, dereferenced when it is a symbolic link |
chdir | Change work directory |
chmod | Change mode of the file, dereferenced for symbolic links |
close | Close file descriptor |
creat | Like open(), but sets flags O_CREAT|O_WRONLY|O_TRUNC |
faccessat | Similar to access(), works slightly differently when pathname is relative |
faccessat2 | Similar to faccessat(), but adds the flags argument that the original faccessat() system call lacks |
fallocate | |
fchdir | Similar to chdir, but uses open file descriptor |
fchmod | Same as chmod(), but operates on the file referred to by open file descriptor fd |
fchmodat | Similar to chmod(), works slightly differently when pathname is relative |
fchmodat2 | |
fcntl | Performs an action on file defined by a file descriptor, such as setting flags |
fcntl64 | 64-bit version of fcntl() |
fgetxattr | |
flistxattr | |
fremovexattr | |
fsetxattr | |
fstat | Similar to stat(), but uses file descriptor fd |
fstat64 | |
fstatat64 | |
fstatfs | |
fstatfs64 | |
ftruncate | Truncate a file open for writing to specified number of bytes, which may fill it with null bytes (\0) or decrease its size and lose data |
ftruncate64 | 64-bit version of ftruncate() |
futimesat | |
getcwd | Copies the absolute pathname of current working directory to a buffer |
getdents | Retrieve entries from a directory |
getdents64 | 64-bit version of getdents() |
getxattr | |
inotify_add_watch | |
inotify_init | |
inotify_init1 | |
inotify_rm_watch | |
lgetxattr | |
link | Create new link (hard link) to existing file |
linkat | Similar to link(), but deals differently with relative paths |
listxattr | |
llistxattr | |
lremovexattr | |
lsetxattr | |
lstat | Similar to stat(), but if pathname is symbolic link, return information about link and not the file that symbolic link points to |
lstat64 | |
mkdir | Create directory |
mkdirat | Similar to mkdir() but deals differently with relative paths |
mknod | Create filesystem node (file, device special file, or named pipe) named pathname |
mknodat | Similar to mknod(), works slightly differently when pathname is relative |
newfstatat | |
oldfstat | |
oldlstat | |
oldstat | |
open | Opens file specified by pathname to allow reading or writing data |
openat | Similar to open(), but uses dirfd argument and deals differently with path |
openat2 | |
readlink | |
readlinkat | |
removexattr | |
rename | Rename a file, move it between directories if required |
renameat | Similar to rename(), but deals differently with relative paths |
renameat2 | Similar to renameat() when no flags are provided, otherwise it has additional options |
rmdir | Delete directory |
setxattr | |
stat | Get information about file |
stat64 | 64-bit version of stat() |
statfs | |
statfs64 | |
statx | |
symlink | Create symbolic link |
symlinkat | Similar to symlink() but deals differently with relative paths |
truncate | Truncate a writable file to specified number of bytes, which may fill it with null bytes (\0) or decrease its size and lose data |
truncate64 | 64-bit version of truncate() |
unlink | Delete name from filesystem |
unlinkat | Similar to unlink() but deals differently with relative paths |
utime | Change access and modification times of inode |
utimensat | |
utimensat_time64 | |
utimes | Similar to utime(), but uses array instead of a structure |
@io-event
Description: Event loop system calls
Syscall | Purpose |
---|---|
_newselect | |
epoll_create | |
epoll_create1 | |
epoll_ctl | Manage (add, modify, remove) entries in an epoll instance, which is used to monitor whether I/O is possible on the defined set of file descriptors. Similar to poll(), with additional benefits. |
epoll_ctl_old | |
epoll_pwait | |
epoll_pwait2 | |
epoll_wait | |
epoll_wait_old | |
eventfd | |
eventfd2 | |
poll | Similar task to select(2), which is waiting for a set of file descriptors to become available for I/O. |
ppoll | Let an application wait until file descriptor is available or signal is caught |
ppoll_time64 | |
pselect6 | |
pselect6_time64 | |
select | Let a program monitor multiple file descriptors until one or more become available for I/O actions. This system call has limitations and typically poll or epoll is used. |
@ipc
Description: SysV IPC, POSIX Message Queues or other Inter-Process Communication (IPC)
Syscall | Purpose |
---|---|
ipc | |
memfd_create | |
mq_getsetattr | |
mq_notify | |
mq_open | |
mq_timedreceive | |
mq_timedreceive_time64 | |
mq_timedsend | |
mq_timedsend_time64 | |
mq_unlink | |
msgctl | |
msgget | |
msgrcv | |
msgsnd | |
pipe | Create a pipe that allows unidirectional communication between processes |
pipe2 | Similar to pipe(), to create a channel between two processes. With flag O_DIRECT it will use packet-style communication instead of a stream |
process_madvise | |
process_vm_readv | |
process_vm_writev | |
semctl | |
semget | |
semop | |
semtimedop | |
semtimedop_time64 | |
shmat | |
shmctl | |
shmdt | |
shmget | |
@keyring
Description: Kernel keyring access
Syscall | Purpose |
---|---|
add_key | Create or update a key for kernel key management facility |
keyctl | Allow user-space programs to take actions on keys, such as updating, revocation, ownership |
request_key | Request a key from kernel key management facility |
@memlock
Description: Memory locking control
Syscall | Purpose |
---|---|
mlock | Lock pages in a specified address range, so they are guaranteed to stay in memory instead of being swapped to disk |
mlock2 | Same as mlock() if flags is 0. With flag MLOCK_ONFAULT it locks the currently resident pages, then marks the range so that currently nonresident pages are locked later when they are used (page fault) |
mlockall | Similar to mlock, but tries to lock all the memory pages of the calling process to prevent swapping |
munlock | Opposite of mlock() to release lock on memory area, so it can be swapped to disk if needed |
munlockall | Unlocks all memory pages of calling process so it can be swapped to disk again by the kernel |
@module
Description: Ability to load or unload kernel modules
Syscall | Purpose |
---|---|
delete_module | Tries to remove an unused loadable module entry, which corresponds to a currently loaded Linux kernel module (LKM) |
finit_module | Similar to init_module(), but loads the image (ELF) from a file descriptor |
init_module | Load image (ELF) into kernel space and perform the steps required to initialize it, such as triggering the init() function of the module |
@mount
Description: Ability to mount or unmount a file system
Note: Most services will not need to use mount/umount
Syscall | Purpose |
---|---|
chroot | |
fsconfig | |
fsmount | |
fsopen | |
fspick | |
mount | |
mount_setattr | |
move_mount | |
open_tree | |
pivot_root | |
umount | |
umount2 | |
@network-io
Description: Network or Unix socket actions, like opening a network port to listen
When to use: This filter set is only required for services that actually use network or Unix sockets, for example to listen on a network port.
Syscall | Purpose |
---|---|
accept | Accept a connection on a socket |
accept4 | |
bind | Assigns address to a socket that was created with socket() |
connect | Initiate connection on a defined socket |
getpeername | Receive address of the peer connected to a socket |
getsockname | Retrieve current address of defined socket |
getsockopt | Get options for socket |
listen | Marks socket as passive, so it can accept incoming connections with accept() |
recv | Like read(), but normally only used on a socket and has additional flags that can be set |
recvfrom | Receives a message on a socket, close to recv(), but with additional arguments to return the source address of the message |
recvmmsg | |
recvmmsg_time64 | |
recvmsg | Receives a message on a socket with a predefined structure to minimize the number of arguments |
send | |
sendmmsg | |
sendmsg | |
sendto | |
setsockopt | Set options on socket |
shutdown | |
socket | Create endpoint for communication and return file descriptor |
socketcall | |
socketpair | Create a pair of connected sockets, for example for communication between parent and child process |
@obsolete
Description: Unusual, obsolete or unimplemented system calls, with some unknown to the underlying seccomp library
Syscall | Purpose |
---|---|
_sysctl | |
afs_syscall | |
bdflush | |
break | |
create_module | |
ftime | |
get_kernel_syms | |
getpmsg | |
gtty | |
idle | |
lock | |
mpx | |
prof | |
profil | |
putpmsg | |
query_module | |
security | |
sgetmask | |
ssetmask | |
stime | |
stty | |
sysfs | |
tuxcall | |
ulimit | |
uselib | |
ustat | |
vserver | |
@pkey
Description: Set of calls for memory protection keys
Syscall | Purpose |
---|---|
pkey_alloc | |
pkey_free | |
pkey_mprotect | |
@privileged
Description: System calls which typically need super-user capabilities. It also includes other filter sets; use systemd-analyze syscall-filter @privileged to list them.
Syscall | Purpose |
---|---|
_sysctl | |
acct | |
bpf | |
capset | |
chroot | |
fanotify_init | |
fanotify_mark | |
nfsservctl | |
open_by_handle_at | |
pivot_root | |
quotactl | |
quotactl_fd | |
setdomainname | |
setfsuid | |
setfsuid32 | |
setgroups | |
setgroups32 | |
sethostname | |
setresuid | |
setresuid32 | |
setreuid | |
setreuid32 | |
setuid | |
setuid32 | |
vhangup | |
@process
Description: Process control, execution, namespacing operations
Syscall | Purpose |
---|---|
capget | |
clone | Similar to fork() to create a child process, with more fine-grained options to define what is shared between calling process and child. This system call can also make a new process part of newly created namespace by specifying a flag. |
clone3 | Provides superset of the functionality of the older clone() interface to create child process |
execveat | |
fork | |
getrusage | |
kill | |
pidfd_open | |
pidfd_send_signal | |
prctl | |
rt_sigqueueinfo | |
rt_tgsigqueueinfo | |
setns | |
swapcontext | |
tgkill | |
times | |
tkill | |
unshare | |
vfork | |
wait4 | |
waitid | |
waitpid | |
@raw-io
Description: Raw I/O port access
Syscall | Purpose |
---|---|
ioperm | |
iopl | |
pciconfig_iobase | |
pciconfig_read | |
pciconfig_write | |
s390_pci_mmio_read | |
s390_pci_mmio_write | |
@reboot
Description: Ability to reboot, or to prepare a reboot using kexec functionality that loads a kernel for later execution.
Note: normal services do not need this set of syscalls
Syscall | Purpose |
---|---|
kexec_file_load | Similar to kexec_load(), but uses file descriptor for kernel and initrd (initial ram disk) |
kexec_load | Load new kernel for later execution |
reboot | Reboots the system, or enables/disables reboot keystroke (default: Ctrl+Alt+Delete; changed using loadkeys(1)) |
@resources
Description: Ability to alter resource settings, such as process priority
Syscall | Purpose |
---|---|
ioprio_set | |
mbind | |
migrate_pages | |
move_pages | |
nice | Change process priority, ranging from +19 (lowest priority) to -20 (highest priority) |
sched_setaffinity | |
sched_setattr | |
sched_setparam | |
sched_setscheduler | |
set_mempolicy | |
set_mempolicy_home_node | |
setpriority | |
setrlimit | Set resource limits |
@sandbox
Description: Sandbox functionality, such as support for landlock and seccomp
Syscall | Purpose |
---|---|
landlock_add_rule | |
landlock_create_ruleset | |
landlock_restrict_self | |
seccomp | |
@setuid
Description: Operations for changing user/group credentials (setuid/setgid)
Syscall | Purpose |
---|---|
setgid | Set effective group ID of calling process, with CAP_SETGID capability it also sets real GID and saved set-group-ID |
setgid32 | |
setgroups | |
setgroups32 | |
setregid | |
setregid32 | |
setresgid | |
setresgid32 | |
setresuid | |
setresuid32 | |
setreuid | |
setreuid32 | |
setuid | Set effective user ID of calling process, with CAP_SETUID capability it also sets real UID and saved set-user-ID |
setuid32 | |
@signal
Description: Signal handling for processes
Syscall | Purpose |
---|---|
rt_sigaction | |
rt_sigpending | |
rt_sigprocmask | |
rt_sigsuspend | |
rt_sigtimedwait | |
rt_sigtimedwait_time64 | |
sigaction | |
sigaltstack | |
signal | |
signalfd | |
signalfd4 | |
sigpending | |
sigprocmask | |
sigsuspend | |
@swap
Description: Ability to enable or disable swap devices
Note: not required for normal services
Syscall | Purpose |
---|---|
swapoff | |
swapon | |
@sync
Description: Synchronize files and memory to storage
Syscall | Purpose |
---|---|
fdatasync | |
fsync | |
msync | |
sync | |
sync_file_range | |
sync_file_range2 | |
syncfs | |
@system-service
Description: General system service operations
Besides the syscalls below, it also includes the following filter sets:
- @aio
- @basic-io
- @chown
- @default
- @file-system
- @io-event
- @ipc
- @keyring
- @memlock
- @network-io
- @process
- @resources
- @setuid
- @signal
- @sync
- @timer
Syscall | Purpose |
---|---|
arm_fadvise64_64 | |
capget | Retrieve thread capabilities |
capset | Set thread capabilities |
copy_file_range | |
fadvise64 | |
fadvise64_64 | |
flock | Apply or remove advisory lock on file |
get_mempolicy | |
getcpu | |
getpriority | |
ioctl | |
ioprio_get | |
kcmp | |
madvise | |
mremap | |
name_to_handle_at | |
oldolduname | |
olduname | |
personality | |
readahead | |
readdir | Read a directory |
remap_file_pages | |
sched_get_priority_max | |
sched_get_priority_min | |
sched_getattr | |
sched_getparam | |
sched_getscheduler | |
sched_rr_get_interval | |
sched_rr_get_interval_time64 | |
sched_yield | |
sendfile | Copies data between one file descriptor and another |
sendfile64 | |
setfsgid | |
setfsgid32 | |
setfsuid | |
setfsuid32 | |
setpgid | |
setsid | |
splice | |
sysinfo | |
tee | Duplicate pipe content, does not consume the data |
umask | Set file mode creation mask |
uname | Retrieve name and information about the current kernel |
userfaultfd | |
vmsplice | |
@timer
Description: Timers, to schedule operations by time
Syscall | Purpose |
---|---|
alarm | Schedule an alarm; it lets the system generate a SIGALRM signal for the process after a specified time |
getitimer | |
setitimer | |
timer_create | |
timer_delete | |
timer_getoverrun | |
timer_gettime | |
timer_gettime64 | |
timer_settime | |
timer_settime64 | |
timerfd_create | |
timerfd_gettime | |
timerfd_gettime64 | |
timerfd_settime | |
timerfd_settime64 | |
times | Get process and child process times, including CPU time in userspace and by the system for the calling process, and similar for the child processes |
@known
Description: Includes all syscalls that are known to the Linux kernel, plus the ones in @obsolete