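seccomp(2)                 System Calls Manual                 seccomp(2)

NAME
       seccomp - operate on Secure Computing state of the process

LIBRARY
       Standard C library (libc, -lc)

SYNOPSIS
       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the use of syscall(2).

DESCRIPTION
       The seccomp() system call operates on the Secure Computing (seccomp) state of the calling process.

       Currently, Linux supports the following operation values:

       SECCOMP_SET_MODE_STRICT
              The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the termination of the calling thread, or termination of the entire process with the SIGKILL signal when there is only one thread. Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket.

              Note that although the calling thread can no longer call sigprocmask(2), it can use sigreturn(2) to block all signals apart from SIGKILL and SIGSTOP. This means that alarm(2) (for example) is not sufficient for restricting the process's execution time. Instead, to reliably terminate the process, SIGKILL must be used. This can be done by using timer_create(2) with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setrlimit(2) to set the hard limit for RLIMIT_CPU.

              This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled.

              The value of flags must be 0, and args must be NULL.

              This operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

       SECCOMP_SET_MODE_FILTER
              The system calls allowed are defined by a pointer to a Berkeley Packet Filter (BPF) passed via args. This argument is a pointer to a struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. If the filter is invalid, seccomp() fails, returning EINVAL in errno.

              If fork(2) or clone(2) is allowed by the filter, any child processes will be constrained to the same system call filters as the parent. If execve(2) is allowed, the existing filters will be preserved across a call to execve(2).

              In order to use the SECCOMP_SET_MODE_FILTER operation, either the calling thread must have the CAP_SYS_ADMIN capability in its user namespace, or the thread must already have the no_new_privs bit set. If that bit was not already set by an ancestor of this thread, the thread must make the following call:

                  prctl(PR_SET_NO_NEW_PRIVS, 1);

              Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and returns EACCES in errno. This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller's user IDs to nonzero values to instead return 0 without actually making the system call. Thus, the program might be tricked into retaining superuser privileges in circumstances where it is possible to influence it to do dangerous things because it did not actually drop privileges.)

              If prctl(2) or seccomp() is allowed by the attached filter, further filters may be added. This will increase evaluation time, but allows for further reduction of the attack surface during execution of a thread.

              The SECCOMP_SET_MODE_FILTER operation is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              When flags is 0, this operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

              The recognized flags are:

              SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
                     All filter return actions except SECCOMP_RET_ALLOW should be logged. An administrator may override this filter flag by preventing specific actions from being logged via the /proc/sys/kernel/seccomp/actions_logged file.

              SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
                     After successfully installing the filter program, return a new user-space notification file descriptor. (The close-on-exec flag is set for the file descriptor.)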
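                     When the filter returns SECCOMP_RET_USER_NOTIF, a notification will be sent to this file descriptor.

                     At most one seccomp filter using the SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be installed for a thread.

                     See seccomp_unotify(2) for further details.

              SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
                     Disable Speculative Store Bypass mitigation.

              SECCOMP_FILTER_FLAG_TSYNC
                     When adding a new filter, synchronize all other threads of the calling process to the same seccomp filter tree. A "filter tree" is the ordered list of filters attached to a thread. (Attaching identical filters in separate seccomp() calls results in different filters from this perspective.)

                     If any thread cannot synchronize to the same filter tree, the call will not attach the new seccomp filter, and will fail, returning the first thread ID found that cannot synchronize. Synchronization will fail if another thread in the same process is in SECCOMP_MODE_STRICT or if it has attached new seccomp filters to itself, diverging from the calling thread's filter tree.

       SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
              Test to see if an action is supported by the kernel. This operation is helpful to confirm that the kernel knows of a more recently added filter return action, since the kernel treats all unknown actions as SECCOMP_RET_KILL_PROCESS.

              The value of flags must be 0, and args must be a pointer to an unsigned 32-bit filter return action.

       SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
              Get the sizes of the seccomp user-space notification structures. Since these structures may evolve and grow over time, this command can be used to determine how much memory to allocate for sending and receiving notifications.

              The value of flags must be 0, and args must be a pointer to a struct seccomp_notif_sizes, which has the following form:

                  struct seccomp_notif_sizes {
                      __u16 seccomp_notif;      /* Size of notification structure */
                      __u16 seccomp_notif_resp; /* Size of response structure */
                      __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
                  };

              See seccomp_unotify(2) for further details.

   Filters
       When adding filters via SECCOMP_SET_MODE_FILTER, args points to a filter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };

       When executing the instructions, the BPF program operates on the system call information made available (i.e., use the BPF_ABS addressing mode) as a (read-only) buffer of the following form:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

       Because numbering of system calls varies between architectures and some architectures (e.g., x86-64) allow user-space code to use the calling conventions of multiple architectures (and the convention being used may vary over the life of a process that uses execve(2) to execute binaries that employ the different conventions), it is usually necessary to verify the value of the arch field.

       It is strongly recommended to use an allow-list approach whenever possible because such an approach is more robust and simpler. A deny-list will have to be updated whenever a potentially dangerous system call is added (or a dangerous flag or option if those are deny-listed), and it is often possible to alter the representation of a value without altering its meaning, leading to a deny-list bypass.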
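       The allow-list recommendation can be illustrated with a short C sketch. It is only an example under stated assumptions: the helper name build_allowlist(), its parameters, and the program layout are hypothetical and are not part of any kernel or C library interface. The helper only fills a caller-supplied array with a BPF program that first verifies the arch field and then permits the listed system call numbers; it does not install anything.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical helper (not a kernel or libc API): writes an allow-list
   program into 'out' and returns the number of instructions written,
   or 0 if the arguments do not fit.

   Layout, with n = n_allowed:
     [0]       load 'arch'
     [1]       arch == t_arch ? skip [2] : fall through to [2]
     [2]       return KILL_PROCESS (unexpected architecture)
     [3]       load 'nr'
     [4..3+n]  nr == allowed[i] ? jump to [5+n] : try next
     [4+n]     return KILL_PROCESS (default: not on the list)
     [5+n]     return ALLOW */
static size_t
build_allowlist(struct sock_filter *out, size_t cap, unsigned int t_arch,
                const int *allowed, size_t n_allowed)
{
    size_t n = 0;

    /* 'jt' is 8 bits wide, so one forward jump can skip at most 255
       instructions; that bounds the list length in this simple layout. */
    if (n_allowed == 0 || n_allowed > 255 || cap < n_allowed + 6)
        return 0;

    out[n++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   (offsetof(struct seccomp_data, arch)));
    out[n++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                   t_arch, 1, 0);
    out[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                   SECCOMP_RET_KILL_PROCESS);
    out[n++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   (offsetof(struct seccomp_data, nr)));

    /* One comparison per allowed number; a match jumps over the
       remaining comparisons and the default-deny return. */
    for (size_t i = 0; i < n_allowed; i++)
        out[n++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                       (unsigned int) allowed[i],
                       (unsigned char) (n_allowed - i), 0);

    out[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                   SECCOMP_RET_KILL_PROCESS);
    out[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                   SECCOMP_RET_ALLOW);

    assert(n == n_allowed + 6);
    return n;
}
```

       To use such a program, a caller would store the returned count and the array in the len and filter fields of a struct sock_fprog and pass it to seccomp() with SECCOMP_SET_MODE_FILTER, after setting no_new_privs if needed. Because any number that is not listed is denied, x32 system call numbers (which carry __X32_SYSCALL_BIT) are rejected by this layout unless they are listed explicitly.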
       See also Caveats below.

       The arch field is not unique for all calling conventions. The x86-64 ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on the same processors. Instead, the mask __X32_SYSCALL_BIT is used on the system call number to tell the two ABIs apart.

       This means that a policy must either deny all syscalls with __X32_SYSCALL_BIT or it must recognize syscalls with and without __X32_SYSCALL_BIT set. A list of system calls to be denied based on nr that does not also contain nr values with __X32_SYSCALL_BIT set can be bypassed by a malicious program that sets __X32_SYSCALL_BIT.

       Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in the ranges 512-547 as well as the corresponding non-x32 syscalls ORed with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 | __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with potentially confused x32-vs-x86_64 semantics in the kernel. Policies intended to work on kernels before Linux 5.4 must ensure that they deny or otherwise correctly handle these system calls. On Linux 5.4 and newer, such system calls will fail with the error ENOSYS, without doing anything.

       The instruction_pointer field provides the address of the machine-language instruction that performed the system call. This might be useful in conjunction with the use of /proc/pid/maps to perform checks based on which region (mapping) of the program made the system call. (Probably, it is wise to lock down the mmap(2) and mprotect(2) system calls to prevent the program from subverting such checks.)

       When checking values from args, keep in mind that arguments are often silently truncated before being processed, but after the seccomp check. For example, this happens if the i386 ABI is used on an x86-64 kernel: although the kernel will normally not look beyond the 32 lowest bits of the arguments, the values of the full 64-bit registers will be present in the seccomp data. A less surprising example is that if the x86-64 ABI is used to perform a system call that takes an argument of type int, the more-significant half of the argument register is ignored by the system call, but visible in the seccomp data.

       A seccomp filter returns a 32-bit value consisting of two parts: the most significant 16 bits (corresponding to the mask defined by the constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values listed below; the least significant 16 bits (defined by the constant SECCOMP_RET_DATA) are "data" to be associated with this return value.

       If multiple filters exist, they are all executed, in reverse order of their addition to the filter tree--that is, the most recently installed filter is executed first.
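       (Note that all filters will be called even if one of the earlier filters returns SECCOMP_RET_KILL. This is done to simplify the kernel code and to provide a tiny speed-up in the execution of sets of filters by avoiding a check for this uncommon case.) The return value for the evaluation of a given system call is the first-seen action value of highest precedence (along with its accompanying data) returned by execution of all of the filters.

       In decreasing order of precedence, the action values that may be returned by a seccomp filter are:

       SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
              This value results in immediate termination of the process, with a core dump. The system call is not executed. By contrast with SECCOMP_RET_KILL_THREAD below, all threads in the thread group are terminated. (For a discussion of thread groups, see the description of the CLONE_THREAD flag in clone(2).)

              The process terminates as though killed by a SIGSYS signal. Even if a signal handler has been registered for SIGSYS, the handler will be ignored in this case and the process always terminates. To a parent process that is waiting on this process (using waitpid(2) or similar), the returned wstatus will indicate that its child was terminated as though by a SIGSYS signal.

       SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
              This value results in immediate termination of the thread that made the system call. The system call is not executed. Other threads in the same thread group will continue to execute.

              The thread terminates as though killed by a SIGSYS signal. See SECCOMP_RET_KILL_PROCESS above.

              Before Linux 4.11, any process terminated in this way would not trigger a core dump (even though SIGSYS is documented in signal(7) as having a default action of termination with a core dump). Since Linux 4.11, a single-threaded process will dump core if terminated in this way.

              With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14, SECCOMP_RET_KILL_THREAD was added as a synonym for SECCOMP_RET_KILL, in order to more clearly distinguish the two actions.

              Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread in a multithreaded process is likely to leave the process in a permanently inconsistent and possibly corrupt state.

       SECCOMP_RET_TRAP
              This value results in the kernel sending a thread-directed SIGSYS signal to the triggering thread. (The system call is not executed.) Various fields will be set in the siginfo_t structure (see sigaction(2)) associated with the signal:

              o  si_signo will contain SIGSYS.

              o  si_call_addr will show the address of the system call instruction.

              o  si_syscall and si_arch will indicate which system call was attempted.

              o  si_code will contain SYS_SECCOMP.

              o  si_errno will contain the SECCOMP_RET_DATA portion of the filter return value.

              The program counter will be as though the system call happened (i.e., the program counter will not point to the system call instruction). The return value register will contain an architecture-dependent value; if resuming execution, set it to something appropriate for the system call. (The architecture dependency is because replacing it with ENOSYS could overwrite some useful information.)

       SECCOMP_RET_ERRNO
              This value results in the SECCOMP_RET_DATA portion of the filter's return value being passed to user space as the errno value without executing the system call.

       SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              Forward the system call to an attached user-space supervisor process to allow that process to decide what to do with the system call. If there is no attached supervisor (either because the filter was not installed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file descriptor was closed), the filter returns ENOSYS (similar to what happens when a filter returns SECCOMP_RET_TRACE and there is no tracer). See seccomp_unotify(2) for further details.

              Note that the supervisor process will not be notified if another filter returns an action value with a precedence greater than SECCOMP_RET_USER_NOTIF.

       SECCOMP_RET_TRACE
              When returned, this value will cause the kernel to attempt to notify a ptrace(2)-based tracer prior to executing the system call. If there is no tracer present, the system call is not executed and returns a failure status with errno set to ENOSYS.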
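
              A tracer will be notified if it requests PTRACE_O_TRACESECCOMP using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the filter's return value will be available to the tracer via PTRACE_GETEVENTMSG.

              The tracer can skip the system call by changing the system call number to -1. Alternatively, the tracer can change the system call requested by changing the system call to a valid system call number. If the tracer asks to skip the system call, then the system call will appear to return the value that the tracer puts in the return value register.

              Before Linux 4.8, the seccomp check will not be run again after the tracer is notified. (This means that, on older kernels, seccomp-based sandboxes must not allow use of ptrace(2)--even of other sandboxed processes--without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)

              Note that a tracer process will not be notified if another filter returns an action value with a precedence greater than SECCOMP_RET_TRACE.

       SECCOMP_RET_LOG (since Linux 4.14)
              This value results in the system call being executed after the filter return action is logged. An administrator may override the logging of this action via the /proc/sys/kernel/seccomp/actions_logged file.

       SECCOMP_RET_ALLOW
              This value results in the system call being executed.

       If an action value other than one of the above is specified, then the filter action is treated as either SECCOMP_RET_KILL_PROCESS (since Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).

   /proc interfaces
       The files in the directory /proc/sys/kernel/seccomp provide additional seccomp information and configuration:

       actions_avail (since Linux 4.14)
              A read-only ordered list of seccomp filter return actions in string form. The ordering, from left to right, is in decreasing order of precedence. The list represents the set of seccomp filter return actions supported by the kernel.

       actions_logged (since Linux 4.14)
              A read-write ordered list of seccomp filter return actions that are allowed to be logged. Writes to the file do not need to be in ordered form, but reads from the file will be ordered in the same way as the actions_avail file.

              It is important to note that the value of actions_logged does not prevent certain filter return actions from being logged when the audit subsystem is configured to audit a task. If the action is not found in the actions_logged file, the final decision on whether to audit the action for that task is ultimately left up to the audit subsystem to decide for all filter return actions other than SECCOMP_RET_ALLOW.

              The "allow" string is not accepted in the actions_logged file, as it is not possible to log SECCOMP_RET_ALLOW actions. Attempting to write "allow" to the file will fail with the error EINVAL.

   Audit logging of seccomp actions
       Since Linux 4.14, the kernel provides the facility to log the actions returned by seccomp filters in the audit log. The kernel makes the decision to log an action based on the action type, whether or not the action is present in the actions_logged file, and whether kernel auditing is enabled (e.g., via the kernel boot option audit=1). The rules are as follows:

       o  If the action is SECCOMP_RET_ALLOW, the action is not logged.

       o  Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or SECCOMP_RET_KILL_THREAD, and that action appears in the actions_logged file, the action is logged.

       o  Otherwise, if the filter has requested logging (the SECCOMP_FILTER_FLAG_LOG flag) and the action appears in the actions_logged file, the action is logged.

       o  Otherwise, if kernel auditing is enabled and the process is being audited (autrace(8)), the action is logged.

       o  Otherwise, the action is not logged.

RETURN VALUE
       On success, seccomp() returns 0. On error, if SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the thread that caused the synchronization failure. (This ID is a kernel thread ID of the type returned by clone(2) and gettid(2).) On other errors, -1 is returned, and errno is set to indicate the error.

ERRORS
       seccomp() can fail for the following reasons:

       EACCES The caller did not have the CAP_SYS_ADMIN capability in its user namespace, or had not set no_new_privs before using SECCOMP_SET_MODE_FILTER.

       EBUSY  While installing a new filter, the SECCOMP_FILTER_FLAG_NEW_LISTENER flag was specified, but a previous filter had already been installed with that flag.

       EFAULT args was not a valid address.

       EINVAL operation is unknown or is not supported by this kernel version or configuration.

       EINVAL The specified flags are invalid for the given operation.

       EINVAL operation included BPF_ABS, but the specified offset was not aligned to a 32-bit boundary or exceeded sizeof(struct seccomp_data).

       EINVAL A secure computing mode has already been set, and operation differs from the existing setting.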
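
       EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter program pointed to by args was not valid or the length of the filter program was zero or exceeded BPF_MAXINSNS (4096) instructions.

       ENOMEM Out of memory.

       ENOMEM The total length of all filter programs attached to the calling thread would exceed MAX_INSNS_PER_PATH (32768) instructions. Note that for the purposes of calculating this limit, each already existing filter program incurs an overhead penalty of 4 instructions.

       EOPNOTSUPP
              operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel does not support the filter return action specified by args.

       ESRCH  Another thread caused a failure during thread sync, but its ID could not be determined.

STANDARDS
       Linux.

HISTORY
       Linux 3.17.

NOTES
       Rather than hand-coding seccomp filters as shown in the example below, you may prefer to employ the libseccomp library, which provides a front-end for generating seccomp filters.

       The Seccomp field of the /proc/pid/status file provides a method of viewing the seccomp mode of a process; see proc(5).

       seccomp() provides a superset of the functionality provided by the prctl(2) PR_SET_SECCOMP operation (which does not support flags).

       Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can be used to dump a process's seccomp filters.

   Architecture support for seccomp BPF
       Architecture support for seccomp BPF filtering is available on the following architectures:

       o  x86-64, i386, x32 (since Linux 3.5)
       o  ARM (since Linux 3.8)
       o  s390 (since Linux 3.8)
       o  MIPS (since Linux 3.16)
       o  ARM-64 (since Linux 3.19)
       o  PowerPC (since Linux 4.3)
       o  Tile (since Linux 4.3)
       o  PA-RISC (since Linux 4.6)

   Caveats
       There are various subtleties to consider when applying seccomp filters to a program, including the following:

       o  Some traditional system calls have user-space implementations in the vdso(7) on many architectures. Notable examples include clock_gettime(2), gettimeofday(2), and time(2). On such architectures, seccomp filtering for these system calls will have no effect. (However, there are cases where the vdso(7) implementations may fall back to invoking the true system call, in which case seccomp filters would see the system call.)

       o  Seccomp filtering is based on system call numbers. However, applications typically do not directly invoke system calls, but instead call wrapper functions in the C library which in turn invoke the system calls. Consequently, one must be aware of the following:

          o  The glibc wrappers for some traditional system calls may actually employ system calls with different names in the kernel. For example, the exit(2) wrapper function actually employs the exit_group(2) system call, and the fork(2) wrapper function actually calls clone(2).

          o  The behavior of wrapper functions may vary across architectures, according to the range of system calls provided on those architectures. In other words, the same wrapper function may invoke different system calls on different architectures.

          o  Finally, the behavior of wrapper functions can change across glibc versions. For example, in older versions, the glibc wrapper function for open(2) invoked the system call of the same name, but starting in glibc 2.26, the implementation switched to calling openat(2) on all architectures.

       The consequence of the above points is that it may be necessary to filter for a system call other than might be expected. Various manual pages in Section 2 provide helpful details about the differences between wrapper functions and the underlying system calls in subsections entitled C library/kernel differences.

       Furthermore, note that the application of seccomp filters even risks causing bugs in an application, when the filters cause unexpected failures for legitimate operations that the application might need to perform. Such bugs may not easily be discovered when testing the seccomp filters if the bugs occur in rarely used application code paths.

   Seccomp-specific BPF details
       Note the following BPF details specific to seccomp filters:

       o  The BPF_H and BPF_B size modifiers are not supported: all operations must load and store (4-byte) words (BPF_W).

       o  To access the contents of the seccomp_data buffer, use the BPF_ABS addressing mode modifier.

       o  The BPF_LEN addressing mode modifier yields an immediate mode operand whose value is the size of the seccomp_data buffer.

EXAMPLES
       The program below accepts four or more arguments. The first three arguments are a system call number, a numeric architecture identifier, and an error number. The program uses these values to construct a BPF filter that is used at run time to perform the following checks:

       o  If the program is not running on the specified architecture, the BPF filter terminates the process (the filter returns SECCOMP_RET_KILL_PROCESS).

       o  If the program attempts to execute the system call with the specified number, the BPF filter causes the system call to fail, with errno being set to the specified error number.

       The remaining command-line arguments specify the pathname and additional arguments of a program that the example program should attempt to execute using execv(3) (a library function that employs the execve(2) system call). Some example runs of the program are shown below.

       First, we display the architecture that we are running on (x86-64) and then construct a shell function that looks up system call numbers on this architecture:

           $ uname -m
           x86_64
           $ syscall_nr() {
               cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
               awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
           }

       When the BPF filter rejects a system call (the second case above), it causes the system call to fail with the error number specified on the command line. In the experiments shown here, we'll use error number 99:

           $ errno 99
           EADDRNOTAVAIL 99 Cannot assign requested address

       In the following example, we attempt to run the command whoami(1), but the BPF filter rejects the execve(2) system call, so that the command is not even executed:

           $ syscall_nr execve
           59
           $ ./a.out
           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                            AUDIT_ARCH_X86_64: 0xC000003E
           $ ./a.out 59 0xC000003E 99 /bin/whoami
           execv: Cannot assign requested address

       In the next example, the BPF filter rejects the write(2) system call, so that, although it is successfully started, the whoami(1) command is not able to write output:

           $ syscall_nr write
           1
           $ ./a.out 1 0xC000003E 99 /bin/whoami

       In the final example, the BPF filter rejects a system call that is not used by the whoami(1) command, so it is able to successfully execute and produce output:

           $ syscall_nr preadv
           295
           $ ./a.out 295 0xC000003E 99 /bin/whoami
           cecilia

   Program source
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>
       #include <unistd.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases. Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno'. */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char *argv[])
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           if (install_filter(strtol(argv[1], NULL, 0),
                              strtoul(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }

SEE ALSO
       bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), seccomp_unotify(2), sigaction(2), proc(5), signal(7), socket(7)

       Various pages from the libseccomp library, including: scmp_sys_resolver(1), seccomp_export_bpf(3), seccomp_init(3), seccomp_load(3), and seccomp_rule_add(3).

       The kernel source files Documentation/networking/filter.txt and Documentation/userspace-api/seccomp_filter.rst (or Documentation/prctl/seccomp_filter.txt before Linux 4.13).

       McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New Architecture for User-level Packet Capture, Proceedings of the USENIX Winter 1993 Conference

TRANSLATION
       The Russian translation of this manual page was created by Alexander Golubev, Azamat Hackimov, Hotellook, Nikita, Spiros Georgaras, Vladislav, and Yuri Kozlov.

       This translation is free documentation; see the GNU General Public License, version 3 or later, for the copyright conditions. No liability is assumed.

       If you find errors in the translation of this manual page, please send an e-mail to the translators' mailing list.

Linux man-pages 6.06              31 October 2023              seccomp(2)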