eBPF Implementation in Kernel
Published:
Unprivileged eBPF
The kernel allows an unprivileged user to load eBPF programs if /proc/sys/kernel/unprivileged_bpf_disabled is 0.
The implementation is in __sys_bpf in linux/kernel/bpf/syscall.c.
static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
{
	union bpf_attr attr;
	bool capable;
	int err;

	capable = bpf_capable() || !sysctl_unprivileged_bpf_disabled;
	...
}
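The effect of that expression can be modeled from user space. The following is a minimal sketch, not kernel code: may_use_bpf and read_unprivileged_bpf_disabled are hypothetical helper names introduced here, and the first one simply mirrors the boolean expression quoted above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper mirroring the quoted kernel expression:
 *   capable = bpf_capable() || !sysctl_unprivileged_bpf_disabled;
 * Any non-zero sysctl value disables unprivileged use. */
static bool may_use_bpf(bool has_bpf_capability, int unprivileged_bpf_disabled)
{
    return has_bpf_capability || !unprivileged_bpf_disabled;
}

/* Hypothetical helper: read the sysctl through procfs.
 * Returns the integer value, or -1 if the file cannot be read or parsed. */
static int read_unprivileged_bpf_disabled(void)
{
    FILE *f = fopen("/proc/sys/kernel/unprivileged_bpf_disabled", "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling may_use_bpf(false, read_unprivileged_bpf_disabled()) gives a rough answer for a process without the BPF capability. A read failure (-1) is non-zero, so it is treated as "disabled", which is the conservative choice.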
Where is eBPF byte code allocated?
eBPF byte code is allocated in the vmalloc region. The call path is bpf_prog_load -> bpf_prog_alloc -> bpf_prog_alloc_no_stats -> __vmalloc.
bpf_prog_load also calls bpf_check. Does the kernel set the program read-only before the check?
RO Hardening for Byte Code
The read-only protection is applied only after verification. After bpf_check, bpf_prog_load calls bpf_prog_select_runtime, which calls bpf_prog_lock_ro to set the data structure and the byte code read-only.
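The same "allocate, check, then lock read-only" ordering can be shown with a user-space analogy. This is only a sketch under stated assumptions: mmap and mprotect stand in for the kernel's allocation and bpf_prog_lock_ro, check_image is a hypothetical placeholder for the verifier, and load_then_lock_ro is a name made up for this example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the verifier: accepts any non-empty image. */
static int check_image(const unsigned char *image, size_t len)
{
    return (image != NULL && len > 0) ? 0 : -1;
}

/* Copy `code` into fresh page-aligned memory, check it while it is still
 * writable, then make the whole mapping read-only.
 * Returns the mapping (and its size in *mapped_len), or NULL on failure. */
static void *load_then_lock_ro(const unsigned char *code, size_t len,
                               size_t *mapped_len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t size;
    void *mem;

    if (page <= 0 || code == NULL || len == 0)
        return NULL;

    /* Round the length up to a whole number of pages. */
    size = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

    mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    memcpy(mem, code, len);

    /* Check first, lock afterwards: the image is writable until here. */
    if (check_image(mem, len) != 0 || mprotect(mem, size, PROT_READ) != 0) {
        munmap(mem, size);
        return NULL;
    }

    *mapped_len = size;
    return mem;
}
```

After a successful call, any later write to the returned memory faults, while reads still work. The caller releases the mapping with munmap.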
Where is JITed code allocated?
On x86 and RISC-V, the JITed code is in the module region. For RISC-V, this is described in Documentation/riscv/vm-layout.rst.
RISC-V Linux Kernel SV39
===================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
===================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 0000003fffffffff |  256 GB | user virtual memory, different per mm
__________________|____________|__________________|_________|______________________________________
                                                            |
____________________________________________________________|______________________________________
                  |            |                  |         |
 ffffffc6fee00000 |  -228 GB   | ffffffc6feffffff |    2 MB | fixmap
 ffffffc6ff000000 |  -228 GB   | ffffffc6ffffffff |   16 MB | PCI io
 ffffffc700000000 |  -228 GB   | ffffffc7ffffffff |    4 GB | vmemmap
 ffffffc800000000 |  -224 GB   | ffffffd7ffffffff |   64 GB | vmalloc/ioremap space
 ffffffd800000000 |  -160 GB   | fffffff6ffffffff |  124 GB | direct mapping of all physical memory
 fffffff700000000 |   -36 GB   | fffffffeffffffff |   32 GB | kasan
__________________|____________|__________________|_________|______________________________________
                                                            |
____________________________________________________________|______________________________________
                  |            |                  |         |
 ffffffff00000000 |    -4 GB   | ffffffff7fffffff |    2 GB | modules, BPF
 ffffffff80000000 |    -2 GB   | ffffffffffffffff |    2 GB | kernel
__________________|____________|__________________|_________|______________________________________
The RISC-V kernel also defines BPF_JIT_REGION_START.
For the actual JITed code memory allocation, bpf_jit_alloc_exec calls __vmalloc_node_range and passes BPF_JIT_REGION_START as the starting address.
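To check which area a given kernel address falls into, the SV39 table can be turned into a small lookup. This is a sketch for illustration: the ranges are copied from the table above, while struct vm_region, sv39_layout, and sv39_region_name are names invented here and do not exist in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the SV39 layout table; `end` is inclusive. */
struct vm_region {
    uint64_t start;
    uint64_t end;
    const char *name;
};

/* Ranges copied from the SV39 table in Documentation/riscv/vm-layout.rst
 * as quoted in this post. */
static const struct vm_region sv39_layout[] = {
    { 0x0000000000000000ULL, 0x0000003fffffffffULL, "user virtual memory" },
    { 0xffffffc6fee00000ULL, 0xffffffc6feffffffULL, "fixmap" },
    { 0xffffffc6ff000000ULL, 0xffffffc6ffffffffULL, "PCI io" },
    { 0xffffffc700000000ULL, 0xffffffc7ffffffffULL, "vmemmap" },
    { 0xffffffc800000000ULL, 0xffffffd7ffffffffULL, "vmalloc/ioremap space" },
    { 0xffffffd800000000ULL, 0xfffffff6ffffffffULL, "direct mapping" },
    { 0xfffffff700000000ULL, 0xfffffffeffffffffULL, "kasan" },
    { 0xffffffff00000000ULL, 0xffffffff7fffffffULL, "modules, BPF" },
    { 0xffffffff80000000ULL, 0xffffffffffffffffULL, "kernel" },
};

/* Return the name of the area containing `addr`, or "unlisted" if the
 * address is not covered by any row of the table. */
static const char *sv39_region_name(uint64_t addr)
{
    size_t i;

    for (i = 0; i < sizeof(sv39_layout) / sizeof(sv39_layout[0]); i++) {
        if (addr >= sv39_layout[i].start && addr <= sv39_layout[i].end)
            return sv39_layout[i].name;
    }
    return "unlisted";
}
```

With this table, an address of JITed code on RISC-V SV39 would be expected to resolve to "modules, BPF" rather than "vmalloc/ioremap space".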
On ARM64, the call path is bpf_jit_binary_alloc -> bpf_jit_alloc_exec -> vmalloc. Therefore, the JITed code is in the vmalloc memory region.
AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::

  Start              End                Size    Use
  -----------------------------------------------------------------------
  0000000000000000   0000ffffffffffff   256TB   user
  ffff000000000000   ffff7fffffffffff   128TB   kernel logical memory map
 [ffff600000000000   ffff7fffffffffff]   32TB   [kasan shadow region]
  ffff800000000000   ffff800007ffffff   128MB   modules
  ffff800008000000   fffffbffefffffff   124TB   vmalloc
  fffffbfff0000000   fffffbfffdffffff   224MB   fixed mappings (top down)
  fffffbfffe000000   fffffbfffe7fffff     8MB   [guard region]
  fffffbfffe800000   fffffbffff7fffff    16MB   PCI I/O space
  fffffbffff800000   fffffbffffffffff     8MB   [guard region]
  fffffc0000000000   fffffdffffffffff     2TB   vmemmap
  fffffe0000000000   ffffffffffffffff     2TB   [guard region]
  -----------------------------------------------------------------------
RO Hardening for JIT Code
The BPF JIT compiler sets the JITed native code to ROX (PXN clear). The call path is bpf_int_jit_compile -> bpf_jit_binary_lock_ro -> set_memory_rox -> change_memory_common.
change_memory_common clears the PXN bit, allowing native code execution in kernel mode.
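What "ROX (PXN clear)" means at the page-table-entry level can be sketched with plain bit operations. This is a simplified model, not the kernel's implementation: the bit positions (bit 7 for read-only, bit 53 for PXN) are assumptions taken from the AArch64 stage-1 descriptor format, apply_masks and make_rox are hypothetical names, and the real code walks page tables and handles further attribute bits that are ignored here.

```c
#include <stdint.h>

/* Assumed AArch64 stage-1 descriptor bits (not taken from this post). */
#define PTE_RDONLY (UINT64_C(1) << 7)   /* AP[2]: read-only */
#define PTE_PXN    (UINT64_C(1) << 53)  /* privileged execute-never */

/* Simplified model of a set-mask/clear-mask update on one entry:
 * first drop the bits in clear_mask, then add the bits in set_mask. */
static uint64_t apply_masks(uint64_t pte, uint64_t set_mask,
                            uint64_t clear_mask)
{
    pte &= ~clear_mask;
    pte |= set_mask;
    return pte;
}

/* Hypothetical helper: make an entry read-only and kernel-executable. */
static uint64_t make_rox(uint64_t pte)
{
    pte = apply_masks(pte, PTE_RDONLY, 0);  /* RO: set the read-only bit */
    return apply_masks(pte, 0, PTE_PXN);    /* X: clear PXN */
}
```

In this model, an entry that starts out writable and non-executable for the kernel (PXN set, read-only bit clear) ends up with the read-only bit set and PXN cleared, and all other bits are left untouched.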