eBPF Implementation in Kernel
Published:
Unprivileged eBPF
eBPF allows unprivileged user to load eBPF program if /proc/sys/kernel/unprivileged_bpf_disabled is 0.
The implementation is in __sys_bpf in linux/kernel/bpf/syscall.c.
static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
{
union bpf_attr attr;
bool capable;
int err;
capable = bpf_capable() || !sysctl_unprivileged_bpf_disabled;
...
}
Where is eBPF byte code allocated?
eBPF byte code is allocated in vmalloc region. The call path is bpf_prog_load -> bpf_prog_alloc –> bpf_prog_alloc_no_stats –> __vmalloc.
bpf_prog_load also calls bpf_check. Does the kernel set the program RO before the checking?
RO Hardening for Byte Code
After bpf_check, bpf_prog_load calls bpf_prog_select_runtime, which calls bpf_prog_lock_ro to set the data structure and the byte code read-only.
Where is JITed code allocated?
On x86 and RISC-V, the JITed code is in the module region, as described in Documentation/riscv/vm-layout.rst.
RISC-V Linux Kernel SV39
===================================================================================================
Start addr | Offset | End addr | Size | VM area description
===================================================================================================
| | | |
0000000000000000 | 0 | 0000003fffffffff | 256 GB | user virtual memory, different per mm
_________________|____________|__________________|_________|_______________________________________
|
___________________________________________________________|_______________________________________
| | | |
ffffffc6fee00000 | -228 GB | ffffffc6feffffff | 2 MB | fixmap
ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io
ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap
ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space
ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | direct mapping of all physical memory
fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan
_________________|____________|__________________|_________|________________________________________
|
___________________________________________________________|________________________________________
| | | |
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel
_________________|____________|__________________|_________|________________________________________
RISC-V kernel also defines BPF_JIT_REGION_START.
For the actual JITed code memory allocation, bpf_jit_alloc_exec calls __vmalloc_node_range and passes BPF_JIT_REGION_START as the starting address.
On ARM64, bpf_jit_binary_alloc -> bpf_jit_alloc_exec -> [vmalloc]. Therefore, the JITed code is in the vmalloc memory region.
AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
Start End Size Use
-----------------------------------------------------------------------
0000000000000000 0000ffffffffffff 256TB user
ffff000000000000 ffff7fffffffffff 128TB kernel logical memory map
[ffff600000000000 ffff7fffffffffff] 32TB [kasan shadow region]
ffff800000000000 ffff800007ffffff 128MB modules
ffff800008000000 fffffbffefffffff 124TB vmalloc
fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down)
fffffbfffe000000 fffffbfffe7fffff 8MB [guard region]
fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space
fffffbffff800000 fffffbffffffffff 8MB [guard region]
fffffc0000000000 fffffdffffffffff 2TB vmemmap
fffffe0000000000 ffffffffffffffff 2TB [guard region]
-----------------------------------------------------------------------
RO Hardening for JIT Code
The BPF JIT compiler sets the JITed native code to ROX (PXN clear). The call path is bpf_int_jit_compile -> bpf_jit_binary_lock_ro -> set_memory_rox -> change_memory_common
change_memory_common will clear PXN bit, allowing native code execution in kernel mode.