System call fork

The fork syscall is defined as SYSCALL_DEFINE0(fork) in kernel/fork.c. The call chain is do_fork –> copy_process, which in turn calls dup_task_struct, copy_mm, copy_thread and so on. dup_task_struct allocates memory for a new task_struct and a new thread_union, then simply points the new task's stack at the new thread_union. The contents of the parent's kernel stack are therefore never copied: the child, whether it is a kernel thread or a regular process, starts on a fresh kernel stack.

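static struct task_struct *dup_task_struct(struct task_struct *orig)
{
        struct task_struct *tsk;
        struct thread_info *ti;
        ......
        tsk = alloc_task_struct_node(node);     /* new task_struct */
        ......
        ti = alloc_thread_info_node(tsk, node); /* new thread_union (kernel stack) */
        ......
        tsk->stack = ti;                        /* point at the new stack, nothing is copied */
        ......
        return tsk;
}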
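
The effect can be modeled in user space. The sketch below is not kernel code: toy_task, toy_task_new, toy_dup_task and TOY_STACK_SIZE are hypothetical names, and zeroing the new stack is a choice made here only to keep the result easy to check. It assumes the descriptor itself is copied from the parent (which the excerpt above elides) and shows only the point made in the text: the child gets a newly allocated stack, not a copy of the parent's.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_STACK_SIZE 4096  /* hypothetical size, not the kernel's THREAD_SIZE */

/* Hypothetical stand-in for task_struct: only the fields the sketch needs. */
struct toy_task {
    int pid;
    unsigned char *stack;    /* models tsk->stack */
};

/* Create a task whose stack is filled with a recognizable byte pattern. */
struct toy_task *toy_task_new(int pid, unsigned char fill)
{
    struct toy_task *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->stack = malloc(TOY_STACK_SIZE);
    if (!t->stack) {
        free(t);
        return NULL;
    }
    memset(t->stack, fill, TOY_STACK_SIZE);
    t->pid = pid;
    return t;
}

/* Models dup_task_struct: copy the descriptor, attach a fresh stack. */
struct toy_task *toy_dup_task(const struct toy_task *orig, int new_pid)
{
    struct toy_task *tsk;

    assert(orig != NULL);
    tsk = malloc(sizeof(*tsk));
    if (!tsk)
        return NULL;
    *tsk = *orig;                            /* descriptor fields are copied */
    tsk->stack = calloc(1, TOY_STACK_SIZE);  /* stack contents are not */
    if (!tsk->stack) {
        free(tsk);
        return NULL;
    }
    tsk->pid = new_pid;
    return tsk;
}

/* Release a task and its stack. */
void toy_task_free(struct toy_task *t)
{
    if (t) {
        free(t->stack);
        free(t);
    }
}
```

Calling toy_dup_task on a parent created with toy_task_new(1, 0xAB) yields a child whose stack pointer differs from the parent's and whose stack holds none of the parent's 0xAB bytes.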

Kernel thread vs user thread

From cpu_context and pt_regs, we know that after a fork, the first time the new process is switched in, it starts executing at task_struct.thread.cpu_context.pc.

  • For a kernel thread: kernel_thread(fn, ...) –> _do_fork(xx, stack_start). The thread function pointer is passed to kernel_thread as fn, then to _do_fork as stack_start, and finally copy_thread stores it in p->thread.cpu_context.x19. cpu_context.pc itself is always set to ret_from_fork.
int copy_thread(......)
{
    ......
    if (likely(!(p->flags & PF_KTHREAD))) {
        ......
    } else { //KERNEL THREAD
        ......
        p->thread.cpu_context.x19 = stack_start; //kernel thread fn
        p->thread.cpu_context.x20 = stk_sz;
    }
    p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
    p->thread.cpu_context.sp = (unsigned long)childregs;
    ......
}

ENTRY(ret_from_fork)
    bl schedule_tail
    cbz x19, 1f    // not a kernel thread
    mov x0, x20
    blr x19    //jump to kernel thread fn
1:  get_thread_info tsk
    b ret_to_user
ENDPROC(ret_from_fork)

After the fork, when the new thread is switched in for the first time (gets the CPU), it starts at ret_from_fork, which checks x19 and, if it is non-zero, calls the function it points to with x20 as the argument.

  • For a user process it is trickier. The child has to resume right after the fork() syscall in the parent's code, which means it returns to the same user-space address as its parent. Remember that the user-space registers, including the return address, are saved in pt_regs by kernel_entry, so the child only needs a copy of the parent's pt_regs. On the way out of ret_from_fork, kernel_exit reloads the user-space registers saved in pt_regs and execution continues in user space.
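
The branch in ret_from_fork can be sketched as ordinary C. This is a user-space model only: toy_cpu_context, toy_ret_from_fork, toy_thread_fn and the toy_path values are hypothetical names, and the real code is the assembly shown above. One simplification is labeled in the comments: in the assembly, a thread function that returns falls through to label 1, whereas the model just reports which path was taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two cpu_context fields used by ret_from_fork. */
struct toy_cpu_context {
    int (*x19)(void *);   /* kernel thread fn, or NULL for a user process */
    void *x20;            /* argument passed to fn */
};

enum toy_path { TOY_RAN_KTHREAD_FN, TOY_RET_TO_USER };

/* Models the test on x19: call x19(x20) if x19 is non-zero. */
enum toy_path toy_ret_from_fork(const struct toy_cpu_context *ctx)
{
    assert(ctx != NULL);
    if (ctx->x19 != NULL) {        /* cbz x19, 1f is not taken */
        ctx->x19(ctx->x20);        /* mov x0, x20 ; blr x19 */
        /* Simplification: the assembly would now fall through to label 1. */
        return TOY_RAN_KTHREAD_FN;
    }
    return TOY_RET_TO_USER;        /* label 1: b ret_to_user */
}

/* Example thread function: increments the counter it is handed. */
int toy_thread_fn(void *arg)
{
    int *counter = arg;

    (*counter)++;
    return 0;
}
```

With x19 set to toy_thread_fn the function runs once. With x19 set to NULL nothing is called and the model reports the return-to-user path, which is the case for a forked user process.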

How can fork() return two values

We all know that fork() returns different values to the child and the parent. What is the corresponding kernel code? For the parent, fork() is just a syscall that returns a value to user space; the code is at the bottom of do_fork().

if (!IS_ERR(p)) {
        ......
        pid = get_task_pid(p, PIDTYPE_PID);
        nr = pid_vnr(pid);
        ......
} else {
        nr = PTR_ERR(p);
}
return nr;

For the child, ret_from_fork returns to user space. All user-space registers, including the PC, are saved in pt_regs, and the child's pt_regs are copied from the parent, so the child returns to the same place as its parent. The return-value register is x0, and copy_thread sets childregs->regs[0] to 0, so the child sees 0 as the return value of fork().

if (likely(!(p->flags & PF_KTHREAD))) {
        *childregs = *current_pt_regs();
        childregs->regs[0] = 0;
        ......
}
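
Both return values can be observed from user space. The following is a minimal sketch using only POSIX calls (fork, getpid, waitpid, _exit); the names fork_observation and observe_fork are made up for this example. The child recognizes itself by its changed PID, not by the return value, so that the value fork() returned in the child is actually checked and not assumed.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What each side of a single fork() call observed. */
struct fork_observation {
    pid_t parent_saw;     /* return value of fork() in the parent */
    int child_saw_zero;   /* 1 if fork() returned 0 in the child */
};

/* Fork once and collect both views. Returns 0 on success, -1 on failure. */
int observe_fork(struct fork_observation *out)
{
    int status;
    pid_t self, ret;

    assert(out != NULL);
    self = getpid();               /* identity of the caller before forking */
    ret = fork();
    if (ret < 0)
        return -1;

    if (getpid() != self)          /* different PID: this is the child */
        _exit(ret == 0 ? 0 : 1);   /* report what fork() returned here */

    /* Parent: ret must be the PID of a child we can wait for. */
    if (waitpid(ret, &status, 0) != ret)
        return -1;
    out->parent_saw = ret;
    out->child_saw_zero = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return 0;
}
```

After a successful call, parent_saw holds the child's PID (the nr computed in do_fork) and child_saw_zero is 1, reflecting the 0 that copy_thread stored in the child's saved x0.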

fork vs vfork

fork and vfork are both implemented by do_fork; only the flags differ. For memory management, the difference shows up in copy_mm: if the CLONE_VM flag is set, as it is for vfork, the mm of the new process points to its parent's mm.

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
        struct mm_struct *mm, *oldmm;
        int retval;
        ......
        tsk->mm = NULL;
        tsk->active_mm = NULL;
        ......
        if (clone_flags & CLONE_VM) {
                atomic_inc(&oldmm->mm_users);
                mm = oldmm;
                goto good_mm;
        }
 
        retval = -ENOMEM;
        mm = dup_mm(tsk);
        if (!mm)
                goto fail_nomem;
 
good_mm:
        tsk->mm = mm;
        tsk->active_mm = mm;
        return 0;
 
fail_nomem:
        return retval;
}

If CLONE_VM is set, dup_mm is skipped. Otherwise dup_mm calls dup_mmap, which copies every vma and, through copy_page_range, the corresponding page tables.
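
The dup_mm path (no CLONE_VM) has a visible consequence in user space: a write made by the child does not reach the parent. The sketch below covers only that plain fork() case, using POSIX calls; counter and value_after_child_write are names invented for the example. With CLONE_VM the two tasks would share one mm and the write would be visible to both, but that case is not exercised here.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int counter = 1;   /* ordinary global in the process's data segment */

/*
 * Fork, let the child overwrite the global, and return the value the parent
 * sees once the child has exited. *child_value receives the value the child
 * read back after its own write. Returns -1 on failure.
 */
int value_after_child_write(int *child_value)
{
    int status;
    pid_t pid;

    assert(child_value != NULL);
    counter = 1;
    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        counter = 7;        /* modifies only the child's copy of the page */
        _exit(counter);     /* report the child's view via the exit status */
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    *child_value = WEXITSTATUS(status);
    return counter;         /* the parent's copy is untouched */
}
```

The child reads back 7 while the parent still reads 1, which is what one would expect when each task has its own mm.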