Understanding process and interrupt contexts

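Contexts are classified as follows:

kernel code
- interrupt context: entered in response to an interrupt, typically raised by hardware
- process context: entered via a system call or an exception
user space
- user context

In the material that follows, keep track of which of these three contexts is being discussed.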

Understanding the basics of the process VAS

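Roughly speaking, the virtual address space of a process looks like this:

Text segment: where the machine code lives
Data segment
- Initialized data segment: variables that have been initialized
- Uninitialized data segment: variables that have not been initialized; often called the bss
Heap segment: memory obtained dynamically with malloc() lives here (the notes also place mmap() regions here, although those are strictly separate mappings)
Libraries (text, data)
Stack: this region tracks the chain of function calls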
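As a rough user-space illustration of the layout above, the sketch below prints one sample address from each region. It assumes a Linux toolchain such as gcc with glibc; the names init_var, uninit_var and show_vas_samples() are made up for this example, and the actual addresses change from run to run because of ASLR.

```c
#include <stdio.h>
#include <stdlib.h>

int init_var = 42;  /* initialized data segment */
int uninit_var;     /* uninitialized data segment (bss); zero-filled at load */

/* Print one sample address per region; returns 0 on success, -1 on failure. */
int show_vas_samples(void)
{
    int on_stack = 0;                         /* lives in the current stack frame */
    int *on_heap = malloc(sizeof(*on_heap));  /* small allocations come from the heap */

    if (!on_heap)
        return -1;

    printf("text  : %p\n", (void *)show_vas_samples);
    printf("data  : %p\n", (void *)&init_var);
    printf("bss   : %p\n", (void *)&uninit_var);
    printf("heap  : %p\n", (void *)on_heap);
    printf("stack : %p\n", (void *)&on_stack);

    free(on_heap);
    return 0;
}
```

Calling show_vas_samples() from main() and comparing the printed addresses with /proc/<pid>/maps for the same process shows which mapping each one falls into.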

Organizing processes, threads, and their stacks – user and kernel space

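A thread can be thought of as a set of registers plus a stack; every other resource is shared with the rest of its process. The book focuses on threads. The original Unix philosophy held that: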

Everything is a process; if it’s not a process, it’s a file

That statement is still broadly true today, but

The thread, not the process, is the kernel schedulable entity

is a better fit for how things work today.

Every thread has a corresponding task structure (also called the process descriptor).

The next key point is:

we require one stack per thread per privilege level supported by the CPU

which leads to the next conclusion:

every user space thread alive has two stacks

A user space stack
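A kernel space stack: used only after the thread has entered kernel mode

A kernel thread, on the other hand, has only a kernel space stack.

The overall picture looks like this; the book's countem.sh script counts the threads alive on the system: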

cd ~/Linux-Kernel-Programming/ch6/

./countem.sh

The counts printed above show that:

# of total threads == # of kernel threads + # of user threads

User space organization

Start with user space: every process always has a main thread, and a process may have multiple threads.

Each process has roughly the following regions:

Text: code r-x
Data segments: rw-, comprising
1. initialized data segment
2. uninitialized data segment (or bss)
3. ‘upward-growing’ heap
Library mappings
Downward-growing stack(s)

Every user space thread has both a user space stack and a kernel space stack.

Kernel space organization

A kernel thread has only a kernel-mode stack.

Summarizing the current situation

Task structures:
- Every thread (user or kernel) has a corresponding task struct
Stacks:
- A user mode thread has two stacks
  - one user mode stack
  - one kernel mode stack
- A pure kernel mode thread has only one kernel mode stack

Viewing the user and kernel stacks

When debugging, you often need to inspect what is on the stack, because the stack records the current execution context.

Traditional approach to viewing the stacks

Viewing the kernel space stack of a given thread or process

(base) user@thinkpad:~$ pgrep bash 
8762
(base) user@thinkpad:~$ sudo cat /proc/8762/stack
[<0>] do_wait+0x171/0x310
[<0>] kernel_wait4+0xaf/0x150
[<0>] __do_sys_wait4+0x89/0xa0
[<0>] __x64_sys_wait4+0x1c/0x30
[<0>] x64_sys_call+0x1c2e/0x1fa0
[<0>] do_syscall_64+0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0xd6

Or in one step (this works only when pgrep matches a single bash process):

(base) user@thinkpad:~$ sudo cat /proc/$(pgrep bash)/stack
[<0>] do_wait+0x171/0x310
[<0>] kernel_wait4+0xaf/0x150
[<0>] __do_sys_wait4+0x89/0xa0
[<0>] __x64_sys_wait4+0x1c/0x30
[<0>] x64_sys_call+0x1c2e/0x1fa0
[<0>] do_syscall_64+0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0xd6 # <-- stack bottom

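Note that the output order is the reverse of the layout in memory: the most recent frame is printed first, so in this example entry_SYSCALL_64_after_hwframe sits at the bottom of the stack.

The output shows that bash is currently executing do_wait(), and that it got there through a system call.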

Viewing the user space stack of a given thread or process

Somewhat ironically, viewing the user space stack is harder than viewing the kernel space stack.

user@thinkpad:~$ sudo gdb -p 8762 -batch -ex "thread apply all bt"
Thread 1 (Thread 0x7f29b6a09740 (LWP 8762) "bash"):
#0  0x00007f29b6af63ea in __GI___wait4 (pid=-1, stat_loc=0x7ffc2376e500, options=10, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
#1  0x0000556331e9b135 in ?? ()
#2  0x0000556331dfb6a2 in wait_for ()
#3  0x0000556331de37aa in execute_command_internal ()
#4  0x0000556331de41b8 in execute_command ()
#5  0x0000556331dd53cb in reader_loop ()
#6  0x0000556331dc6c46 in main ()
[Inferior 1 (process 8762) detached]
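When the process is a program you control, it can also print its own user space stack. This is a minimal sketch, assuming glibc: backtrace() and backtrace_symbols() come from <execinfo.h>, a GNU extension, and dump_user_stack() is a made-up helper name. As in the gdb output above, frame 0 is the innermost call.

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32

/* Print the calling thread's user space stack; returns the frame count or -1. */
int dump_user_stack(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);       /* collect return addresses */
    char **syms = backtrace_symbols(frames, n);  /* one malloc'ed array of strings */

    if (!syms)
        return -1;
    for (int i = 0; i < n; i++)
        printf("#%d  %s\n", i, syms[i]);
    free(syms);
    return n;
}
```

Function names of the program itself usually appear only when it is linked with -rdynamic; otherwise the lines show the binary name and raw addresses, much like the ?? frame in the gdb backtrace.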

[e]BPF – the modern approach to viewing both stacks

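The approaches above are the older ones; the more common way today is eBPF, for example with stackcount from the BCC tools (packaged as stackcount-bpfcc on Ubuntu and Debian):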

sudo stackcount-bpfcc -p 29819 -r ".*malloc.*" -v -d

The 10,000-foot view of the process VAS

Understanding and accessing the kernel task structure

Every thread has a corresponding task struct, which records the basic information about that thread.

Looking into the task structure

task_struct is defined in include/linux/sched.h:

cd ${KSRC}
vim include/linux/sched.h

Reading both (1) the source and (2) the book's annotations on it gives a much better feel for task_struct.

Accessing the task structure with current

The current macro yields a pointer to the task_struct of the task running on this CPU; its implementation is highly architecture-specific:

user@ubuntu:~/kernels/linux-5.4/arch$ find . -name "current.h"
./x86/include/asm/current.h
./xtensa/include/asm/current.h
./nds32/include/asm/current.h
./ia64/include/asm/current.h
./arc/include/asm/current.h
./microblaze/include/asm/current.h
./arm64/include/asm/current.h
./powerpc/include/asm/current.h
./m68k/include/asm/current.h
./riscv/include/asm/current.h
./sparc/include/asm/current.h
./s390/include/asm/current.h

For example, the arm64 implementation:

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __ASM_CURRENT_H
#define __ASM_CURRENT_H

#include <linux/compiler.h>

#ifndef __ASSEMBLY__

struct task_struct;

/*
 * We don't use read_sysreg() as we want the compiler to cache the value where
 * possible.
 */
static __always_inline struct task_struct *get_current(void)
{
    unsigned long sp_el0;

    asm ("mrs %0, sp_el0" : "=r" (sp_el0));

    return (struct task_struct *)sp_el0;
}

#define current get_current()

#endif /* __ASSEMBLY__ */

#endif /* __ASM_CURRENT_H */

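It is used as follows:

#include <linux/sched.h>
#include <linux/printk.h>
/* current is a struct task_struct *, so its members are read with -> */
pr_info("PID %d, name %s\n", current->pid, current->comm);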

Determining the context

Kernel code runs in one of the following two contexts:

Process (or task) context
Interrupt (or atomic) context

#include <linux/preempt.h>
in_task()

in_task() returns a boolean:

returns true: process context (sleeping is usually allowed here)
returns false: interrupt context (sleeping is not allowed here)

current is only considered valid when running in process context

Working with the task structure via current

cd /home/user/Linux-Kernel-Programming/ch6/current_affairs
vim current_affairs.c

/*
 * ch6/current_affairs/current_affairs.c
 ***************************************************************
 * This program is part of the source code released for the book
 *  "Linux Kernel Programming"
 *  (c) Author: Kaiwan N Billimoria
 *  Publisher:  Packt
 *  GitHub repository:
 *  https://github.com/PacktPublishing/Linux-Kernel-Programming
 *
 * From: Ch 6: Kernel and Memory Management Internals -Essentials
 ****************************************************************
 * Brief Description:
 *
 * For details, please refer the book, Ch 6.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>    /* current() */
#include <linux/preempt.h>  /* in_task() */
#include <linux/cred.h>     /* current_{e}{u,g}id() */
#include <linux/uidgid.h>   /* {from,make}_kuid() */

#define OURMODNAME   "current_affairs"

MODULE_AUTHOR("Kaiwan N Billimoria");
MODULE_DESCRIPTION("LKP book:ch6/current_affairs: display a few members of"
" the current process' task structure");
MODULE_LICENSE("Dual MIT/GPL");
MODULE_VERSION("0.1");

static inline void show_ctx(char *nm)
{
    /* Extract the task UID and EUID using helper methods provided */
    unsigned int uid = from_kuid(&init_user_ns, current_uid());
    unsigned int euid = from_kuid(&init_user_ns, current_euid());

    pr_info("%s:%s():%d ", nm, __func__, __LINE__);
    if (likely(in_task())) {
        pr_info("%s: in process context ::\n"
            " PID         : %6d\n"
            " TGID        : %6d\n"
            " UID         : %6u\n"
            " EUID        : %6u (%s root)\n"
            " name        : %s\n"
            " current (ptr to our process context's task_struct) :\n"
            "               0x%pK (0x%px)\n"
            " stack start : 0x%pK (0x%px)\n", nm,
            /* always better to use the helper methods provided */
            task_pid_nr(current), task_tgid_nr(current),
            /* ... rather than the 'usual' direct lookups:
             * current->pid, current->tgid,
             */
            uid, euid,
            (euid == 0 ? "have" : "don't have"),
            current->comm,
            current, current,
            current->stack, current->stack);
    } else {
        pr_alert("%s: in interrupt context [Should NOT Happen here!]\n", nm);
    }
}

static int __init current_affairs_init(void)
{
    pr_info("%s: inserted\n", OURMODNAME);
    /* sizeof yields a size_t, so %zu is the matching format specifier */
    pr_info(" sizeof(struct task_struct)=%zu\n", sizeof(struct task_struct));
    show_ctx(OURMODNAME);
    return 0;       /* success */
}

static void __exit current_affairs_exit(void)
{
    show_ctx(OURMODNAME);
    pr_info("%s: removed\n", OURMODNAME);
}

module_init(current_affairs_init);
module_exit(current_affairs_exit);

This example shows how to use current. Note the usages such as:

#include <linux/sched.h>    /* current() */

[...]
current->comm,
current, current,
current->stack, current->stack
[...]

Once <linux/sched.h> is included, current can be used as a macro.

The point here is to print fields from the current process's task_struct.

Built-in kernel helper methods and optimizations

Trying out the kernel module to print process context info

cd ~/Linux-Kernel-Programming/ch6/current_affairs/
make

sudo dmesg -C
sudo insmod ./current_affairs.ko
dmesg

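As the code intends, this prints some information about the current process, which here is the insmod process, since the module's init function runs in its context.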

Seeing that the Linux OS is monolithic

Coding for security with printk

Iterating over the kernel’s task lists

All task_struct instances are chained together on a circular doubly linked list built from struct list_head, which is defined in include/linux/types.h:

cd ${KSRC}/include/linux/
vim ${KSRC}/include/linux/types.h

struct list_head {
    struct list_head *next, *prev;
};
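The list node carries no payload of its own: it is embedded inside the structure being linked, and the enclosing structure is recovered from the node pointer. The following user-space sketch mimics that idea. It is not the kernel's code: struct demo_task, list_init(), list_add_tail() and sum_pids() are stand-ins written for this example, and container_of is re-implemented locally with offsetof.

```c
#include <stddef.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for task_struct: the list node is embedded in it. */
struct demo_task {
    int pid;
    struct list_head tasks;
};

/* An empty circular list is a head that points at itself. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link 'node' in just before 'head', i.e. at the tail of the circular list. */
void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the list once, print every pid, and return the sum of the pids. */
int sum_pids(struct list_head *head)
{
    int sum = 0;

    for (struct list_head *p = head->next; p != head; p = p->next) {
        struct demo_task *t = container_of(p, struct demo_task, tasks);

        printf("pid %d\n", t->pid);
        sum += t->pid;
    }
    return sum;
}
```

Because the node is embedded, one structure can sit on several lists at once simply by containing several list_head members.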

Many macros for operating on this list are provided in include/linux/sched/signal.h (on kernels up to 4.10 they live in include/linux/sched.h):

vim /home/user/kernels/linux-5.4/include/linux/sched/signal.h

The next step is to attempt the following two tasks:

One: Iterate over the kernel task list and display all processes alive.
Two: Iterate over the kernel task list and display all threads alive.

Iterating over the task list I – displaying all processes

cd ~/Linux-Kernel-Programming/ch6/foreach/prcs_showall
make
sudo dmesg -C
sudo insmod ./prcs_showall.ko

sudo rmmod prcs_showall

Here it helps to read prcs_showall.c side by side with signal.h:

vim ~/Linux-Kernel-Programming/ch6/foreach/prcs_showall/prcs_showall.c

vim ${KSRC}/include/linux/sched/signal.h

The key piece is for_each_process() in signal.h:

#define for_each_process(p) \
    for (p = &init_task ; (p = next_task(p)) != &init_task ; )
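One detail of this loop shape is easy to miss: the step expression runs before the body is ever entered, so the anchor (init_task, PID 0) is never visited. That is why thrd_showall.c later prints the idle thread separately with disp_idle_thread() and starts its total at 1. The user-space sketch below reproduces the same loop shape on a toy ring; struct mini_task and the two macros are invented for this illustration.

```c
#include <stdio.h>

/* Hypothetical stand-in for task_struct, linked in a ring like the task list. */
struct mini_task {
    int pid;
    struct mini_task *next;
};

#define next_mini_task(p) ((p)->next)

/* Same shape as for_each_process(): start at the anchor, stop on return to it. */
#define for_each_mini_task(p, anchor) \
    for (p = (anchor); (p = next_mini_task(p)) != (anchor); )

/* Count the tasks the loop visits; the anchor itself is never visited. */
int count_visited(struct mini_task *anchor)
{
    struct mini_task *p;
    int visited = 0;

    for_each_mini_task(p, anchor) {
        printf("visiting pid %d\n", p->pid);
        visited++;
    }
    return visited;
}
```

On a ring of three tasks anchored at pid 0, the body runs for the other two tasks only; on a ring that holds just the anchor, it does not run at all.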

and how prcs_showall.c uses it:

[...]
    rcu_read_lock();
    for_each_process(p) {
        memset(tmp, 0, 128);
        n = snprintf(tmp, 128, "%-16s|%8d|%8d|%7u|%7u\n", p->comm, p->tgid, p->pid,
                 /* (old way to disp credentials): p->uid, p->euid -or-
                  * current_uid().val, current_euid().val
                  * better way using kernel helper __kuid_val():
                  */
                 __kuid_val(p->cred->uid), __kuid_val(p->cred->euid)
            );
        numread += n;
        pr_info("%s", tmp);
        //pr_debug("n=%d numread=%d tmp=%s\n", n, numread, tmp);

        cond_resched();
        total++;
    }           // for_each_process()
    rcu_read_unlock();
[...]

Iterating over the task list II – displaying all threads

The program explained here is in:

cd ~/Linux-Kernel-Programming/ch6/foreach/thrd_showall

First, look at the output it produces:

make
sudo insmod thrd_showall.ko

dmesg

Differentiating between the process and thread – the TGID and the PID

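Threads of the same process share the same TGID.
Each thread has its own, distinct PID.

The example below makes this easier to see: all 14 threads of snapd share TGID 998, and only the main thread has PID == TGID.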

user@ubuntu:~/Linux-Kernel-Programming/ch6/foreach/thrd_showall$ dmesg
[  514.765402 ] thrd_showall: inserted
[  514.765404 ] ------------------------------------------------------------------------------------------
                   TGID     PID         current           stack-start         Thread Name     MT? # thrds
               ------------------------------------------------------------------------------------------
[...]
[  514.765778 ]      998      998   0xffff96660763ae00  0xffffa89040894000             snapd   14
[  514.765780 ]      998     1267   0xffff96661cd39700  0xffffa89040b90000             snapd 
[  514.765783 ]      998     1268   0xffff96661cd38000  0xffffa8904080c000             snapd 
[  514.765786 ]      998     1269   0xffff96661cd3dc00  0xffffa89040df0000             snapd 
[  514.765788 ]      998     1270   0xffff96661cc78000  0xffffa89040ba8000             snapd 
[  514.765791 ]      998     1271   0xffff96661cd3c500  0xffffa89040df8000             snapd 
[  514.765794 ]      998     1273   0xffff966608bb8000  0xffffa89040d98000             snapd 
[  514.765797 ]      998     1274   0xffff96661cc7ae00  0xffffa890404d8000             snapd 
[  514.765799 ]      998     1298   0xffff96661c7b0000  0xffffa89041038000             snapd 
[  514.765802 ]      998     1302   0xffff9666093ec500  0xffffa89041070000             snapd 
[  514.765805 ]      998     1377   0xffff96661cc7dc00  0xffffa89040460000             snapd 
[  514.765807 ]      998     1378   0xffff96661cd3ae00  0xffffa89041058000             snapd 
[  514.765810 ]      998     1379   0xffff96661ccaae00  0xffffa89040c08000             snapd 
[  514.765813 ]      998     1380   0xffff9666093e9700  0xffffa89041060000             snapd
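The same relationship can be observed from user space, where getpid() reports what the kernel calls the TGID and gettid() reports the kernel's per-thread PID. The sketch below assumes Linux with pthreads (build with -pthread on older glibc); it calls gettid through syscall(SYS_gettid) because older glibc versions have no gettid() wrapper. struct ids, record_ids() and compare_ids() are names made up for this example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t tgid;  /* what getpid() reports: shared by all threads of the process */
    pid_t pid;   /* what gettid() reports: unique to each thread */
};

/* Record the IDs of whichever thread runs this function. */
static void *record_ids(void *arg)
{
    struct ids *out = arg;

    out->tgid = getpid();
    out->pid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Fill 'caller_ids' from the calling thread and 'worker_ids' from a new thread. */
int compare_ids(struct ids *caller_ids, struct ids *worker_ids)
{
    pthread_t th;

    record_ids(caller_ids);
    if (pthread_create(&th, NULL, record_ids, worker_ids) != 0)
        return -1;
    pthread_join(th, NULL);

    printf("caller: TGID %d PID %d\n", (int)caller_ids->tgid, (int)caller_ids->pid);
    printf("worker: TGID %d PID %d\n", (int)worker_ids->tgid, (int)worker_ids->pid);
    return 0;
}
```

When compare_ids() is called from main(), the caller line shows equal TGID and PID (the main thread), while the worker line shows the same TGID with a different PID, matching the snapd rows above.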

Iterating over the task list III – the code

Next, look at how thrd_showall.c is written:

/*
 * ch6/foreach/thrd_showall/thrd_showall.c
 ***************************************************************
 * This program is part of the source code released for the book
 *  "Linux Kernel Programming"
 *  (c) Author: Kaiwan N Billimoria
 *  Publisher:  Packt
 *  GitHub repository:
 *  https://github.com/PacktPublishing/Linux-Kernel-Programming
 *
 * From: Ch 6 : Kernel and MM Internals Essentials
 ****************************************************************
 * Brief Description:
 * This kernel module iterates over the task structures of all *threads*
 * currently alive on the box, printing out some details.
 * We use the do_each_thread() { ...  } while_each_thread() macros to do
 * so here.
 *
 * For details, please refer the book, Ch 6.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>     /* current() */
#include <linux/version.h>
#if LINUX_VERSION_CODE > KERNEL_VERSION(4, 10, 0)
#include <linux/sched/signal.h>
#endif

#define OURMODNAME   "thrd_showall"

MODULE_AUTHOR("Kaiwan N Billimoria");
MODULE_DESCRIPTION("LKP book:ch6/foreach/thrd_showall:"
" demo to display all threads by iterating over the task list");
MODULE_LICENSE("Dual MIT/GPL");
MODULE_VERSION("0.1");

/* Display just CPU 0's idle thread, i.e., the pid 0 task,
 * the (terribly named) 'swapper/n'; n = 0, 1, 2,...
 * Again, init_task is always the task structure of the first CPU's
 * idle thread, i.e., we're referencing swapper/0.
 */
static inline void disp_idle_thread(void)
{
    struct task_struct *t = &init_task;

    /* We know that the swapper is a kernel thread */
    pr_info("%8d %8d   0x%px  0x%px [%16s]\n",
        t->pid, t->pid, t, t->stack, t->comm);
}

static int showthrds(void)
{
    struct task_struct *g = NULL, *t = NULL; /* 'g' : process ptr; 't': thread ptr */
    int nr_thrds = 1, total = 1;    /* total init to 1 for the idle thread */
#define BUFMAX      256
#define TMPMAX      128
    char buf[BUFMAX], tmp[TMPMAX];
    const char hdr[] =
"------------------------------------------------------------------------------------------\n"
"    TGID     PID         current           stack-start         Thread Name     MT? # thrds\n"
"------------------------------------------------------------------------------------------\n";

    pr_info("%s", hdr);
    disp_idle_thread();

    /*
     * The do_each_thread() / while_each_thread() is a pair of macros that iterates over
     * _all_ task structures in memory.
     * The task structs are global of course; this implies we should hold a lock of some
     * sort while working on them (even if only reading!). So, doing
     *  read_lock(&tasklist_lock);
     *  [...]
     *  read_unlock(&tasklist_lock);
     * BUT, this lock - tasklist_lock - isn't exported and thus unavailable to modules.
     * So, using an RCU read lock is indicated here (this has been added later to this code).
     * FYI: a) Ch 12 and Ch 13 cover the details on kernel synchronization.
     *      b) Read Copy Update (RCU) is a complex synchronization mechanism; it's
     * conceptually explained really well within this blog article:
     *  https://reberhardt.com/blog/2020/11/18/my-first-kernel-module.html
     */
    rcu_read_lock();
    do_each_thread(g, t) {     /* 'g' : process ptr; 't': thread ptr */
        task_lock(t);

        snprintf(buf, BUFMAX-1, "%8d %8d ", g->tgid, t->pid);

        /* task_struct addr and kernel-mode stack addr */
        snprintf(tmp, TMPMAX-1, "  0x%px", t);
        /*
         * To concatenate the temp string to our buffer, we could go with the
         * strncat() here; flawfinder, though, points out this is potentially
         * dangerous; so we simply use another snprintf() to achieve the same.
         * Why not use strlcat() instead? Here, it runs into trouble - being
         * called in an atomic context, which isn't ok (due to the
         * might_sleep() within it's code)...
         */
        snprintf(buf, BUFMAX-1, "%s%s  0x%px", buf, tmp, t->stack);

        if (!g->mm) {   // kernel thread
        /* One might question why we don't use the get_task_comm() to obtain
         * the task's name here; the short reason: it causes a deadlock! We
         * shall explore this (and how to avoid it) in some detail in Ch 17 -
         * Kernel Synchronization Part 2. For now, we just do it the simple way
         */
            snprintf(tmp, TMPMAX-1, " [%16s]", t->comm);
        } else {
            snprintf(tmp, TMPMAX-1, "  %16s ", t->comm);
        }
        snprintf(buf, BUFMAX-1, "%s%s", buf, tmp);

        /* Is this the "main" thread of a multithreaded process?
         * We check by seeing if (a) it's a userspace thread,
         * (b) it's TGID == it's PID, and (c), there are >1 threads in
         * the process.
         * If so, display the number of threads in the overall process
         * to the right..
         */
        nr_thrds = get_nr_threads(g);
        if (g->mm && (g->tgid == t->pid) && (nr_thrds > 1)) {
            snprintf(tmp, TMPMAX-1, " %3d", nr_thrds);
            snprintf(buf, BUFMAX-1, "%s%s", buf, tmp);
        }

        snprintf(buf, BUFMAX-1, "%s\n", buf);
        pr_info("%s", buf);

        total++;
        memset(buf, 0, sizeof(buf));
        memset(tmp, 0, sizeof(tmp));
        task_unlock(t);
    } while_each_thread(g, t);
    rcu_read_unlock();

    return total;
}

static int __init thrd_showall_init(void)
{
    int total;

    pr_info("%s: inserted\n", OURMODNAME);
    total = showthrds();
    pr_info("%s: total # of threads on the system: %d\n",
        OURMODNAME, total);

    return 0;       /* success */
}

static void __exit thrd_showall_exit(void)
{
    pr_info("%s: removed\n", OURMODNAME);
}

module_init(thrd_showall_init);
module_exit(thrd_showall_exit);
