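Weilin (偉林) is a mid-career programmer who has worked in the telecom, mobile phone, security, and chip industries and still works on Linux development. He enjoys sharing Linux-related knowledge; his CSDN blog is pwl999, and followers are welcome.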
Contents
1. The bpf() system call
1.1 Loading a BPF program
1.1.1 BPF memory allocation
1.1.2 The BPF verifier
1.1.3 BPF JIT / kernel interpreter
1.1.4 fd allocation
1.2 BPF map operations
1.2.1 Creating a map
1.2.2 Looking up a map
1.2.3 BPF_FUNC_map_lookup_elem
1.3 Object pinning
1.3.1 bpf_obj_pin()
1.3.2 bpf_obj_get()
2. Tracing-type BPF programs
2.1 Attaching a BPF program
2.2 Running a BPF program
3. Filter-type BPF programs
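BPF literally stands for Berkeley Packet Filter, which tells you that it grew out of packet filtering. If you have no feel for BPF yet, read the reference documents first: "3.1 Berkeley Packet Filter (BPF) (Kernel Document)" and "3.2 BPF and XDP Reference Guide".
In essence it is a technique for injecting code into the kernel:
- The kernel implements a cBPF/eBPF virtual machine;
- User space writes the code to run in C, and the Clang/LLVM compiler compiles that C code into BPF object code;
- User space injects the BPF object code into the kernel through the bpf() system call;
- The kernel uses a JIT (just-in-time) compiler to translate the BPF object code into native instructions; if the current architecture has no JIT support, the kernel falls back to an interpreter, which runs less efficiently;
- The kernel provides a set of hooks in packet filtering, tracing, and similar subsystems to run BPF code. The following BPF program types are currently supported: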
static int __init register_kprobe_prog_ops(void)
{
bpf_register_prog_type(&kprobe_tl);
bpf_register_prog_type(&tracepoint_tl);
bpf_register_prog_type(&perf_event_tl);
return 0;
}
static int __init register_sk_filter_ops(void)
{
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
bpf_register_prog_type(&cg_skb_type);
return 0;
}
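What is BPF good for? It offers a way to change the kernel's processing policy flexibly without modifying kernel code.
This is very useful in areas such as packet filtering and system tracing, where rules change frequently. If the policy lives only in user space, all data has to be copied to user space, which is expensive; if the policy lives in the kernel, changing it means modifying kernel code and rebuilding the kernel, which easily introduces security problems. BPF, as a kernel code injection technique, earns its place by striking a balance between the two.
SystemTap grew by solving this same problem. It injects code into the kernel as a kernel module (.ko), which is somewhat clumsy but does solve the practical problem.
How SystemTap works: it translates script statements into C and compiles them into a kernel module. Once the module is loaded, all probed events are hooked into the kernel as kprobes; when an event fires on any processor, the handler on the corresponding hook runs. When the SystemTap session ends, the hooks are removed from the kernel and the module is unloaded. The whole process is driven by a single command, stap.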
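Since this is a technique for injecting code into the kernel, security is the top concern. We normally guard against others injecting code into the kernel through vulnerabilities, and here a door is opened on purpose. The kernel therefore imposes many rules on BPF code to make sure its errors cannot affect the kernel:
- A BPF program may contain at most BPF_MAXINSNS (4K) instructions, and the total number of steps the verifier walks through may not exceed 32K (in the 4.9 kernel this value was changed to 96K);
- Loops are forbidden in BPF code, so that a bug can never hang the kernel in an infinite loop. The total number of possible branches in one BPF program is also limited to 1K;
- To limit its scope, BPF code cannot access global variables, only local ones. A BPF program has just 512 bytes of stack. A ctx pointer is passed in at entry, and the program's data accesses are confined to ctx and the local variables on the stack;
- If a BPF program needs global data, it can only use BPF map objects. A BPF map can be accessed by user space, by BPF programs, and by the kernel; BPF programs access maps through helper functions;
- Older versions did not support BPF-to-BPF function calls, so every BPF function had to be declared always_inline. BPF-to-BPF calls are supported only from Linux kernel 4.16 and LLVM 6.0 onward;
- Although such older BPF code cannot make function calls, it can use the tail call mechanism to jump directly from one BPF program to another. It finds the target program through a map of type BPF_MAP_TYPE_PROG_ARRAY. The number of such jumps is also limited, to 32;
- BPF programs can call a set of kernel functions (helper functions) to assist with some tasks;
- Some architectures (64-bit x86_64, arm64, ppc64, s390x, mips64, sparc64, and 32-bit arm) already support a BPF JIT, which translates BPF code to native code efficiently and almost one-to-one (the eBPF instruction set was designed to resemble modern arm/x86 architectures, and the ABI is similar). If the current architecture does not support JIT, the kernel interpreter has to be used instead;
- The kernel can further harden BPF by additional means, mainly: locking the pages of the BPF program image and the JIT image read-only, blinding constants during JIT compilation (constant blinding), and restricting access to the bpf() system call.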
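These safety rules are checked mainly at load time, by the BPF verifier, in roughly two steps:
- Step 1: traverse the program's control flow with a depth-first search (DFS) and check that it forms a DAG (directed acyclic graph), i.e. that it contains no loop;
- Step 2: simulate each instruction in turn, analyzing its effect on the registers and the stack, and check whether anything can go out of bounds in the worst case (an out-of-bounds variable access, or too many instructions executed). A speed-up called pruning is applied here: if the state of the current branch at the current instruction equals, or is a subset of, the state of an already-analyzed branch at that same instruction, the current branch need not be analyzed further, because it is known to satisfy the rules.
The overall BPF development workflow is roughly as shown in the figure below: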
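The pruning rule can be illustrated with a deliberately simplified model. The sketch below is not the kernel's state comparison: it assumes a verifier state is nothing more than an array of register types, and it treats an already-verified state as covering the current one when every register that the old state had initialized holds the same type now. The names `vstate` and `state_covers` are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified register types; the real verifier tracks many more. */
enum reg_type { NOT_INIT, UNKNOWN_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_MAP_VALUE };

#define NREGS 11 /* R0..R10 */

/* Hypothetical, minimal verifier state: one type per register. */
struct vstate {
    enum reg_type regs[NREGS];
};

/*
 * Return true when 'old' (a state already proven safe at this instruction)
 * covers 'cur', so the current branch can be pruned.
 * Assumption of this sketch: a register that was NOT_INIT in the old state
 * was never read on the old path, so its current content does not matter;
 * every other register must have exactly the same type.
 */
bool state_covers(const struct vstate *old, const struct vstate *cur)
{
    for (size_t i = 0; i < NREGS; i++) {
        if (old->regs[i] == NOT_INIT)
            continue;
        if (old->regs[i] != cur->regs[i])
            return false;
    }
    return true;
}
```

When `state_covers()` returns true, the walk along the current branch can stop at this instruction; otherwise the current state has to be analyzed and recorded as a new explored state.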
1. The bpf() system call
The core code lives in the bpf() system call, so we start the analysis from its entry point.
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
int err;
if (!capable(CAP_SYS_ADMIN) && sysctl_unprivileged_bpf_disabled)
return -EPERM;
if (!access_ok(VERIFY_READ, uattr, 1))
return -EFAULT;
if (size > PAGE_SIZE) /* silly large */
return -E2BIG;
/* If we're handed a bigger struct than we know of,
* ensure all the unknown bits are 0 - i.e. new
* user-space does not rely on any kernel feature
* extensions we dont know about yet.
*/
if (size > sizeof(attr)) {
unsigned char __user *addr;
unsigned char __user *end;
unsigned char val;
addr = (void __user *)uattr + sizeof(attr);
end = (void __user *)uattr + size;
for (; addr < end; addr++) {
err = get_user(val, addr);
if (err)
return err;
if (val)
return -E2BIG;
}
size = sizeof(attr);
}
/* copy attributes from user space, may be less than sizeof(bpf_attr) */
if (copy_from_user(&attr, uattr, size) != 0)
return -EFAULT;
switch (cmd) {
case BPF_MAP_CREATE:
err = map_create(&attr);
break;
case BPF_MAP_LOOKUP_ELEM:
err = map_lookup_elem(&attr);
break;
case BPF_MAP_UPDATE_ELEM:
err = map_update_elem(&attr);
break;
case BPF_MAP_DELETE_ELEM:
err = map_delete_elem(&attr);
break;
case BPF_MAP_GET_NEXT_KEY:
err = map_get_next_key(&attr);
break;
case BPF_PROG_LOAD:
err = bpf_prog_load(&attr);
break;
case BPF_OBJ_PIN:
err = bpf_obj_pin(&attr);
break;
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
#ifdef CONFIG_CGROUP_BPF
case BPF_PROG_ATTACH:
err = bpf_prog_attach(&attr);
break;
case BPF_PROG_DETACH:
err = bpf_prog_detach(&attr);
break;
#endif
default:
err = -EINVAL;
break;
}
return err;
}
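The forward-compatibility check at the top of the system call (the `size > sizeof(attr)` branch) can be reproduced in user space. The sketch below assumes the attribute block is already in an ordinary buffer, so it skips `get_user()`; the function name `check_attr_tail` and the parameter `known_size` (standing in for `sizeof(attr)`) are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Accept a caller-supplied attribute block that is larger than the layout
 * we know about only if every byte beyond the known part is zero.
 * Returns 0 when the block is acceptable and -E2BIG otherwise, mirroring
 * the error code used by the system call above.
 */
int check_attr_tail(const unsigned char *uattr, size_t size, size_t known_size)
{
    for (size_t i = known_size; i < size; i++) {
        if (uattr[i] != 0)
            return -E2BIG; /* caller relies on a field we do not know */
    }
    return 0;
}
```

The design choice is that a newer user-space program may pass a larger structure to an older kernel as long as it does not actually use any of the new fields.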
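1.1 Loading a BPF program
The BPF_PROG_LOAD command loads a BPF program into the kernel:
- Copy the program into the kernel;
- Verify that it is safe;
- JIT-compile it if possible;
- Allocate a file descriptor (fd) for it.
Once all this is done, the program is later attached to the hook where it should run.
1.1.1 BPF memory allocation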
static int bpf_prog_load(union bpf_attr *attr)
{
enum bpf_prog_type type = attr->prog_type;
struct bpf_prog *prog;
int err;
char license[128];
bool is_gpl;
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
/* copy eBPF program license from user space */
/* (1.1) Copy the license string from the user-space address attr->license into the kernel */
if (strncpy_from_user(license, u64_to_ptr(attr->license),
sizeof(license) - 1) < 0)
return -EFAULT;
license[sizeof(license) - 1] = 0;
/* eBPF programs must be GPL compatible to use GPL-ed functions */
/* (1.2) Check whether the license is GPL-compatible */
is_gpl = license_is_gpl_compatible(license);
/* (1.3) Reject the program if its instruction count reaches BPF_MAXINSNS (4K) */
if (attr->insn_cnt >= BPF_MAXINSNS)
return -EINVAL;
/* (1.4) When loading a BPF_PROG_TYPE_KPROBE program, the kernel version it specifies must match the running kernel.
Otherwise, because the kernel has changed, the program might be attached at the wrong address.
*/
if (type == BPF_PROG_TYPE_KPROBE &&
attr->kern_version != LINUX_VERSION_CODE)
return -EINVAL;
/* (1.5) Loading any program type other than BPF_PROG_TYPE_SOCKET_FILTER and BPF_PROG_TYPE_CGROUP_SKB requires CAP_SYS_ADMIN */
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
!capable(CAP_SYS_ADMIN))
return -EPERM;
/* plain bpf_prog allocation */
/* (2.1) Allocate the bpf_prog and bpf_prog->aux memory according to the number of BPF instructions */
prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
if (!prog)
return -ENOMEM;
/* (2.2) Account the whole bpf_prog allocation against the memlock limit (RLIMIT_MEMLOCK) */
err = bpf_prog_charge_memlock(prog);
if (err)
goto free_prog_nouncharge;
prog->len = attr->insn_cnt;
err = -EFAULT;
/* (2.3) Copy the BPF code from the user-space address attr->insns to the kernel address prog->insns */
if (copy_from_user(prog->insns, u64_to_ptr(attr->insns),
prog->len * sizeof(struct bpf_insn)) != 0)
goto free_prog;
prog->orig_prog = NULL;
prog->jited = 0;
atomic_set(&prog->aux->refcnt, 1);
prog->gpl_compatible = is_gpl ? 1 : 0;
/* find program type: socket_filter vs tracing_filter */
/* (2.4) Look up the bpf_prog_types entry that matches attr->prog_type
and assign it to bpf_prog->aux->ops, a table of operation callbacks
*/
err = find_prog_type(type, prog);
if (err < 0)
goto free_prog;
/* run eBPF verifier */
/* (3) Run the verifier to check that the BPF program is legal */
err = bpf_check(&prog, attr);
if (err < 0)
goto free_used_maps;
/* eBPF program is ready to be JITed */
/* (4) Try to JIT-compile the BPF program */
prog = bpf_prog_select_runtime(prog, &err);
if (err < 0)
goto free_used_maps;
/* (5) Allocate a file descriptor (fd) for the BPF program */
err = bpf_prog_new_fd(prog);
if (err < 0)
/* failed to allocate fd */
goto free_used_maps;
return err;
free_used_maps:
free_used_maps(prog->aux);
free_prog:
bpf_prog_uncharge_memlock(prog);
free_prog_nouncharge:
bpf_prog_free(prog);
return err;
}
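Step (1.2) decides whether the program may use GPL-only helper functions. As a rough user-space sketch of that decision, the function below compares the license string against a fixed list. The list of strings is my assumption of what `license_is_gpl_compatible()` accepts in kernels of this era and should be checked against include/linux/license.h; the name `license_ok` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when the license string counts as GPL-compatible.
 * The string list is an assumption modeled on the kernel's module
 * license check, not a copy of the kernel source.
 */
bool license_ok(const char *license)
{
    static const char *const compatible[] = {
        "GPL",
        "GPL v2",
        "GPL and additional rights",
        "Dual BSD/GPL",
        "Dual MIT/GPL",
        "Dual MPL/GPL",
    };

    for (size_t i = 0; i < sizeof(compatible) / sizeof(compatible[0]); i++) {
        if (strcmp(license, compatible[i]) == 0)
            return true; /* exact, case-sensitive match only */
    }
    return false;
}
```

A non-GPL program still loads; the result is only recorded in `prog->gpl_compatible` and consulted later when the program calls a helper marked GPL-only.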
One important data structure used by bpf_prog_load() is struct bpf_prog:
struct bpf_prog {
u16 pages; /* Number of allocated pages */
kmemcheck_bitfield_begin(meta);
u16 jited:1, /* Is our filter JIT'ed? */
gpl_compatible:1, /* Is filter GPL compatible? */
cb_access:1, /* Is control block accessed? */
dst_needed:1; /* Do we need dst entry? */
kmemcheck_bitfield_end(meta);
u32 len; /* Number of filter blocks */
enum bpf_prog_type type; /* Type of BPF program */
struct bpf_prog_aux *aux; /* Auxiliary fields */
struct sock_fprog_kern *orig_prog; /* Original BPF program */
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct bpf_insn *filter);
/* Instructions for interpreter */
union {
struct sock_filter insns[0];
struct bpf_insn insnsi[0];
};
};
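Its important members are:
- len: the number of BPF instructions in the program;
- type: the type of this BPF program (kprobe/tracepoint/perf_event/sk_filter/sched_cls/sched_act/xdp/cg_skb);
- aux: auxiliary data, mainly used by the verifier for checking and rewriting;
- orig_prog: the original BPF program (a struct sock_fprog_kern); bpf_prog_load() sets it to NULL;
- bpf_func: the program's run-time entry point. If JIT compilation succeeds, it points to the JIT-compiled image; otherwise it points to __bpf_prog_run(), the common entry of the kernel interpreter;
- insnsi[]: storage for the original BPF instructions copied from user space.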
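The role of bpf_func as a single entry point that hides whether the program was JIT-compiled can be shown with a small model. This is only a sketch under stated assumptions: `struct prog`, `select_runtime`, `run_prog`, and the two stub entry functions are hypothetical stand-ins, and the stubs return fixed values instead of executing anything.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_insn. */
struct insn {
    int code;
};

/* Hypothetical, reduced version of struct bpf_prog. */
struct prog {
    unsigned int (*bpf_func)(const void *ctx, const struct insn *insns);
    int jited;
    struct insn insns[4];
};

/* Stub for the interpreter's common entry; returns 1 to identify itself. */
static unsigned int interpreter_entry(const void *ctx, const struct insn *insns)
{
    (void)ctx;
    (void)insns;
    return 1;
}

/* Stub for a JIT-compiled image; returns 2 to identify itself. */
static unsigned int jit_image_entry(const void *ctx, const struct insn *insns)
{
    (void)ctx;
    (void)insns;
    return 2;
}

/* Default to the interpreter, then switch to the JIT image if one exists. */
void select_runtime(struct prog *p, int jit_ok)
{
    p->bpf_func = interpreter_entry;
    p->jited = 0;
    if (jit_ok) {
        p->bpf_func = jit_image_entry;
        p->jited = 1;
    }
}

/* Callers never need to know which implementation sits behind bpf_func. */
unsigned int run_prog(const struct prog *p, const void *ctx)
{
    return p->bpf_func(ctx, p->insns);
}
```

Every hook that runs a BPF program can therefore use the same call sequence, whichever execution path was selected at load time.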
1.1.2 The BPF verifier
The verifier's steps and rules are described in detail in the "eBPF verifier" section of "3.1 Berkeley Packet Filter (BPF) (Kernel Document)".
In addition, the top of kernel/bpf/verifier.c carries a detailed comment on the eBPF verifier:
/* bpf_check() is a static code analyzer that walks eBPF program
* instruction by instruction and updates register/stack state.
* All paths of conditional branches are analyzed until 'bpf_exit' insn.
*
* The first pass is depth-first-search to check that the program is a DAG.
* It rejects the following programs:
* - larger than BPF_MAXINSNS insns
* - if loop is present (detected via back-edge)
* - unreachable insns exist (shouldn't be a forest. program = one function)
* - out of bounds or malformed jumps
* The second pass is all possible path descent from the 1st insn.
* Since it's analyzing all pathes through the program, the length of the
* analysis is limited to 32k insn, which may be hit even if total number of
* insn is less then 4K, but there are too many branches that change stack/regs.
* Number of 'branches to be analyzed' is limited to 1k
*
* On entry to each instruction, each register has a type, and the instruction
* changes the types of the registers depending on instruction semantics.
* If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
* copied to R1.
*
* All registers are 64-bit.
* R0 - return register
* R1-R5 argument passing registers
* R6-R9 callee saved registers
* R10 - frame pointer read-only
*
* At the start of BPF program the register R1 contains a pointer to bpf_context
* and has type PTR_TO_CTX.
*
* Verifier tracks arithmetic operations on pointers in case:
* BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
* BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
* 1st insn copies R10 (which has FRAME_PTR) type into R1
* and 2nd arithmetic instruction is pattern matched to recognize
* that it wants to construct a pointer to some element within stack.
* So after 2nd insn, the register R1 has type PTR_TO_STACK
* (and -20 constant is saved for further stack bounds checking).
* Meaning that this reg is a pointer to stack plus known immediate constant.
*
* Most of the time the registers have UNKNOWN_VALUE type, which
* means the register has some value, but it's not a valid pointer.
* (like pointer plus pointer becomes UNKNOWN_VALUE type)
*
* When verifier sees load or store instructions the type of base register
* can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, FRAME_PTR. These are three pointer
* types recognized by check_mem_access() function.
*
* PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
* and the range of [ptr, ptr + map's value_size) is accessible.
*
* registers used to pass values to function calls are checked against
* function argument constraints.
*
* ARG_PTR_TO_MAP_KEY is one of such argument constraints.
* It means that the register type passed to this function must be
* PTR_TO_STACK and it will be used inside the function as
* 'pointer to map element key'
*
* For example the argument constraints for bpf_map_lookup_elem():
* .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
* .arg1_type = ARG_CONST_MAP_PTR,
* .arg2_type = ARG_PTR_TO_MAP_KEY,
*
* ret_type says that this function returns 'pointer to map elem value or null'
* function expects 1st argument to be a const pointer to 'struct bpf_map' and
* 2nd argument should be a pointer to stack, which will be used inside
* the helper function as a pointer to map element key.
*
* On the kernel side the helper function looks like:
* u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
* {
* struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
* void *key = (void *) (unsigned long) r2;
* void *value;
*
* here kernel can access 'key' and 'map' pointers safely, knowing that
* [key, key + map->key_size) bytes are valid and were initialized on
* the stack of eBPF program.
* }
*
* Corresponding eBPF program may look like:
* BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), // after this insn R2 type is FRAME_PTR
* BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
* BPF_LD_MAP_FD(BPF_REG_1, map_fd), // after this insn R1 type is CONST_PTR_TO_MAP
* BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
* here verifier looks at prototype of map_lookup_elem() and sees:
* .arg1_type == ARG_CONST_MAP_PTR and R1->type == CONST_PTR_TO_MAP, which is ok,
* Now verifier knows that this map has key of R1->map_ptr->key_size bytes
*
* Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
* Now verifier checks that [R2, R2 + map's key_size) are within stack limits
* and were initialized prior to this call.
* If it's ok, then verifier allows this BPF_CALL insn and looks at
* .ret_type which is RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets
* R0->type = PTR_TO_MAP_VALUE_OR_NULL which means bpf_map_lookup_elem() function
* returns ether pointer to map value or NULL.
*
* When type PTR_TO_MAP_VALUE_OR_NULL passes through 'if (reg != 0) goto +off'
* insn, the register holding that pointer in the true branch changes state to
* PTR_TO_MAP_VALUE and the same register changes state to CONST_IMM in the false
* branch. See check_cond_jmp_op().
*
* After the call R0 is set to return type of the function and registers R1-R5
* are set to NOT_INIT to indicate that they are no longer readable.
*/
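The register typing rules in this comment can be tried out on a toy model. The sketch below covers only two of them (MOV copies the source type, and adding a constant to FRAME_PTR yields PTR_TO_STACK); it assumes that any other addition simply degrades the register to UNKNOWN_VALUE, which is a simplification of what the real verifier does. The names `struct reg`, `init_regs`, `track_mov`, and `track_add_imm` are hypothetical.

```c
/* A few of the register types named in the comment above. */
enum reg_type { NOT_INIT, UNKNOWN_VALUE, PTR_TO_CTX, FRAME_PTR, PTR_TO_STACK };

#define NREGS 11 /* R0..R10 */

/* Hypothetical per-register state: a type plus the saved stack offset. */
struct reg {
    enum reg_type type;
    int imm;
};

/* At program start R1 holds the context pointer and R10 the frame pointer. */
void init_regs(struct reg *regs)
{
    for (int i = 0; i < NREGS; i++) {
        regs[i].type = NOT_INIT;
        regs[i].imm = 0;
    }
    regs[1].type = PTR_TO_CTX;
    regs[10].type = FRAME_PTR;
}

/* BPF_MOV64_REG(dst, src): the type of src is copied to dst. */
void track_mov(struct reg *regs, int dst, int src)
{
    regs[dst] = regs[src];
}

/*
 * BPF_ALU64_IMM(BPF_ADD, dst, imm): FRAME_PTR plus a constant becomes
 * PTR_TO_STACK and the constant is kept for later bounds checks.
 * Simplification of this sketch: any other addition gives UNKNOWN_VALUE.
 */
void track_add_imm(struct reg *regs, int dst, int imm)
{
    if (regs[dst].type == FRAME_PTR) {
        regs[dst].type = PTR_TO_STACK;
        regs[dst].imm = imm;
    } else {
        regs[dst].type = UNKNOWN_VALUE;
        regs[dst].imm = 0;
    }
}
```

Replaying the two-instruction example from the comment (move R10 into R1, then add -20) leaves R1 typed as PTR_TO_STACK with the offset -20 saved.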
The overall code flow of the BPF verifier is as follows:
int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
{
char __user *log_ubuf = NULL;
struct bpf_verifier_env *env;
int ret = -EINVAL;
if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
return -E2BIG;
/* 'struct bpf_verifier_env' can be global, but since it's not small,
* allocate/free it every time bpf_check() is called
*/
/* (3.1) Allocate the data structures the verifier needs for its static scan */
env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
if (!env)
return -ENOMEM;
env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) *
(*prog)->len);
ret = -ENOMEM;
if (!env->insn_aux_data)
goto err_free_env;
env->prog = *prog;
/* grab the mutex to protect few globals used by verifier */
mutex_lock(&bpf_verifier_lock);
/* (3.2) If the user supplied attr->log_buf, they want a detailed verification log, which is very useful on failure.
The log buffer is allocated in the kernel first and copied to user space on return.
*/
if (attr->log_level || attr->log_buf || attr->log_size) {
/* user requested verbose verifier output
* and supplied buffer to store the verification trace
*/
log_level = attr->log_level;
log_ubuf = (char __user *) (unsigned long) attr->log_buf;
log_size = attr->log_size;
log_len = 0;
ret = -EINVAL;
/* log_* values have to be sane */
if (log_size < 128 || log_size > UINT_MAX >> 8 ||
log_level == 0 || log_ubuf == NULL)
goto err_unlock;
ret = -ENOMEM;
log_buf = vmalloc(log_size);
if (!log_buf)
goto err_unlock;
} else {
log_level = 0;
}
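/* (3.3) In the instructions that operate on maps, replace each map_fd with the actual map pointer.
This implies that the user-space loader must first call bpf() to create the maps defined in __section("maps"), and only then load the other program sections;
*/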
ret = replace_map_fd_with_map_ptr(env);
if (ret < 0)
goto skip_full_check;
env->explored_states = kcalloc(env->prog->len,
sizeof(struct bpf_verifier_state_list *),
GFP_USER);
ret = -ENOMEM;
if (!env->explored_states)
goto skip_full_check;
/* (3.4) Step 1: check for loops */
ret = check_cfg(env);
if (ret < 0)
goto skip_full_check;
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
/* (3.5) Step 2: walk the BPF code in detail, tracking registers and the stack, and check for any rule violation */
ret = do_check(env);
skip_full_check:
while (pop_stack(env, NULL) >= 0);
free_states(env);
/* (3.6) Turn the dead code found by the scan (code that can never run) into nop instructions */
if (ret == 0)
sanitize_dead_code(env);
/* (3.7) Rewrite accesses to ctx members according to the program type */
if (ret == 0)
/* program is valid, convert *(u32*)(ctx + off) accesses */
ret = convert_ctx_accesses(env);
/* (3.8) Fix up calls to kernel helper functions, replacing each function number with the actual function pointer */
if (ret == 0)
ret = fixup_bpf_calls(env);
if (log_level && log_len >= log_size - 1) {
BUG_ON(log_len >= log_size);
/* verifier log exceeded user supplied buffer */
ret = -ENOSPC;
/* fall through to return what was recorded */
}
/* (3.9) Copy the verifier log to user space */
/* copy verifier log back to user space including trailing zero */
if (log_level && copy_to_user(log_ubuf, log_buf, log_len + 1) != 0) {
ret = -EFAULT;
goto free_log_buf;
}
/* (3.10) Save the program's map references into prog->aux->used_maps */
if (ret == 0 && env->used_map_cnt) {
/* if program passed verifier, update used_maps in bpf_prog_info */
env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
sizeof(env->used_maps[0]),
GFP_KERNEL);
if (!env->prog->aux->used_maps) {
ret = -ENOMEM;
goto free_log_buf;
}
memcpy(env->prog->aux->used_maps, env->used_maps,
sizeof(env->used_maps[0]) * env->used_map_cnt);
env->prog->aux->used_map_cnt = env->used_map_cnt;
/* program is valid. Convert pseudo bpf_ld_imm64 into generic
* bpf_ld_imm64 instructions
*/
convert_pseudo_ld_imm64(env);
}
free_log_buf:
if (log_level)
vfree(log_buf);
if (!env->prog->aux->used_maps)
/* if we didn't copy map pointers into bpf_prog_info, release
* them now. Otherwise free_bpf_prog_info() will release them.
*/
release_maps(env);
*prog = env->prog;
err_unlock:
mutex_unlock(&bpf_verifier_lock);
vfree(env->insn_aux_data);
err_free_env:
kfree(env);
return ret;
}
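- 1. In the instructions that operate on maps, replace each map_fd with the actual map pointer.
This implies that the user-space loader must first call bpf() to create the maps defined in __section("maps"), and only then load the other program sections.
An instruction that satisfies (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) && (insn[0].src_reg == BPF_PSEUDO_MAP_FD) is a map-pointer load instruction.
Its original immediate is treated as an fd and used to find the corresponding map pointer.
The 64-bit map pointer is then split into two 32-bit immediates, stored in insn[0].imm and insn[1].imm.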
static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
{
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
int i, j, err;
/* (3.3.1) Iterate over all BPF instructions */
for (i = 0; i < insn_cnt; i++, insn++) {
if (BPF_CLASS(insn->code) == BPF_LDX &&
(BPF_MODE(insn->code) != BPF_MEM || insn->imm != 0)) {
verbose("BPF_LDX uses reserved fields ");
return -EINVAL;
}
if (BPF_CLASS(insn->code) == BPF_STX &&
((BPF_MODE(insn->code) != BPF_MEM &&
BPF_MODE(insn->code) != BPF_XADD) || insn->imm != 0)) {
verbose("BPF_STX uses reserved fields ");
return -EINVAL;
}
/* (3.3.2) An instruction with (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) && (insn[0].src_reg == BPF_PSEUDO_MAP_FD)
is a map-pointer load instruction
*/
if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
struct bpf_map *map;
struct fd f;
if (i == insn_cnt - 1 || insn[1].code != 0 ||
insn[1].dst_reg != 0 || insn[1].src_reg != 0 ||
insn[1].off != 0) {
verbose("invalid bpf_ld_imm64 insn ");
return -EINVAL;
}
if (insn->src_reg == 0)
/* valid generic load 64-bit imm */
goto next_insn;
if (insn->src_reg != BPF_PSEUDO_MAP_FD) {
verbose("unrecognized bpf_ld_imm64 insn ");
return -EINVAL;
}
/* (3.3.3) Use the fd given by the immediate insn[0].imm to get the actual map pointer */
f = fdget(insn->imm);
map = __bpf_map_get(f);
if (IS_ERR(map)) {
verbose("fd %d is not pointing to valid bpf_map ",
insn->imm);
return PTR_ERR(map);
}
/* (3.3.4) Check that the map is compatible with this type of BPF program */
err = check_map_prog_compatibility(map, env->prog);
if (err) {
fdput(f);
return err;
}
/* (3.3.5) Split the 64-bit map pointer into two 32-bit immediates, stored in insn[0].imm and insn[1].imm */
/* store map pointer inside BPF_LD_IMM64 instruction */
insn[0].imm = (u32) (unsigned long) map;
insn[1].imm = ((u64) (unsigned long) map) >> 32;
/* check whether we recorded this map already */
for (j = 0; j < env->used_map_cnt; j++)
if (env->used_maps[j] == map) {
fdput(f);
goto next_insn;
}
/* (3.3.6) A program may reference at most 64 maps */
if (env->used_map_cnt >= MAX_USED_MAPS) {
fdput(f);
return -E2BIG;
}
/* hold the map. If the program is rejected by verifier,
* the map will be released by release_maps() or it
* will be used by the valid program until it's unloaded
* and all maps are released in free_bpf_prog_info()
*/
map = bpf_map_inc(map, false);
if (IS_ERR(map)) {
fdput(f);
return PTR_ERR(map);
}
/* (3.3.7) Record the program's reference to the map */
env->used_maps[env->used_map_cnt++] = map;
fdput(f);
next_insn:
insn++;
i++;
}
}
/* now all pseudo BPF_LD_IMM64 instructions load valid
* 'struct bpf_map *' into a register instead of user map_fd.
* These pointers will be used later by verifier to validate map access.
*/
return 0;
}
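Step (3.3.5) stores one 64-bit pointer in the immediates of two consecutive instructions. A minimal user-space sketch of that split, and of the matching reassembly, is shown below. `struct ld_imm64`, `store_ptr`, and `load_ptr` are hypothetical names, and the two signed 32-bit fields stand in for insn[0].imm and insn[1].imm.

```c
#include <stdint.h>

/* Hypothetical pair of immediates, standing in for insn[0].imm and insn[1].imm. */
struct ld_imm64 {
    int32_t imm_lo;
    int32_t imm_hi;
};

/* Split a 64-bit value into its low and high 32-bit halves. */
void store_ptr(struct ld_imm64 *insn, uint64_t ptr)
{
    insn->imm_lo = (int32_t)(uint32_t)ptr;
    insn->imm_hi = (int32_t)(uint32_t)(ptr >> 32);
}

/*
 * Reassemble the 64-bit value. Both halves are converted to uint32_t
 * first so that a negative low half is not sign-extended into the
 * upper 32 bits.
 */
uint64_t load_ptr(const struct ld_imm64 *insn)
{
    return (uint64_t)(uint32_t)insn->imm_lo |
           ((uint64_t)(uint32_t)insn->imm_hi << 32);
}
```

The conversion through `uint32_t` is the step that is easy to get wrong, because the immediate field of a BPF instruction is signed.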
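- 2. Step 1: traverse the program's control flow with a depth-first search (DFS) and check that it forms a DAG (directed acyclic graph), so that no loop exists;
For the DFS algorithm, see the article "Graph". Its most important concepts are shown in the figure below:
After a DFS traversal of a graph, each root node yields a "DFS tree", and the trees from several root nodes together form a "DFS forest". Based on the search result, the edges of the graph fall into four classes:
- Tree edges: edges that are part of a DFS tree;
- Back edges: edges from a node to one of its ancestors (these form a cycle);
- Forward edges: edges from a node straight to a non-child descendant such as a grandchild (skipping the intermediate child);
- Cross edges: edges between nodes that have no ancestor-descendant relationship, for example between leaves or between different trees.
The BPF verifier checks whether the program's control-flow graph contains any back edge, which ensures that the program has no loop.
The code is as follows: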
static int check_cfg(struct bpf_verifier_env *env)
{
struct bpf_insn *insns = env->prog->insnsi;
int insn_cnt = env->prog->len;
int ret = 0;
int i, t;
insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
if (!insn_state)
return -ENOMEM;
insn_stack = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
if (!insn_stack) {
kfree(insn_state);
return -ENOMEM;
}
insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */
insn_stack[0] = 0; /* 0 is the first instruction */
cur_stack = 1;
/* (3.4.1) Main loop of the depth-first search */
peek_stack:
if (cur_stack == 0)
goto check_state;
t = insn_stack[cur_stack - 1];
/* (3.4.2) Branch (jump-class) instructions */
if (BPF_CLASS(insns[t].code) == BPF_JMP) {
u8 opcode = BPF_OP(insns[t].code);
/* (3.4.2.1) BPF_EXIT ends the path; start backtracking */
if (opcode == BPF_EXIT) {
goto mark_explored;
/* (3.4.2.2) BPF_CALL: keep exploring,
and set env->explored_states[t + 1] to STATE_LIST_MARK so that the verifier checks for equivalent states after the call
*/
} else if (opcode == BPF_CALL) {
ret = push_insn(t, t + 1, FALLTHROUGH, env);
if (ret == 1)
goto peek_stack;
else if (ret < 0)
goto err_free;
if (t + 1 < insn_cnt)
env->explored_states[t + 1] = STATE_LIST_MARK;
/* (3.4.2.3) BPF_JA: keep exploring along the jump target,
and set env->explored_states[t + 1] to STATE_LIST_MARK so that the verifier checks for equivalent states after the jump
*/
} else if (opcode == BPF_JA) {
if (BPF_SRC(insns[t].code) != BPF_K) {
ret = -EINVAL;
goto err_free;
}
/* unconditional jump with single edge */
ret = push_insn(t, t + insns[t].off + 1,
FALLTHROUGH, env);
if (ret == 1)
goto peek_stack;
else if (ret < 0)
goto err_free;
/* tell verifier to check for equivalent states
* after every call and jump
*/
if (t + 1 < insn_cnt)
env->explored_states[t + 1] = STATE_LIST_MARK;
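/* (3.4.2.4) The rest are conditional jumps: explore the fall-through (condition false) path first, then the taken (condition true) path,
and set env->explored_states[t] to STATE_LIST_MARK so that the verifier checks for equivalent states at the conditional jump
*/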
} else {
/* conditional jump with two edges */
env->explored_states[t] = STATE_LIST_MARK;
/* fall-through path (condition false) */
ret = push_insn(t, t + 1, FALLTHROUGH, env);
if (ret == 1)
goto peek_stack;
else if (ret < 0)
goto err_free;
/* branch-taken path (condition true) */
ret = push_insn(t, t + insns[t].off + 1, BRANCH, env);
if (ret == 1)
goto peek_stack;
else if (ret < 0)
goto err_free;
}
/* (3.4.3) Non-branch instructions */
} else {
/* all other non-branch instructions with single
* fall-through edge
*/
ret = push_insn(t, t + 1, FALLTHROUGH, env);
/* (3.4.3.1) Meaning of ret:
ret == 1: a new instruction was pushed, keep exploring the path
ret == 0: nothing new to explore from here, go on to mark_explored and backtrack
ret < 0: a back-edge (loop) was detected, or another error occurred
*/
if (ret == 1)
goto peek_stack;
else if (ret < 0)
goto err_free;
}
/* (3.4.4) Mark the instruction as EXPLORED and backtrack
*/
mark_explored:
insn_state[t] = EXPLORED;
if (cur_stack-- <= 0) {
verbose("pop stack internal bug ");
ret = -EFAULT;
goto err_free;
}
goto peek_stack;
/* (3.4.5) Make sure there are no unreachable instructions, i.e. instructions that no path can reach */
check_state:
for (i = 0; i < insn_cnt; i++) {
if (insn_state[i] != EXPLORED) {
verbose("unreachable insn %d ", i);
ret = -EINVAL;
goto err_free;
}
}
ret = 0; /* cfg looks good */
err_free:
kfree(insn_state);
kfree(insn_stack);
return ret;
}
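The same idea can be reproduced on a toy instruction set. The sketch below is not the kernel algorithm: it uses recursion where check_cfg() keeps an explicit stack, and it models each instruction only by its kind and jump offset. The names `struct toy_insn`, `visit`, and `check_cfg_sketch` are hypothetical; the three states correspond to unvisited, DISCOVERED, and EXPLORED.

```c
#include <stdlib.h>

enum toy_kind { INSN_OTHER, INSN_JA, INSN_JCOND, INSN_EXIT };

/* Hypothetical toy instruction: a kind and a relative jump offset. */
struct toy_insn {
    enum toy_kind kind;
    int off;
};

enum { UNSEEN = 0, DISCOVERED, EXPLORED };

/* Depth-first visit of instruction t; returns 0 if fine, -1 on error. */
static int visit(const struct toy_insn *p, int n, int t, int *state)
{
    int succ[2];
    int nsucc = 0;

    if (t < 0 || t >= n)
        return -1;              /* jump or fall-through out of range */
    if (state[t] == DISCOVERED)
        return -1;              /* back-edge: t is still on the DFS path */
    if (state[t] == EXPLORED)
        return 0;               /* forward or cross edge, already checked */

    state[t] = DISCOVERED;
    switch (p[t].kind) {
    case INSN_EXIT:
        break;                                  /* path ends here */
    case INSN_JA:
        succ[nsucc++] = t + p[t].off + 1;       /* single jump edge */
        break;
    case INSN_JCOND:
        succ[nsucc++] = t + 1;                  /* condition false */
        succ[nsucc++] = t + p[t].off + 1;       /* condition true */
        break;
    default:
        succ[nsucc++] = t + 1;                  /* plain fall-through */
        break;
    }
    for (int i = 0; i < nsucc; i++) {
        if (visit(p, n, succ[i], state) < 0)
            return -1;
    }
    state[t] = EXPLORED;
    return 0;
}

/* Returns 0 when the program is loop-free and fully reachable, else -1. */
int check_cfg_sketch(const struct toy_insn *p, int n)
{
    int *state = calloc((size_t)(n > 0 ? n : 1), sizeof(int));
    int ret;

    if (!state)
        return -1;
    ret = visit(p, n, 0, state);
    for (int i = 0; ret == 0 && i < n; i++) {
        if (state[i] != EXPLORED)
            ret = -1;           /* unreachable instruction */
    }
    free(state);
    return ret;
}
```

A three-instruction program whose second instruction jumps back to the first is rejected at the back-edge test, and a program with an instruction after an unconditional exit is rejected by the final reachability scan.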
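- 3. Step 2: walk the BPF code in detail, tracking registers and the stack, and check for any rule violation.
The algorithm walks the paths found in step 1 again while tracking how the registers and the stack change, and decides whether any rule can be violated in the worst case.
When it reaches an instruction whose explored_states[] entry is set to STATE_LIST_MARK, it allocates a separate bpf_verifier_state_list for that instruction to store the instruction's states on the different branches.
Pruning is applied here as a speed-up: if the state cur_state of the current branch at the current instruction equals, or is a subset of, the state of another already-analyzed branch at that instruction (a bpf_verifier_state_list member in the instruction's explored_states[] list), the current branch need not be analyzed further, because it is known to satisfy the rules.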
static int do_check(struct bpf_verifier_env *env)
{
struct bpf_verifier_state *state = &env->cur_state;
struct bpf_insn *insns = env->prog->insnsi;
struct bpf_reg_state *regs = state->regs;
int insn_cnt = env->prog->len;
int insn_idx, prev_insn_idx = 0;
int insn_processed = 0;
bool do_print_state = false;
init_reg_state(regs);
insn_idx = 0;
env->varlen_map_value_access = false;
for (;;) {
struct bpf_insn *insn;
u8 class;
int err;
if (insn_idx >= insn_cnt) {
verbose("invalid insn idx %d insn_cnt %d ",
insn_idx, insn_cnt);
return -EFAULT;
}
insn = &insns[insn_idx];
class = BPF_CLASS(insn->code);
if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) {
verbose("BPF program is too large. Proccessed %d insn ",
insn_processed);
return -E2BIG;
}
err = is_state_visited(env, insn_idx);
if (err < 0)
return err;
if (err == 1) {
/* found equivalent state, can prune the search */
if (log_level) {
if (do_print_state)
verbose(" from %d to %d: safe ",
prev_insn_idx, insn_idx);
else
verbose("%d: safe ", insn_idx);
}
goto process_bpf_exit;
}
if (need_resched())
cond_resched();
if (log_level && do_print_state) {
verbose(" from %d to %d:", prev_insn_idx, insn_idx);
print_verifier_state(&env->cur_state);
do_print_state = false;
}
if (log_level) {
verbose("%d: ", insn_idx);
print_bpf_insn(env, insn);
}
err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx);
if (err)
return err;
env->insn_aux_data[insn_idx].seen = true;
if (class == BPF_ALU || class == BPF_ALU64) {
err = check_alu_op(env, insn);
if (err)
return err;
} else if (class == BPF_LDX) {
enum bpf_reg_type *prev_src_type, src_reg_type;
/* check for reserved fields is already done */
/* check src operand */
err = check_reg_arg(regs, insn->src_reg, SRC_OP);
if (err)
return err;
err = check_reg_arg(regs, insn->dst_reg, DST_OP_NO_MARK);
if (err)
return err;
src_reg_type = regs[insn->src_reg].type;
/* check that memory (src_reg + off) is readable,
* the state of dst_reg will be updated by this func
*/
err = check_mem_access(env, insn->src_reg, insn->off,
BPF_SIZE(insn->code), BPF_READ,
insn->dst_reg);
if (err)
return err;
reset_reg_range_values(regs, insn->dst_reg);
if (BPF_SIZE(insn->code) != BPF_W &&
BPF_SIZE(insn->code) != BPF_DW) {
insn_idx++;
continue;
}
prev_src_type = &env->insn_aux_data[insn_idx].ptr_type;
if (*prev_src_type == NOT_INIT) {
/* saw a valid insn
* dst_reg = *(u32 *)(src_reg + off)
* save type to validate intersecting paths
*/
*prev_src_type = src_reg_type;
} else if (src_reg_type != *prev_src_type &&
(src_reg_type == PTR_TO_CTX ||
*prev_src_type == PTR_TO_CTX)) {
/* ABuser program is trying to use the same insn
* dst_reg = *(u32*) (src_reg + off)
* with different pointer types:
* src_reg == ctx in one branch and
* src_reg == stack|map in some other branch.
* Reject it.
*/
verbose("same insn cannot be used with different pointers ");
return -EINVAL;
}
} else if (class == BPF_STX) {
enum bpf_reg_type *prev_dst_type, dst_reg_type;
if (BPF_MODE(insn->code) == BPF_XADD) {
err = check_xadd(env, insn);
if (err)
return err;
insn_idx++;
continue;
}
/* check src1 operand */
err = check_reg_arg(regs, insn->src_reg, SRC_OP);
if (err)
return err;
/* check src2 operand */
err = check_reg_arg(regs, insn->dst_reg, SRC_OP);
if (err)
return err;
dst_reg_type = regs[insn->dst_reg].type;
/* check that memory (dst_reg + off) is writeable */
err = check_mem_access(env, insn->dst_reg, insn->off,
BPF_SIZE(insn->code), BPF_WRITE,
insn->src_reg);
if (err)
return err;
prev_dst_type = &env->insn_aux_data[insn_idx].ptr_type;
if (*prev_dst_type == NOT_INIT) {
*prev_dst_type = dst_reg_type;
} else if (dst_reg_type != *prev_dst_type &&
(dst_reg_type == PTR_TO_CTX ||
*prev_dst_type == PTR_TO_CTX)) {
verbose("same insn cannot be used with different pointers\n");
return -EINVAL;
}
} else if (class == BPF_ST) {
if (BPF_MODE(insn->code) != BPF_MEM ||
insn->src_reg != BPF_REG_0) {
verbose("BPF_ST uses reserved fields\n");
return -EINVAL;
}
/* check src operand */
err = check_reg_arg(regs, insn->dst_reg, SRC_OP);
if (err)
return err;
if (is_ctx_reg(env, insn->dst_reg)) {
verbose("BPF_ST stores into R%d context is not allowed\n",
insn->dst_reg);
return -EACCES;
}
/* check that memory (dst_reg + off) is writeable */
err = check_mem_access(env, insn->dst_reg, insn->off,
BPF_SIZE(insn->code), BPF_WRITE,
-1);
if (err)
return err;
} else if (class == BPF_JMP) {
u8 opcode = BPF_OP(insn->code);
if (opcode == BPF_CALL) {
if (BPF_SRC(insn->code) != BPF_K ||
insn->off != 0 ||
insn->src_reg != BPF_REG_0 ||
insn->dst_reg != BPF_REG_0) {
verbose("BPF_CALL uses reserved fields\n");
return -EINVAL;
}
err = check_call(env, insn->imm, insn_idx);
if (err)
return err;
} else if (opcode == BPF_JA) {
if (BPF_SRC(insn->code) != BPF_K ||
insn->imm != 0 ||
insn->src_reg != BPF_REG_0 ||
insn->dst_reg != BPF_REG_0) {
verbose("BPF_JA uses reserved fields\n");
return -EINVAL;
}
insn_idx += insn->off + 1;
continue;
} else if (opcode == BPF_EXIT) {
if (BPF_SRC(insn->code) != BPF_K ||
insn->imm != 0 ||
insn->src_reg != BPF_REG_0 ||
insn->dst_reg != BPF_REG_0) {
verbose("BPF_EXIT uses reserved fields\n");
return -EINVAL;
}
/* eBPF calling convention is such that R0 is used
* to return the value from eBPF program.
* Make sure that it's readable at this time
* of bpf_exit, which means that program wrote
* something into it earlier
*/
err = check_reg_arg(regs, BPF_REG_0, SRC_OP);
if (err)
return err;
if (is_pointer_value(env, BPF_REG_0)) {
verbose("R0 leaks addr as return value\n");
return -EACCES;
}
process_bpf_exit:
insn_idx = pop_stack(env, &prev_insn_idx);
if (insn_idx < 0) {
break;
} else {
do_print_state = true;
continue;
}
} else {
err = check_cond_jmp_op(env, insn, &insn_idx);
if (err)
return err;
}
} else if (class == BPF_LD) {
u8 mode = BPF_MODE(insn->code);
if (mode == BPF_ABS || mode == BPF_IND) {
err = check_ld_abs(env, insn);
if (err)
return err;
} else if (mode == BPF_IMM) {
err = check_ld_imm(env, insn);
if (err)
return err;
insn_idx++;
env->insn_aux_data[insn_idx].seen = true;
} else {
verbose("invalid BPF_LD mode\n");
return -EINVAL;
}
reset_reg_range_values(regs, insn->dst_reg);
} else {
verbose("unknown insn class %d\n", class);
return -EINVAL;
}
insn_idx++;
}
verbose("processed %d insns\n", insn_processed);
return 0;
}
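The LDX and STX branches of the loop above remember, per instruction, which pointer type was used the first time that instruction was reached, and reject the program if a later path reaches the same instruction with a different type while either of the two is a context pointer. The bookkeeping can be sketched in user space roughly as follows. The enum is a simplified stand-in and `record_ptr_type()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the verifier's register types (assumption). */
enum reg_type { NOT_INIT = 0, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_MAP_VALUE };

/*
 * Hypothetical helper. 'seen' plays the role of
 * insn_aux_data[insn_idx].ptr_type; 'cur' is the pointer type on the
 * path being verified right now.
 * Returns 0 if the access is acceptable, -EINVAL otherwise.
 */
static int record_ptr_type(enum reg_type *seen, enum reg_type cur)
{
    assert(seen != NULL);

    if (*seen == NOT_INIT) {
        /* First path to reach this instruction: remember its type. */
        *seen = cur;
        return 0;
    }
    /* A mismatch is rejected only when a ctx pointer is involved,
     * mirroring the condition in the listing above. */
    if (cur != *seen && (cur == PTR_TO_CTX || *seen == PTR_TO_CTX))
        return -EINVAL;
    return 0;
}
```

With a slot first recorded as PTR_TO_STACK, a later PTR_TO_MAP_VALUE access is still accepted, while a later PTR_TO_CTX access is refused.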
-
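4、Fix up the calls to kernel helper functions in the BPF instructions, replacing each helper ID with a value from which the real function address can be recovered.
Matching condition: an instruction with (insn->code == (BPF_JMP | BPF_CALL)) is a helper function call.
Generic helper handling: look up the function pointer for the ID given in insn->imm, then store the offset between that function pointer and __bpf_call_base back into insn->imm.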
static int fixup_bpf_calls(struct bpf_verifier_env *env)
{
struct bpf_prog *prog = env->prog;
struct bpf_insn *insn = prog->insnsi;
const struct bpf_func_proto *fn;
const int insn_cnt = prog->len;
struct bpf_insn insn_buf[16];
struct bpf_prog *new_prog;
struct bpf_map *map_ptr;
int i, cnt, delta = 0;
/* (3.8.1) Walk every instruction */
for (i = 0; i < insn_cnt; i++, insn++) {
/* (3.8.2) Work around a JIT bug affecting 32-bit ALU div/mod instructions */
if (insn->code == (BPF_ALU | BPF_MOD | BPF_X) ||
insn->code == (BPF_ALU | BPF_DIV | BPF_X)) {
/* due to JIT bugs clear upper 32-bits of src register
* before div/mod operation
*/
insn_buf[0] = BPF_MOV32_REG(insn->src_reg, insn->src_reg);
insn_buf[1] = *insn;
cnt = 2;
new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
if (!new_prog)
return -ENOMEM;
delta += cnt - 1;
env->prog = prog = new_prog;
insn = new_prog->insnsi + i + delta;
continue;
}
/* (3.8.3) Only instructions with insn->code == (BPF_JMP | BPF_CALL),
 * i.e. helper function calls, are handled below
 */
if (insn->code != (BPF_JMP | BPF_CALL))
continue;
/* (3.8.3.1) Special handling for a few helper functions */
if (insn->imm == BPF_FUNC_get_route_realm)
prog->dst_needed = 1;
if (insn->imm == BPF_FUNC_get_prandom_u32)
bpf_user_rnd_init_once();
if (insn->imm == BPF_FUNC_tail_call) {
/* mark bpf_tail_call as different opcode to avoid
* conditional branch in the interpreter for every normal
* call and to prevent accidental JITing by JIT compiler
* that doesn't support bpf_tail_call yet
*/
insn->imm = 0;
insn->code |= BPF_X;
/* instead of changing every JIT dealing with tail_call
* emit two extra insns:
* if (index >= max_entries) goto out;
* index &= array->index_mask;
* to avoid out-of-bounds cpu speculation
*/
map_ptr = env->insn_aux_data[i + delta].map_ptr;
if (!map_ptr->unpriv_array)
continue;
insn_buf[0] = BPF_JMP_IMM(BPF_JGE, BPF_REG_3,
map_ptr->max_entries, 2);
insn_buf[1] = BPF_ALU32_IMM(BPF_AND, BPF_REG_3,
container_of(map_ptr,
struct bpf_array,
map)->index_mask);
insn_buf[2] = *insn;
cnt = 3;
new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
if (!new_prog)
return -ENOMEM;
delta += cnt - 1;
env->prog = prog = new_prog;
insn = new_prog->insnsi + i + delta;
continue;
}
/* (3.8.3.2) Generic helper handling: look up the helper for the ID in insn->imm (its prototype holds the function pointer) */
fn = prog->aux->ops->get_func_proto(insn->imm);
/* all functions that have prototype and verifier allowed
* programs to call them, must be real in-kernel functions
*/
if (!fn->func) {
verbose("kernel subsystem misconfigured func %d\n",
insn->imm);
return -EFAULT;
}
/* (3.8.3.3) Then store the offset between the function pointer and __bpf_call_base into insn->imm */
insn->imm = fn->func - __bpf_call_base;
}
return 0;
}
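The reason for storing an offset instead of a pointer is that insn->imm is only 32 bits wide, while a kernel function address is 64 bits. A minimal user-space sketch of the same idea follows. `call_base`, the two helpers and the table are hypothetical stand-ins, and the code relies on casting function pointers to integers, which common compilers on POSIX systems accept but ISO C does not guarantee.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*helper_fn)(uint64_t, uint64_t);

/* Stand-in for __bpf_call_base: only its address matters. */
static uint64_t call_base(uint64_t a, uint64_t b) { (void)a; (void)b; return 0; }

/* Two hypothetical helpers, indexed by helper ID. */
static uint64_t helper_add(uint64_t a, uint64_t b) { return a + b; }
static uint64_t helper_mul(uint64_t a, uint64_t b) { return a * b; }
static const helper_fn helper_table[] = { helper_add, helper_mul };

/* Load time: turn a helper ID into a 32-bit offset from call_base. */
static int32_t fixup_call(int32_t helper_id)
{
    intptr_t delta;

    assert(helper_id >= 0 && helper_id < 2);
    delta = (intptr_t)helper_table[helper_id] - (intptr_t)call_base;
    /* The scheme only works if the distance fits in 32 bits. */
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}

/* Run time: rebuild the address from base + offset and call it. */
static uint64_t do_call(int32_t imm, uint64_t a, uint64_t b)
{
    helper_fn fn = (helper_fn)((intptr_t)call_base + imm);

    return fn(a, b);
}
```

The interpreter and the JIT both use the run-time half of this pattern: they add insn->imm to __bpf_call_base to obtain the address to call.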
1.1.3、bpf JIT/kernel interpreter
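Once the verifier has accepted the program, the kernel uses the JIT (Just-In-Time) compiler to translate the BPF bytecode into native instructions. If no JIT is available, either because the architecture lacks one or because bpf_jit_enable is 0, the kernel falls back to an interpreter that emulates the program, which runs more slowly.
Several architectures (64-bit x86_64, arm64, ppc64, s390x, mips64, sparc64, and 32-bit arm) already have a BPF JIT. It translates BPF code into native code efficiently, at close to one native instruction per BPF instruction, because the eBPF instruction set was designed to resemble modern arm/x86 architectures and its calling convention is close to their ABIs.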
struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
{
#ifndef CONFIG_BPF_JIT_ALWAYS_ON
/* (4.1) Entry point used when the program runs in the interpreter because no JIT is available */
fp->bpf_func = (void *) __bpf_prog_run;
#else
fp->bpf_func = (void *) __bpf_prog_ret0;
#endif
/* eBPF JITs can rewrite the program in case constant
* blinding is active. However, in case of error during
* blinding, bpf_int_jit_compile() must always return a
* valid program, which in this case would simply not
* be JITed, but falls back to the interpreter.
*/
/* (4.2) Try to JIT-compile the BPF program */
fp = bpf_int_jit_compile(fp);
#ifdef CONFIG_BPF_JIT_ALWAYS_ON
if (!fp->jited) {
*err = -ENOTSUPP;
return fp;
}
#endif
bpf_prog_lock_ro(fp);
/* The tail call compatibility check can only be done at
* this late stage as we need to determine, if we deal
* with JITed or non JITed program concatenations and not
* all eBPF JITs might immediately support all features.
*/
/* (4.3) Run some checks on the BPF_MAP_TYPE_PROG_ARRAY maps used by tail calls */
*err = bpf_check_tail_call(fp);
return fp;
}
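The selection logic above boils down to "install the interpreter as the default entry point, let the JIT overwrite it if it can, and fail only when a JIT is mandatory". A minimal sketch under simplifying assumptions: all names here are hypothetical, and unlike the kernel, which installs __bpf_prog_ret0 as the default when CONFIG_BPF_JIT_ALWAYS_ON is set, the sketch always starts from the interpreter entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct prog;
typedef unsigned int (*run_fn)(const struct prog *p);

struct prog {
    run_fn run;   /* plays the role of fp->bpf_func */
    bool jited;   /* plays the role of fp->jited */
};

/* Dummy entry points that only identify which path was chosen. */
static unsigned int run_interpreted(const struct prog *p) { (void)p; return 1; }
static unsigned int run_native(const struct prog *p) { (void)p; return 2; }

/* Hypothetical JIT: succeeds only when 'jit_available' is true. */
static void try_jit(struct prog *p, bool jit_available)
{
    if (!jit_available)
        return;            /* leave the default entry in place */
    p->run = run_native;
    p->jited = true;
}

/* Returns 0 on success, -1 if a JIT was required but unavailable. */
static int select_runtime(struct prog *p, bool jit_available, bool jit_always_on)
{
    assert(p != NULL);

    p->run = run_interpreted;   /* default entry point */
    p->jited = false;
    try_jit(p, jit_available);
    if (jit_always_on && !p->jited)
        return -1;
    return 0;
}
```

Callers never need to know which path was taken; they simply invoke the entry pointer stored in the program.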
-
1、JIT
Taking the arm64 JIT as an example:
struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
{
struct bpf_prog *tmp, *orig_prog = prog;
struct bpf_binary_header *header;
bool tmp_blinded = false;
struct jit_ctx ctx;
int image_size;
u8 *image_ptr;
if (!bpf_jit_enable)
return orig_prog;
/* (4.2.1) Apply constant blinding */
tmp = bpf_jit_blind_constants(prog);
/* If blinding was requested and we failed during blinding,
* we must fall back to the interpreter.
*/
if (IS_ERR(tmp))
return orig_prog;
if (tmp != prog) {
tmp_blinded = true;
prog = tmp;
}
memset(&ctx, 0, sizeof(ctx));
ctx.prog = prog;
ctx.offset = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
if (ctx.offset == NULL) {
prog = orig_prog;
goto out;
}
/* 1. Initial fake pass to compute ctx->idx. */
/* (4.2.2) Run the JIT translation without storing the result; only record the total length of "prologue + body + epilogue" */
/* Fake pass to fill in ctx->offset. */
if (build_body(&ctx)) {
prog = orig_prog;
goto out_off;
}
if (build_prologue(&ctx)) {
prog = orig_prog;
goto out_off;
}
ctx.epilogue_offset = ctx.idx;
build_epilogue(&ctx);
/* Now we know the actual image size. */
/* (4.2.3) Allocate storage for the JITed instructions based on the computed length; it becomes ctx.image */
image_size = sizeof(u32) * ctx.idx;
header = bpf_jit_binary_alloc(image_size, &image_ptr,
sizeof(u32), jit_fill_hole);
if (header == NULL) {
prog = orig_prog;
goto out_off;
}
/* 2. Now, the actual pass. */
/* (4.2.4) Run the JIT translation again, this time storing the instructions into ctx.image */
ctx.image = (u32 *)image_ptr;
ctx.idx = 0;
/* (4.2.4.1) Build the head of the image, which sets up the stack for the BPF program; 8 instructions */
build_prologue(&ctx);
/* (4.2.4.2) Translate the BPF program into native instructions */
if (build_body(&ctx)) {
bpf_jit_binary_free(header);
prog = orig_prog;
goto out_off;
}
/* (4.2.4.3) Build the tail of the image, which does the cleanup; 7 instructions */
build_epilogue(&ctx);
/* 3. Extra pass to validate JITed code. */
/* (4.2.5) Make sure the generated code contains no AARCH64_BREAK_FAULT */
if (validate_code(&ctx)) {
bpf_jit_binary_free(header);
prog = orig_prog;
goto out_off;
}
/* And we're done. */
if (bpf_jit_enable > 1)
bpf_jit_dump(prog->len, image_size, 2, ctx.image);
/* (4.2.6) Flush the icache for the new image */
bpf_flush_icache(header, ctx.image + ctx.idx);
/* (4.2.7) Mark the pages holding the image read-only */
set_memory_ro((unsigned long)header, header->pages);
/* (4.2.8) Install the translated image as prog->bpf_func */
prog->bpf_func = (void *)ctx.image;
prog->jited = 1;
out_off:
kfree(ctx.offset);
out:
if (tmp_blinded)
bpf_jit_prog_release_other(prog, prog == orig_prog ?
tmp : orig_prog);
return prog;
}
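The core of the JIT translation has three parts: prologue + body + epilogue.
prologue: added instructions that set up the BPF stack and save the caller's context;
body: the translated BPF program itself;
epilogue: restores the saved context and cleans up after the BPF program finishes;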
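The function above runs prologue, body and epilogue twice: once with no output buffer, purely to learn how many instructions will be produced, and once more to write them into a buffer of exactly that size. The pattern can be sketched in user space as follows. The "translation" here is a made-up expansion of each input word into two output words, and `build_all()` and `jit_two_pass()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct jit_ctx {
    uint32_t *image;  /* NULL during the sizing pass */
    size_t idx;       /* next instruction slot */
};

/* Always count; store only when an image buffer exists. */
static void emit(uint32_t insn, struct jit_ctx *ctx)
{
    if (ctx->image != NULL)
        ctx->image[ctx->idx] = insn;
    ctx->idx++;
}

/* Hypothetical translator: each input word becomes two output words. */
static void build_all(const uint32_t *in, size_t n, struct jit_ctx *ctx)
{
    for (size_t i = 0; i < n; i++) {
        emit(0xA0000000u | in[i], ctx);
        emit(0xB0000000u | in[i], ctx);
    }
}

/* Returns a malloc'ed image (caller frees); its length goes to *out_len. */
static uint32_t *jit_two_pass(const uint32_t *in, size_t n, size_t *out_len)
{
    struct jit_ctx ctx = { NULL, 0 };

    assert(in != NULL && out_len != NULL);
    build_all(in, n, &ctx);              /* pass 1: sizing only */
    *out_len = ctx.idx;

    ctx.image = malloc(ctx.idx * sizeof(uint32_t));
    if (ctx.image == NULL)
        return NULL;

    ctx.idx = 0;
    build_all(in, n, &ctx);              /* pass 2: real emission */
    assert(ctx.idx == *out_len);         /* both passes must agree */
    return ctx.image;
}
```

Because the same code path is used for both passes, the size computed in the first pass cannot drift from what the second pass writes.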
-
1.1、prologue
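A64_: prefix for the native arm64 registers
BPF_: prefix for the BPF virtual machine registers
The sequence is quite neat:
First push A64_FP/A64_LR onto the stack at A64_SP, then copy the current A64_SP into A64_FP;
Next push the callee-saved registers r6, r7, r8, r9, fp, tcc onto the stack at A64_SP, then copy the current A64_SP into BPF_FP;
Subtract STACK_SIZE from A64_SP, which reserves the 512-byte BPF stack below BPF_FP (plus the small JIT scratchpad shown in the layout comment in the code);
The BPF program therefore uses the 512 bytes of stack starting at BPF_FP, while ordinary kernel functions called from it use the stack below A64_SP, so the two do not interfere;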
static int build_prologue(struct jit_ctx *ctx)
{
const u8 r6 = bpf2a64[BPF_REG_6];
const u8 r7 = bpf2a64[BPF_REG_7];
const u8 r8 = bpf2a64[BPF_REG_8];
const u8 r9 = bpf2a64[BPF_REG_9];
const u8 fp = bpf2a64[BPF_REG_FP];
const u8 tcc = bpf2a64[TCALL_CNT];
const int idx0 = ctx->idx;
int cur_offset;
/*
* BPF prog stack layout
*
* high
* original A64_SP => 0:+-----+ BPF prologue
* |FP/LR|
* current A64_FP => -16:+-----+
* | ... | callee saved registers
* BPF fp register => -64:+-----+ <= (BPF_FP)
* | |
* | ... | BPF prog stack
* | |
* +-----+ <= (BPF_FP - MAX_BPF_STACK)
* |RSVD | JIT scratchpad
* current A64_SP => +-----+ <= (BPF_FP - STACK_SIZE)
* | |
* | ... | Function call stack
* | |
* +-----+
* low
*
*/
/* Save FP and LR registers to stay align with ARM64 AAPCS */
emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
emit(A64_MOV(1, A64_FP, A64_SP), ctx);
/* Save callee-saved registers */
emit(A64_PUSH(r6, r7, A64_SP), ctx);
emit(A64_PUSH(r8, r9, A64_SP), ctx);
emit(A64_PUSH(fp, tcc, A64_SP), ctx);
/* Set up BPF prog stack base register */
emit(A64_MOV(1, fp, A64_SP), ctx);
/* Initialize tail_call_cnt */
emit(A64_MOVZ(1, tcc, 0, 0), ctx);
/* Set up function call stack */
emit(A64_SUB_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);
cur_offset = ctx->idx - idx0;
if (cur_offset != PROLOGUE_OFFSET) {
pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
cur_offset, PROLOGUE_OFFSET);
return -1;
}
return 0;
}
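The offsets in the layout comment (-16 for A64_FP, -64 for BPF_FP) follow directly from the push sequence: every A64_PUSH stores a pair of 8-byte registers. A small sketch that replays the stack-pointer arithmetic, with `prologue_layout()` and `struct frame_layout` as hypothetical names and the stack size passed in as a parameter because its exact value is not shown in the listing:

```c
#include <assert.h>

struct frame_layout {
    long a64_fp;   /* offset of A64_FP from the original A64_SP */
    long bpf_fp;   /* offset of BPF_FP from the original A64_SP */
    long a64_sp;   /* offset of the final A64_SP */
};

/* Each A64_PUSH stores a register pair: 2 * 8 bytes. */
enum { PAIR_BYTES = 16 };

static struct frame_layout prologue_layout(long stack_size)
{
    struct frame_layout l;
    long sp = 0;                 /* original A64_SP */

    /* arm64 keeps SP 16-byte aligned. */
    assert(stack_size >= 0 && stack_size % 16 == 0);

    sp -= PAIR_BYTES;            /* push FP, LR */
    l.a64_fp = sp;               /* A64_FP = A64_SP */
    sp -= 3 * PAIR_BYTES;        /* push r6/r7, r8/r9, fp/tcc */
    l.bpf_fp = sp;               /* BPF_FP = A64_SP */
    sp -= stack_size;            /* reserve BPF stack + scratchpad */
    l.a64_sp = sp;
    return l;
}
```

Calling `prologue_layout(528)`, where 528 is an assumed STACK_SIZE (512 bytes of BPF stack plus a scratchpad, rounded up to 16), gives -16, -64 and -592 for the three offsets.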
-
1.2、body
Translate the BPF instructions into native arm64 instructions:
static int build_body(struct jit_ctx *ctx)
{
const struct bpf_prog *prog = ctx->prog;
int i;
for (i = 0; i < prog->len; i++) {
const struct bpf_insn *insn = &prog->insnsi[i];
int ret;
ret = build_insn(insn, ctx);
if (ret > 0) {
i++;
if (ctx->image == NULL)
ctx->offset[i] = ctx->idx;
continue;
}
if (ctx->image == NULL)
ctx->offset[i] = ctx->idx;
if (ret)
return ret;
}
return 0;
}
↓
/* JITs an eBPF instruction.
* Returns:
* 0 - successfully JITed an 8-byte eBPF instruction.
* >0 - successfully JITed a 16-byte eBPF instruction.
* <0 - failed to JIT.
*/
static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
{
const u8 code = insn->code;
const u8 dst = bpf2a64[insn->dst_reg];
const u8 src = bpf2a64[insn->src_reg];
const u8 tmp = bpf2a64[TMP_REG_1];
const u8 tmp2 = bpf2a64[TMP_REG_2];
const s16 off = insn->off;
const s32 imm = insn->imm;
const int i = insn - ctx->prog->insnsi;
const bool is64 = BPF_CLASS(code) == BPF_ALU64;
u8 jmp_cond;
s32 jmp_offset;
#define check_imm(bits, imm) do { \
if ((((imm) > 0) && ((imm) >> (bits))) || \
(((imm) < 0) && (~(imm) >> (bits)))) { \
pr_info("[%2d] imm=%d(0x%x) out of range\n", \
i, imm, imm); \
return -EINVAL; \
} \
} while (0)
#define check_imm19(imm) check_imm(19, imm)
#define check_imm26(imm) check_imm(26, imm)
switch (code) {
/* dst = src */
case BPF_ALU | BPF_MOV | BPF_X:
case BPF_ALU64 | BPF_MOV | BPF_X:
emit(A64_MOV(is64, dst, src), ctx);
break;
/* dst = dst OP src */
case BPF_ALU | BPF_ADD | BPF_X:
case BPF_ALU64 | BPF_ADD | BPF_X:
emit(A64_ADD(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_SUB | BPF_X:
case BPF_ALU64 | BPF_SUB | BPF_X:
emit(A64_SUB(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_AND | BPF_X:
case BPF_ALU64 | BPF_AND | BPF_X:
emit(A64_AND(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_OR | BPF_X:
case BPF_ALU64 | BPF_OR | BPF_X:
emit(A64_ORR(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_XOR | BPF_X:
case BPF_ALU64 | BPF_XOR | BPF_X:
emit(A64_EOR(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_MUL | BPF_X:
case BPF_ALU64 | BPF_MUL | BPF_X:
emit(A64_MUL(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_DIV | BPF_X:
case BPF_ALU64 | BPF_DIV | BPF_X:
case BPF_ALU | BPF_MOD | BPF_X:
case BPF_ALU64 | BPF_MOD | BPF_X:
{
const u8 r0 = bpf2a64[BPF_REG_0];
/* if (src == 0) return 0 */
jmp_offset = 3; /* skip ahead to else path */
check_imm19(jmp_offset);
emit(A64_CBNZ(is64, src, jmp_offset), ctx);
emit(A64_MOVZ(1, r0, 0, 0), ctx);
jmp_offset = epilogue_offset(ctx);
check_imm26(jmp_offset);
emit(A64_B(jmp_offset), ctx);
/* else */
switch (BPF_OP(code)) {
case BPF_DIV:
emit(A64_UDIV(is64, dst, dst, src), ctx);
break;
case BPF_MOD:
emit(A64_UDIV(is64, tmp, dst, src), ctx);
emit(A64_MUL(is64, tmp, tmp, src), ctx);
emit(A64_SUB(is64, dst, dst, tmp), ctx);
break;
}
break;
}
case BPF_ALU | BPF_LSH | BPF_X:
case BPF_ALU64 | BPF_LSH | BPF_X:
emit(A64_LSLV(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_RSH | BPF_X:
case BPF_ALU64 | BPF_RSH | BPF_X:
emit(A64_LSRV(is64, dst, dst, src), ctx);
break;
case BPF_ALU | BPF_ARSH | BPF_X:
case BPF_ALU64 | BPF_ARSH | BPF_X:
emit(A64_ASRV(is64, dst, dst, src), ctx);
break;
/* dst = -dst */
case BPF_ALU | BPF_NEG:
case BPF_ALU64 | BPF_NEG:
emit(A64_NEG(is64, dst, dst), ctx);
break;
/* dst = BSWAP##imm(dst) */
case BPF_ALU | BPF_END | BPF_FROM_LE:
case BPF_ALU | BPF_END | BPF_FROM_BE:
#ifdef CONFIG_CPU_BIG_ENDIAN
if (BPF_SRC(code) == BPF_FROM_BE)
goto emit_bswap_uxt;
#else /* !CONFIG_CPU_BIG_ENDIAN */
if (BPF_SRC(code) == BPF_FROM_LE)
goto emit_bswap_uxt;
#endif
switch (imm) {
case 16:
emit(A64_REV16(is64, dst, dst), ctx);
/* zero-extend 16 bits into 64 bits */
emit(A64_UXTH(is64, dst, dst), ctx);
break;
case 32:
emit(A64_REV32(is64, dst, dst), ctx);
/* upper 32 bits already cleared */
break;
case 64:
emit(A64_REV64(dst, dst), ctx);
break;
}
break;
emit_bswap_uxt:
switch (imm) {
case 16:
/* zero-extend 16 bits into 64 bits */
emit(A64_UXTH(is64, dst, dst), ctx);
break;
case 32:
/* zero-extend 32 bits into 64 bits */
emit(A64_UXTW(is64, dst, dst), ctx);
break;
case 64:
/* nop */
break;
}
break;
/* dst = imm */
case BPF_ALU | BPF_MOV | BPF_K:
case BPF_ALU64 | BPF_MOV | BPF_K:
emit_a64_mov_i(is64, dst, imm, ctx);
break;
/* dst = dst OP imm */
case BPF_ALU | BPF_ADD | BPF_K:
case BPF_ALU64 | BPF_ADD | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_ADD(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_SUB | BPF_K:
case BPF_ALU64 | BPF_SUB | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_SUB(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_AND | BPF_K:
case BPF_ALU64 | BPF_AND | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_AND(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_OR | BPF_K:
case BPF_ALU64 | BPF_OR | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_ORR(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_XOR | BPF_K:
case BPF_ALU64 | BPF_XOR | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_EOR(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_MUL | BPF_K:
case BPF_ALU64 | BPF_MUL | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_MUL(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_DIV | BPF_K:
case BPF_ALU64 | BPF_DIV | BPF_K:
emit_a64_mov_i(is64, tmp, imm, ctx);
emit(A64_UDIV(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_MOD | BPF_K:
case BPF_ALU64 | BPF_MOD | BPF_K:
emit_a64_mov_i(is64, tmp2, imm, ctx);
emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
emit(A64_MUL(is64, tmp, tmp, tmp2), ctx);
emit(A64_SUB(is64, dst, dst, tmp), ctx);
break;
case BPF_ALU | BPF_LSH | BPF_K:
case BPF_ALU64 | BPF_LSH | BPF_K:
emit(A64_LSL(is64, dst, dst, imm), ctx);
break;
case BPF_ALU | BPF_RSH | BPF_K:
case BPF_ALU64 | BPF_RSH | BPF_K:
emit(A64_LSR(is64, dst, dst, imm), ctx);
break;
case BPF_ALU | BPF_ARSH | BPF_K:
case BPF_ALU64 | BPF_ARSH | BPF_K:
emit(A64_ASR(is64, dst, dst, imm), ctx);
break;
/* JUMP off */
case BPF_JMP | BPF_JA:
jmp_offset = bpf2a64_offset(i + off, i, ctx);
check_imm26(jmp_offset);
emit(A64_B(jmp_offset), ctx);
break;
/* IF (dst COND src) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_X:
case BPF_JMP | BPF_JGT | BPF_X:
case BPF_JMP | BPF_JGE | BPF_X:
case BPF_JMP | BPF_JNE | BPF_X:
case BPF_JMP | BPF_JSGT | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_X:
emit(A64_CMP(1, dst, src), ctx);
emit_cond_jmp:
jmp_offset = bpf2a64_offset(i + off, i, ctx);
check_imm19(jmp_offset);
switch (BPF_OP(code)) {
case BPF_JEQ:
jmp_cond = A64_COND_EQ;
break;
case BPF_JGT:
jmp_cond = A64_COND_HI;
break;
case BPF_JGE:
jmp_cond = A64_COND_CS;
break;
case BPF_JSET:
case BPF_JNE:
jmp_cond = A64_COND_NE;
break;
case BPF_JSGT:
jmp_cond = A64_COND_GT;
break;
case BPF_JSGE:
jmp_cond = A64_COND_GE;
break;
default:
return -EFAULT;
}
emit(A64_B_(jmp_cond, jmp_offset), ctx);
break;
case BPF_JMP | BPF_JSET | BPF_X:
emit(A64_TST(1, dst, src), ctx);
goto emit_cond_jmp;
/* IF (dst COND imm) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
case BPF_JMP | BPF_JNE | BPF_K:
case BPF_JMP | BPF_JSGT | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_K:
emit_a64_mov_i(1, tmp, imm, ctx);
emit(A64_CMP(1, dst, tmp), ctx);
goto emit_cond_jmp;
case BPF_JMP | BPF_JSET | BPF_K:
emit_a64_mov_i(1, tmp, imm, ctx);
emit(A64_TST(1, dst, tmp), ctx);
goto emit_cond_jmp;
/* function call */
case BPF_JMP | BPF_CALL:
{
const u8 r0 = bpf2a64[BPF_REG_0];
const u64 func = (u64)__bpf_call_base + imm;
emit_a64_mov_i64(tmp, func, ctx);
emit(A64_BLR(tmp), ctx);
emit(A64_MOV(1, r0, A64_R(0)), ctx);
break;
}
/* tail call */
case BPF_JMP | BPF_CALL | BPF_X:
if (emit_bpf_tail_call(ctx))
return -EFAULT;
break;
/* function return */
case BPF_JMP | BPF_EXIT:
/* Optimization: when last instruction is EXIT,
simply fallthrough to epilogue. */
if (i == ctx->prog->len - 1)
break;
jmp_offset = epilogue_offset(ctx);
check_imm26(jmp_offset);
emit(A64_B(jmp_offset), ctx);
break;
/* dst = imm64 */
case BPF_LD | BPF_IMM | BPF_DW:
{
const struct bpf_insn insn1 = insn[1];
u64 imm64;
if (insn1.code != 0 || insn1.src_reg != 0 ||
insn1.dst_reg != 0 || insn1.off != 0) {
/* Note: verifier in BPF core must catch invalid
* instructions.
*/
pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
return -EINVAL;
}
imm64 = (u64)insn1.imm << 32 | (u32)imm;
emit_a64_mov_i64(dst, imm64, ctx);
return 1;
}
/* LDX: dst = *(size *)(src + off) */
case BPF_LDX | BPF_MEM | BPF_W:
case BPF_LDX | BPF_MEM | BPF_H:
case BPF_LDX | BPF_MEM | BPF_B:
case BPF_LDX | BPF_MEM | BPF_DW:
emit_a64_mov_i(1, tmp, off, ctx);
switch (BPF_SIZE(code)) {
case BPF_W:
emit(A64_LDR32(dst, src, tmp), ctx);
break;
case BPF_H:
emit(A64_LDRH(dst, src, tmp), ctx);
break;
case BPF_B:
emit(A64_LDRB(dst, src, tmp), ctx);
break;
case BPF_DW:
emit(A64_LDR64(dst, src, tmp), ctx);
break;
}
break;
/* ST: *(size *)(dst + off) = imm */
case BPF_ST | BPF_MEM | BPF_W:
case BPF_ST | BPF_MEM | BPF_H:
case BPF_ST | BPF_MEM | BPF_B:
case BPF_ST | BPF_MEM | BPF_DW:
/* Load imm to a register then store it */
emit_a64_mov_i(1, tmp2, off, ctx);
emit_a64_mov_i(1, tmp, imm, ctx);
switch (BPF_SIZE(code)) {
case BPF_W:
emit(A64_STR32(tmp, dst, tmp2), ctx);
break;
case BPF_H:
emit(A64_STRH(tmp, dst, tmp2), ctx);
break;
case BPF_B:
emit(A64_STRB(tmp, dst, tmp2), ctx);
break;
case BPF_DW:
emit(A64_STR64(tmp, dst, tmp2), ctx);
break;
}
break;
/* STX: *(size *)(dst + off) = src */
case BPF_STX | BPF_MEM | BPF_W:
case BPF_STX | BPF_MEM | BPF_H:
case BPF_STX | BPF_MEM | BPF_B:
case BPF_STX | BPF_MEM | BPF_DW:
emit_a64_mov_i(1, tmp, off, ctx);
switch (BPF_SIZE(code)) {
case BPF_W:
emit(A64_STR32(src, dst, tmp), ctx);
break;
case BPF_H:
emit(A64_STRH(src, dst, tmp), ctx);
break;
case BPF_B:
emit(A64_STRB(src, dst, tmp), ctx);
break;
case BPF_DW:
emit(A64_STR64(src, dst, tmp), ctx);
break;
}
break;
/* STX XADD: lock *(u32 *)(dst + off) += src */
case BPF_STX | BPF_XADD | BPF_W:
/* STX XADD: lock *(u64 *)(dst + off) += src */
case BPF_STX | BPF_XADD | BPF_DW:
goto notyet;
/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
case BPF_LD | BPF_ABS | BPF_W:
case BPF_LD | BPF_ABS | BPF_H:
case BPF_LD | BPF_ABS | BPF_B:
/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
case BPF_LD | BPF_IND | BPF_W:
case BPF_LD | BPF_IND | BPF_H:
case BPF_LD | BPF_IND | BPF_B:
{
const u8 r0 = bpf2a64[BPF_REG_0]; /* r0 = return value */
const u8 r6 = bpf2a64[BPF_REG_6]; /* r6 = pointer to sk_buff */
const u8 fp = bpf2a64[BPF_REG_FP];
const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct sk_buff *skb */
const u8 r2 = bpf2a64[BPF_REG_2]; /* r2: int k */
const u8 r3 = bpf2a64[BPF_REG_3]; /* r3: unsigned int size */
const u8 r4 = bpf2a64[BPF_REG_4]; /* r4: void *buffer */
const u8 r5 = bpf2a64[BPF_REG_5]; /* r5: void *(*func)(...) */
int size;
emit(A64_MOV(1, r1, r6), ctx);
emit_a64_mov_i(0, r2, imm, ctx);
if (BPF_MODE(code) == BPF_IND)
emit(A64_ADD(0, r2, r2, src), ctx);
switch (BPF_SIZE(code)) {
case BPF_W:
size = 4;
break;
case BPF_H:
size = 2;
break;
case BPF_B:
size = 1;
break;
default:
return -EINVAL;
}
emit_a64_mov_i64(r3, size, ctx);
emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx);
emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx);
emit(A64_BLR(r5), ctx);
emit(A64_MOV(1, r0, A64_R(0)), ctx);
jmp_offset = epilogue_offset(ctx);
check_imm19(jmp_offset);
emit(A64_CBZ(1, r0, jmp_offset), ctx);
emit(A64_MOV(1, r5, r0), ctx);
switch (BPF_SIZE(code)) {
case BPF_W:
emit(A64_LDR32(r0, r5, A64_ZR), ctx);
#ifndef CONFIG_CPU_BIG_ENDIAN
emit(A64_REV32(0, r0, r0), ctx);
#endif
break;
case BPF_H:
emit(A64_LDRH(r0, r5, A64_ZR), ctx);
#ifndef CONFIG_CPU_BIG_ENDIAN
emit(A64_REV16(0, r0, r0), ctx);
#endif
break;
case BPF_B:
emit(A64_LDRB(r0, r5, A64_ZR), ctx);
break;
}
break;
}
notyet:
pr_info_once("*** NOT YET: opcode %02x ***\n", code);
return -EFAULT;
default:
pr_err_once("unknown opcode %02x\n", code);
return -EINVAL;
}
return 0;
}
↓
static inline void emit(const u32 insn, struct jit_ctx *ctx)
{
if (ctx->image != NULL)
ctx->image[ctx->idx] = cpu_to_le32(insn);
ctx->idx++;
}
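Most cases in build_insn map one BPF instruction to one arm64 instruction, but BPF_MOD is a useful counterexample: arm64 has no integer remainder instruction, so the JIT emits UDIV, MUL and SUB, guarded by a zero check that makes the whole program return 0. The arithmetic can be sketched in plain C as below. `bpf_mod64()` is a hypothetical name, and the early return only signals the zero case instead of jumping to an epilogue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Emulates the sequence emitted for a 64-bit BPF_MOD | BPF_X.
 * Returns false when src is zero (the JITed code would set R0 to 0
 * and branch to the epilogue); *dst is left untouched in that case.
 */
static bool bpf_mod64(uint64_t *dst, uint64_t src)
{
    uint64_t tmp;

    assert(dst != NULL);
    if (src == 0)
        return false;
    tmp = *dst / src;      /* UDIV tmp, dst, src */
    tmp = tmp * src;       /* MUL  tmp, tmp, src */
    *dst = *dst - tmp;     /* SUB  dst, dst, tmp */
    return true;
}
```

For dst = 17 and src = 5 the three steps produce 3, 15 and finally 2, the expected remainder.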
-
1.3、epilogue
Does the reverse of the prologue: restores the saved registers and tears down the stack:
static void build_epilogue(struct jit_ctx *ctx)
{
const u8 r0 = bpf2a64[BPF_REG_0];
const u8 r6 = bpf2a64[BPF_REG_6];
const u8 r7 = bpf2a64[BPF_REG_7];
const u8 r8 = bpf2a64[BPF_REG_8];
const u8 r9 = bpf2a64[BPF_REG_9];
const u8 fp = bpf2a64[BPF_REG_FP];
/* We're done with BPF stack */
emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);
/* Restore fs (x25) and x26 */
emit(A64_POP(fp, A64_R(26), A64_SP), ctx);
/* Restore callee-saved register */
emit(A64_POP(r8, r9, A64_SP), ctx);
emit(A64_POP(r6, r7, A64_SP), ctx);
/* Restore FP/LR registers */
emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
/* Set return value */
emit(A64_MOV(1, A64_R(0), r0), ctx);
emit(A64_RET(A64_LR), ctx);
}
-
2、interpreter
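When the JIT is not available, the kernel falls back to an interpreter that decodes the BPF instructions in prog->insnsi[] and emulates their execution:
It uses the local variable "u64 stack[MAX_BPF_STACK / sizeof(u64)]" to emulate the BPF stack;
It uses the local variable "u64 regs[MAX_BPF_REG]" to emulate the BPF registers;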
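Before reading the kernel function, it may help to see the same structure at toy scale: a local array for the stack, a local array for the registers, a frame pointer set to the top of the stack, and a dispatch loop. The sketch below is an illustration only. Its five opcodes and the `struct insn` layout are invented, it dispatches with a portable switch where the kernel uses a GCC computed-goto jump table, and it performs no validation, on the assumption that the program is well formed (in the kernel the verifier guarantees this).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcodes, not the real BPF encoding. */
enum { OP_MOV_K, OP_ADD_X, OP_STX_DW, OP_LDX_DW, OP_EXIT };
enum { NREGS = 11, FP = 10, STACK_BYTES = 512 };

struct insn {
    uint8_t code, dst, src;
    int16_t off;
    int32_t imm;
};

static uint64_t run(const struct insn *insn)
{
    uint64_t stack[STACK_BYTES / sizeof(uint64_t)];
    uint64_t regs[NREGS] = { 0 };

    assert(insn != NULL);
    /* The frame pointer starts at the top; the stack grows down. */
    regs[FP] = (uint64_t)(uintptr_t)&stack[STACK_BYTES / sizeof(uint64_t)];

    for (;; insn++) {
        switch (insn->code) {
        case OP_MOV_K:   /* dst = imm (sign-extended) */
            regs[insn->dst] = (uint64_t)(int64_t)insn->imm;
            break;
        case OP_ADD_X:   /* dst += src */
            regs[insn->dst] += regs[insn->src];
            break;
        case OP_STX_DW:  /* *(u64 *)(dst + off) = src */
            *(uint64_t *)(uintptr_t)(regs[insn->dst] + insn->off) =
                regs[insn->src];
            break;
        case OP_LDX_DW:  /* dst = *(u64 *)(src + off) */
            regs[insn->dst] =
                *(uint64_t *)(uintptr_t)(regs[insn->src] + insn->off);
            break;
        case OP_EXIT:    /* return value lives in register 0 */
            return regs[0];
        default:         /* unknown opcode: stop */
            return 0;
        }
    }
}
```

A six-instruction program that stores 40 at FP-8, loads it back into register 0, adds 2 and exits returns 42 from `run()`.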
/**
* __bpf_prog_run - run eBPF program on a given context
* @ctx: is the data we are operating on
* @insn: is the array of eBPF instructions
*
* Decode and execute eBPF instructions.
*/
static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
{
u64 stack[MAX_BPF_STACK / sizeof(u64)];
u64 regs[MAX_BPF_REG], tmp;
static const void *jumptable[256] = {
[0 ... 255] = &&default_label,
/* Now overwrite non-defaults ... */
/* 32 bit ALU operations */
[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
[BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
[BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
[BPF_ALU | BPF_NEG] = &&ALU_NEG,
[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
/* 64 bit ALU operations */
[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
/* Call instruction */
[BPF_JMP | BPF_CALL] = &&JMP_CALL,
[BPF_JMP | BPF_CALL | BPF_X] = &&JMP_TAIL_CALL,
/* Jumps */
[BPF_JMP | BPF_JA] = &&JMP_JA,
[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
/* Program return */
[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
/* Store instructions */
[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
/* Load instructions */
[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
[BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
};
u32 tail_call_cnt = 0;
void *ptr;
int off;
#define CONT ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })
FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
ARG1 = (u64) (unsigned long) ctx;
select_insn:
goto *jumptable[insn->code];
/* ALU */
#define ALU(OPCODE, OP) \
ALU64_##OPCODE##_X: \
DST = DST OP SRC; \
CONT; \
ALU_##OPCODE##_X: \
DST = (u32) DST OP (u32) SRC; \
CONT; \
ALU64_##OPCODE##_K: \
DST = DST OP IMM; \
CONT; \
ALU_##OPCODE##_K: \
DST = (u32) DST OP (u32) IMM; \
CONT;
ALU(ADD, +)
ALU(SUB, -)
ALU(AND, &)
ALU(OR, |)
ALU(LSH, <<)
ALU(RSH, >>)
ALU(XOR, ^)
ALU(MUL, *)
#undef ALU
ALU_NEG:
DST = (u32) -DST;
CONT;
ALU64_NEG:
DST = -DST;
CONT;
ALU_MOV_X:
DST = (u32) SRC;
CONT;
ALU_MOV_K:
DST = (u32) IMM;
CONT;
ALU64_MOV_X:
DST = SRC;
CONT;
ALU64_MOV_K:
DST = IMM;
CONT;
LD_IMM_DW:
DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
insn++;
CONT;
ALU64_ARSH_X:
(*(s64 *) &DST) >>= SRC;
CONT;
ALU64_ARSH_K:
(*(s64 *) &DST) >>= IMM;
CONT;
ALU64_MOD_X:
if (unlikely(SRC == 0))
return 0;
div64_u64_rem(DST, SRC, &tmp);
DST = tmp;
CONT;
ALU_MOD_X:
if (unlikely((u32)SRC == 0))
return 0;
tmp = (u32) DST;
DST = do_div(tmp, (u32) SRC);
CONT;
ALU64_MOD_K:
div64_u64_rem(DST, IMM, &tmp);
DST = tmp;
CONT;
ALU_MOD_K:
tmp = (u32) DST;
DST = do_div(tmp, (u32) IMM);
CONT;
ALU64_DIV_X:
if (unlikely(SRC == 0))
return 0;
DST = div64_u64(DST, SRC);
CONT;
ALU_DIV_X:
if (unlikely((u32)SRC == 0))
return 0;
tmp = (u32) DST;
do_div(tmp, (u32) SRC);
DST = (u32) tmp;
CONT;
ALU64_DIV_K:
DST = div64_u64(DST, IMM);
CONT;
ALU_DIV_K:
tmp = (u32) DST;
do_div(tmp, (u32) IMM);
DST = (u32) tmp;
CONT;
ALU_END_TO_BE:
switch (IMM) {
case 16:
DST = (__force u16) cpu_to_be16(DST);
break;
case 32:
DST = (__force u32) cpu_to_be32(DST);
break;
case 64:
DST = (__force u64) cpu_to_be64(DST);
break;
}
CONT;
ALU_END_TO_LE:
switch (IMM) {
case 16:
DST = (__force u16) cpu_to_le16(DST);
break;
case 32:
DST = (__force u32) cpu_to_le32(DST);
break;
case 64:
DST = (__force u64) cpu_to_le64(DST);
break;
}
CONT;
/* CALL */
JMP_CALL:
/* Function call scratches BPF_R1-BPF_R5 registers,
* preserves BPF_R6-BPF_R9, and stores return value
* into BPF_R0.
*/
BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
BPF_R4, BPF_R5);
CONT;
JMP_TAIL_CALL: {
struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;
struct bpf_array *array = container_of(map, struct bpf_array, map);
struct bpf_prog *prog;
u32 index = BPF_R3;
if (unlikely(index >= array->map.max_entries))
goto out;
if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
goto out;
tail_call_cnt++;
prog = READ_ONCE(array->ptrs[index]);
if (!prog)
goto out;
/* ARG1 at this point is guaranteed to point to CTX from
* the verifier side due to the fact that the tail call is
* handled like a helper, that is, bpf_tail_call_proto,
* where arg1_type is ARG_PTR_TO_CTX.
*/
insn = prog->insnsi;
goto select_insn;
out:
CONT;
}
/* JMP */
JMP_JA:
insn += insn->off;
CONT;
JMP_JEQ_X:
if (DST == SRC) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JEQ_K:
if (DST == IMM) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JNE_X:
if (DST != SRC) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JNE_K:
if (DST != IMM) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JGT_X:
if (DST > SRC) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JGT_K:
if (DST > IMM) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JGE_X:
if (DST >= SRC) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JGE_K:
if (DST >= IMM) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSGT_X:
if (((s64) DST) > ((s64) SRC)) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSGT_K:
if (((s64) DST) > ((s64) IMM)) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSGE_X:
if (((s64) DST) >= ((s64) SRC)) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSGE_K:
if (((s64) DST) >= ((s64) IMM)) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSET_X:
if (DST & SRC) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_JSET_K:
if (DST & IMM) {
insn += insn->off;
CONT_JMP;
}
CONT;
JMP_EXIT:
return BPF_R0;
/* STX and ST and LDX*/
#define LDST(SIZEOP, SIZE) \
STX_MEM_##SIZEOP: \
*(SIZE *)(unsigned long) (DST + insn->off) = SRC; \
CONT; \
ST_MEM_##SIZEOP: \
*(SIZE *)(unsigned long) (DST + insn->off) = IMM; \
CONT; \
LDX_MEM_##SIZEOP: \
DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
CONT;
LDST(B, u8)
LDST(H, u16)
LDST(W, u32)
LDST(DW, u64)
#undef LDST
STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
atomic_add((u32) SRC, (atomic_t *)(unsigned long)
(DST + insn->off));
CONT;
STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
(DST + insn->off));
CONT;
LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
off = IMM;
load_word:
/* BPF_LD + BPF_ABS and BPF_LD + BPF_IND insns are
* only appearing in the programs where ctx ==
* skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
* == BPF_R6, bpf_convert_filter() saves it in BPF_R6,
* internal BPF verifier will check that BPF_R6 ==
* ctx.
*
* BPF_ABS and BPF_IND are wrappers of function calls,
* so they scratch BPF_R1-BPF_R5 registers, preserve
* BPF_R6-BPF_R9, and store return value into BPF_R0.
*
* Implicit input:
* ctx == skb == BPF_R6 == CTX
*
* Explicit input:
* SRC == any register
* IMM == 32-bit immediate
*
* Output:
* BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
*/
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
if (likely(ptr != NULL)) {
BPF_R0 = get_unaligned_be32(ptr);
CONT;
}
return 0;
LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
off = IMM;
load_half:
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
if (likely(ptr != NULL)) {
BPF_R0 = get_unaligned_be16(ptr);
CONT;
}
return 0;
LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
off = IMM;
load_byte:
ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
if (likely(ptr != NULL)) {
BPF_R0 = *(u8 *)ptr;
CONT;
}
return 0;
LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
off = IMM + SRC;
goto load_word;
LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
off = IMM + SRC;
goto load_half;
LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
off = IMM + SRC;
goto load_byte;
default_label:
/* If we ever reach this, we have a bug somewhere. */
WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
return 0;
}
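The macro-driven dispatch loop above is easier to follow in miniature. The following is a minimal user-space sketch, not the kernel's code: the instruction layout, the five opcodes and the names `mini_insn` and `mini_run` are all made up for illustration. It keeps three properties of the real interpreter: r0 carries the return value, a taken jump adds `off` relative to the next instruction, and a modulo by zero aborts the program with 0.

```c
#include <stdint.h>

/* Hypothetical opcodes for illustration only; not the real BPF encoding. */
enum mini_op { MINI_MOV_K, MINI_ADD_X, MINI_MOD_X, MINI_JEQ_K, MINI_EXIT };

struct mini_insn {
	enum mini_op op;
	uint8_t dst;	/* destination register index, must be < 4 */
	uint8_t src;	/* source register index, must be < 4 */
	int16_t off;	/* jump offset, relative to the next instruction */
	int32_t imm;	/* immediate operand */
};

/*
 * Run a program until MINI_EXIT and return r0.
 * The sketch trusts its input, much as the kernel interpreter relies on
 * the verifier having checked the program beforehand.
 */
uint64_t mini_run(const struct mini_insn *insn)
{
	uint64_t regs[4] = { 0 };

	for (;; insn++) {
		switch (insn->op) {
		case MINI_MOV_K:
			/* The 32-bit immediate is sign-extended to 64 bits. */
			regs[insn->dst] = (uint64_t)(int64_t)insn->imm;
			break;
		case MINI_ADD_X:
			regs[insn->dst] += regs[insn->src];
			break;
		case MINI_MOD_X:
			/* Like ALU64_MOD_X: a zero divisor aborts with 0. */
			if (regs[insn->src] == 0)
				return 0;
			regs[insn->dst] %= regs[insn->src];
			break;
		case MINI_JEQ_K:
			/* A taken branch adds off; the loop's insn++ then
			 * steps past the jump instruction itself. */
			if (regs[insn->dst] == (uint64_t)(int64_t)insn->imm)
				insn += insn->off;
			break;
		case MINI_EXIT:
			return regs[0];
		}
	}
}
```

A program that loads 7 and 5, adds them, and skips one instruction when the sum equals 12 returns 12 from `mini_run()`. The kernel version uses a computed-goto jump table instead of a `switch`, but the control flow is the same.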
3、BPF_PROG_RUN()
Whether the program has been translated into a JIT image or is executed by the interpreter, it is ultimately invoked through the BPF_PROG_RUN() macro:
ret = BPF_PROG_RUN(prog, ctx);
↓
#define BPF_PROG_RUN(filter, ctx) (*filter->bpf_func)(ctx, filter->insnsi)
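The point of the macro is that callers never need to know which back end runs the program: they simply call through a function pointer. A minimal user-space sketch of that pattern follows; `demo_prog`, `demo_insn`, `DEMO_PROG_RUN` and the two stand-in entry points are hypothetical names, not kernel definitions.

```c
#include <stddef.h>

struct demo_insn { int imm; };	/* hypothetical stand-in for struct bpf_insn */

struct demo_prog {
	/* Entry point: either the interpreter or a JIT-compiled image. */
	unsigned int (*bpf_func)(const void *ctx, const struct demo_insn *insn);
	const struct demo_insn *insnsi;
};

/* Same shape as BPF_PROG_RUN(): call through the pointer, pass the insns. */
#define DEMO_PROG_RUN(prog, ctx) ((prog)->bpf_func((ctx), (prog)->insnsi))

/* Stand-in "interpreter": it reads the instruction array it is given. */
unsigned int demo_interpret(const void *ctx, const struct demo_insn *insn)
{
	(void)ctx;
	return (unsigned int)insn[0].imm;
}

/* Stand-in "JIT image": native code has no use for the instruction array. */
unsigned int demo_jited(const void *ctx, const struct demo_insn *insn)
{
	(void)ctx;
	(void)insn;
	return 42;
}
```

Switching `bpf_func` from `demo_interpret` to `demo_jited` changes what runs without touching any call site, which is how the kernel can fall back to the interpreter on architectures without a JIT.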
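1.1.4、fd allocation
Once a BPF program has been loaded into kernel space, the kernel allocates a file descriptor (fd) for it and stores the prog pointer in the corresponding file->private_data, so that the program can be referenced later.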
int bpf_prog_new_fd(struct bpf_prog *prog)
{
return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
O_RDWR | O_CLOEXEC);
}
↓
int anon_inode_getfd(const char *name, const struct file_operations *fops,
void *priv, int flags)
{
int error, fd;
struct file *file;
error = get_unused_fd_flags(flags);
if (error < 0)
return error;
fd = error;
file = anon_inode_getfile(name, fops, priv, flags);
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto err_put_unused_fd;
}
fd_install(fd, file);
return fd;
err_put_unused_fd:
put_unused_fd(fd);
return error;
}
↓
struct file *anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags)
{
struct qstr this;
struct path path;
struct file *file;
if (IS_ERR(anon_inode_inode))
return ERR_PTR(-ENODEV);
if (fops->owner && !try_module_get(fops->owner))
return ERR_PTR(-ENOENT);
/*
* Link the inode to a directory entry by creating a unique name
* using the inode sequence number.
*/
file = ERR_PTR(-ENOMEM);
this.name = name;
this.len = strlen(name);
this.hash = 0;
path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
if (!path.dentry)
goto err_module;
path.mnt = mntget(anon_inode_mnt);
/*
* We know the anon_inode inode count is always greater than zero,
* so ihold() is safe.
*/
ihold(anon_inode_inode);
d_instantiate(path.dentry, anon_inode_inode);
file = alloc_file(&path, OPEN_FMODE(flags), fops);
if (IS_ERR(file))
goto err_dput;
file->f_mapping = anon_inode_inode->i_mapping;
file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
file->private_data = priv;
return file;
err_dput:
path_put(&path);
err_module:
module_put(fops->owner);
return file;
}
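1.2、bpf map operations
BPF maps are used in several scenarios:
- Interaction between a BPF program and user space: the program stores its results in a map, which user space then reads;
- Interaction inside BPF programs: for safety reasons a BPF program is not allowed to access global variables, so a map can serve as a global variable when one is needed;
- BPF tail call: a tail call jumps from one BPF program to another. The program first uses a map of type BPF_MAP_TYPE_PROG_ARRAY to locate the target program, then calls the bpf_tail_call() helper function to perform the tail call;
- Interaction between a BPF program and the rest of the kernel: a map can also act as the intermediary when exchanging data with kernel code outside the BPF program.
The map types currently supported: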
static int __init register_array_map(void)
{
bpf_register_map_type(&array_type);
bpf_register_map_type(&percpu_array_type);
return 0;
}
static int __init register_cgroup_array_map(void)
{
bpf_register_map_type(&cgroup_array_type);
return 0;
}
static int __init register_htab_map(void)
{
bpf_register_map_type(&htab_type);
bpf_register_map_type(&htab_percpu_type);
return 0;
}
static int __init register_perf_event_array_map(void)
{
bpf_register_map_type(&perf_event_array_type);
return 0;
}
static int __init register_prog_array_map(void)
{
bpf_register_map_type(&prog_array_type);
return 0;
}
static int __init register_stack_map(void)
{
bpf_register_map_type(&stack_map_type);
return 0;
}
Whatever its type, a map is always used through key-value pairs.
1.2.1、Creating a map
If the user-space BPF C program defines maps, they are compiled into __section("maps").
When the user-space loader loads the BPF program, it first walks the entries in __section("maps") and issues bpf() system calls to create the map objects.
static int map_create(union bpf_attr *attr)
{
struct bpf_map *map;
int err;
err = CHECK_ATTR(BPF_MAP_CREATE);
if (err)
return -EINVAL;
/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
/* (1) allocate the map according to its type */
map = find_and_alloc_map(attr);
if (IS_ERR(map))
return PTR_ERR(map);
atomic_set(&map->refcnt, 1);
atomic_set(&map->usercnt, 1);
/* (2) charge the map's pages against the memlock limit (RLIMIT_MEMLOCK) of the creating user */
err = bpf_map_charge_memlock(map);
if (err)
goto free_map_nouncharge;
/* (3) allocate a file descriptor for the map */
err = bpf_map_new_fd(map);
if (err < 0)
/* failed to allocate fd */
goto free_map;
return err;
free_map:
bpf_map_uncharge_memlock(map);
free_map_nouncharge:
map->ops->map_free(map);
return err;
}
|→
static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
{
struct bpf_map_type_list *tl;
struct bpf_map *map;
list_for_each_entry(tl, &bpf_map_types, list_node) {
if (tl->type == attr->map_type) {
/* (1.1) find the tl that matches the type and allocate the map */
map = tl->ops->map_alloc(attr);
if (IS_ERR(map))
return map;
map->ops = tl->ops;
map->map_type = attr->map_type;
return map;
}
}
return ERR_PTR(-EINVAL);
}
|→
int bpf_map_new_fd(struct bpf_map *map)
{
/* (3.1) allocate an fd for the map and store the map pointer in file->private_data */
return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
O_RDWR | O_CLOEXEC);
}
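1、BPF_MAP_TYPE_ARRAY
Taking a map of type BPF_MAP_TYPE_ARRAY as an example, let us look at how a map is allocated.
The attr members passed in from user space have the following meanings:
attr->map_type: the map type;
attr->key_size: the size of the key;
attr->value_size: the size of the value;
attr->max_entries: the number of entries (key-value pairs) to store.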
static const struct bpf_map_ops array_ops = {
.map_alloc = array_map_alloc,
.map_free = array_map_free,
.map_get_next_key = array_map_get_next_key,
.map_lookup_elem = array_map_lookup_elem,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
};
static struct bpf_map_type_list array_type __read_mostly = {
.ops = &array_ops,
.type = BPF_MAP_TYPE_ARRAY,
};
↓
static struct bpf_map *array_map_alloc(union bpf_attr *attr)
{
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
u32 elem_size, index_mask, max_entries;
bool unpriv = !capable(CAP_SYS_ADMIN);
struct bpf_array *array;
u64 array_size, mask64;
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
attr->value_size == 0 || attr->map_flags)
return ERR_PTR(-EINVAL);
if (attr->value_size >= 1 << (KMALLOC_SHIFT_MAX - 1))
/* if value_size is bigger, the user space won't be able to
* access the elements.
*/
return ERR_PTR(-E2BIG);
/* (1.1.1) compute the per-element value size; the key needs neither sizing nor storage because it is simply the array index */
elem_size = round_up(attr->value_size, 8);
max_entries = attr->max_entries;
/* On 32 bit archs roundup_pow_of_two() with max_entries that has
* upper most bit set in u32 space is undefined behavior due to
* resulting 1U << 32, so do it manually here in u64 space.
*/
mask64 = fls_long(max_entries - 1);
mask64 = 1ULL << mask64;
mask64 -= 1;
index_mask = mask64;
if (unpriv) {
/* round up array size to nearest power of 2,
* since cpu will speculate within index_mask limits
*/
max_entries = index_mask + 1;
/* Check for overflows. */
if (max_entries < attr->max_entries)
return ERR_PTR(-E2BIG);
}
/* (1.1.2) compute the total size of bpf_array plus the value array; bpf_array embeds the generic map structure bpf_map */
array_size = sizeof(*array);
if (percpu)
array_size += (u64) max_entries * sizeof(void *);
else
array_size += (u64) max_entries * elem_size;
/* make sure there is no u32 overflow later in round_up() */
if (array_size >= U32_MAX - PAGE_SIZE)
return ERR_PTR(-ENOMEM);
/* allocate all map elements and zero-initialize them */
/* (1.1.3) allocate the bpf_array according to the total size */
array = bpf_map_area_alloc(array_size);
if (!array)
return ERR_PTR(-ENOMEM);
array->index_mask = index_mask;
array->map.unpriv_array = unpriv;
/* copy mandatory map attributes */
/* (1.1.4) copy the attributes from attr into array->map */
array->map.map_type = attr->map_type;
array->map.key_size = attr->key_size;
array->map.value_size = attr->value_size;
array->map.max_entries = attr->max_entries;
array->elem_size = elem_size;
if (!percpu)
goto out;
array_size += (u64) attr->max_entries * elem_size * num_possible_cpus();
if (array_size >= U32_MAX - PAGE_SIZE ||
elem_size > PCPU_MIN_UNIT_SIZE || bpf_array_alloc_percpu(array)) {
bpf_map_area_free(array);
return ERR_PTR(-ENOMEM);
}
out:
array->map.pages = round_up(array_size, PAGE_SIZE) >> PAGE_SHIFT;
return &array->map;
}
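The index_mask arithmetic in array_map_alloc() is compact, so a small worked sketch may help. The helpers below are user-space stand-ins with hypothetical names (`demo_fls`, `demo_index_mask`, `demo_elem_size`); they reproduce only the two calculations, not the allocation itself.

```c
#include <stdint.h>

/* Position of the highest set bit, counted from 1; 0 when x == 0.
 * This mirrors what fls_long() returns for 32-bit inputs. */
unsigned int demo_fls(uint32_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Mask covering the next power of two >= max_entries, minus one.
 * Done in 64 bits so that 1 << 32 is well defined.
 * max_entries == 0 is not meaningful here; the kernel rejects it earlier. */
uint32_t demo_index_mask(uint32_t max_entries)
{
	uint64_t mask64 = 1ULL << demo_fls(max_entries - 1);

	return (uint32_t)(mask64 - 1);
}

/* Per-element size: value_size rounded up to a multiple of 8 bytes. */
uint32_t demo_elem_size(uint32_t value_size)
{
	return (value_size + 7u) & ~7u;
}
```

For example, a map with 5 entries gets a mask of 7, so any index ANDed with the mask stays inside the 8 slots that an unprivileged map is rounded up to. That is what keeps speculative out-of-bounds accesses within the allocation.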
2、BPF_MAP_TYPE_HASH
Taking a map of type BPF_MAP_TYPE_HASH as an example, let us look at how a map is allocated:
static const struct bpf_map_ops htab_ops = {
.map_alloc = htab_map_alloc,
.map_free = htab_map_free,
.map_get_next_key = htab_map_get_next_key,
.map_lookup_elem = htab_map_lookup_elem,
.map_update_elem = htab_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
};
static struct bpf_map_type_list htab_type __read_mostly = {
.ops = &htab_ops,
.type = BPF_MAP_TYPE_HASH,
};
↓
static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
{
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_HASH;
struct bpf_htab *htab;
int err, i;
u64 cost;
if (attr->map_flags & ~BPF_F_NO_PREALLOC)
/* reserved bits should not be used */
return ERR_PTR(-EINVAL);
/* (1.1.1) a hash map keeps its elements on linked lists, so the bpf_htab structure has a fixed size and is allocated first */
htab = kzalloc(sizeof(*htab), GFP_USER);
if (!htab)
return ERR_PTR(-ENOMEM);
/* mandatory map attributes */
htab->map.map_type = attr->map_type;
htab->map.key_size = attr->key_size;
htab->map.value_size = attr->value_size;
htab->map.max_entries = attr->max_entries;
htab->map.map_flags = attr->map_flags;
/* check sanity of attributes.
* value_size == 0 may be allowed in the future to use map as a set
*/
err = -EINVAL;
if (htab->map.max_entries == 0 || htab->map.key_size == 0 ||
htab->map.value_size == 0)
goto free_htab;
/* hash table size must be power of 2 */
/* (1.1.2) the number of bucket list heads is max_entries rounded up to the next power of two */
htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
err = -E2BIG;
if (htab->map.key_size > MAX_BPF_STACK)
/* eBPF programs initialize keys on stack, so they cannot be
* larger than max stack size
*/
goto free_htab;
if (htab->map.value_size >= (1 << (KMALLOC_SHIFT_MAX - 1)) -
MAX_BPF_STACK - sizeof(struct htab_elem))
/* if value_size is bigger, the user space won't be able to
* access the elements via bpf syscall. This check also makes
* sure that the elem_size doesn't overflow and it's
* kmalloc-able later in htab_map_update_elem()
*/
goto free_htab;
if (percpu && round_up(htab->map.value_size, 8) > PCPU_MIN_UNIT_SIZE)
/* make sure the size for pcpu_alloc() is reasonable */
goto free_htab;
/* (1.1.3) size of one hash element = sizeof(struct htab_elem) + key_size + value_size, with key and value each rounded up to 8 bytes */
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
if (percpu)
htab->elem_size += sizeof(void *);
else
htab->elem_size += round_up(htab->map.value_size, 8);
/* prevent zero size kmalloc and check for u32 overflow */
if (htab->n_buckets == 0 ||
htab->n_buckets > U32_MAX / sizeof(struct bucket))
goto free_htab;
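/* (1.1.4) total memory cost = bucket_size * n_buckets + elem_size * max_entries + extra_element_size,
where extra_element_size = elem_size * num_possible_cpus();
for per-CPU maps the extra term is instead round_up(value_size, 8) * num_possible_cpus() * max_entries
*/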
cost = (u64) htab->n_buckets * sizeof(struct bucket) +
(u64) htab->elem_size * htab->map.max_entries;
if (percpu)
cost += (u64) round_up(htab->map.value_size, 8) *
num_possible_cpus() * htab->map.max_entries;
else
cost += (u64) htab->elem_size * num_possible_cpus();
if (cost >= U32_MAX - PAGE_SIZE)
/* make sure page count doesn't overflow */
goto free_htab;
htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
/* if map size is larger than memlock limit, reject it early */
err = bpf_map_precharge_memlock(htab->map.pages);
if (err)
goto free_htab;
err = -ENOMEM;
/* (1.1.5) allocate the bucket array */
htab->buckets = bpf_map_area_alloc(htab->n_buckets *
sizeof(struct bucket));
if (!htab->buckets)
goto free_htab;
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_HEAD(&htab->buckets[i].head);
raw_spin_lock_init(&htab->buckets[i].lock);
}
/* (1.1.6) allocate the extra elements */
if (!percpu) {
err = alloc_extra_elems(htab);
if (err)
goto free_buckets;
}
/* (1.1.7) allocate the elements and spread them evenly across the per-CPU lists of htab->freelist */
if (!(attr->map_flags & BPF_F_NO_PREALLOC)) {
err = prealloc_elems_and_freelist(htab);
if (err)
goto free_extra_elems;
}
return &htab->map;
free_extra_elems:
free_percpu(htab->extra_elems);
free_buckets:
bpf_map_area_free(htab->buckets);
free_htab:
kfree(htab);
return ERR_PTR(err);
}
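The size accounting in htab_map_alloc() can be replayed with plain arithmetic. The sketch below is a user-space approximation with hypothetical names; in particular `hdr_size` and `bucket_size` are parameters standing in for sizeof(struct htab_elem) and sizeof(struct bucket), whose real values depend on the kernel build.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest power of two >= n, computed in 64 bits to avoid overflow. */
uint64_t demo_roundup_pow2(uint32_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

uint64_t demo_round_up8(uint64_t x)
{
	return (x + 7) & ~(uint64_t)7;
}

struct demo_htab_sizes {
	uint64_t n_buckets;
	uint64_t elem_size;
	uint64_t cost;		/* bytes, before rounding up to pages */
};

/* Follows the same steps as the listing: buckets, element size, cost. */
struct demo_htab_sizes demo_htab_cost(uint32_t key_size, uint32_t value_size,
				      uint32_t max_entries, bool percpu,
				      uint32_t ncpus, uint32_t hdr_size,
				      uint32_t bucket_size)
{
	struct demo_htab_sizes s;

	s.n_buckets = demo_roundup_pow2(max_entries);

	s.elem_size = hdr_size + demo_round_up8(key_size);
	if (percpu)
		s.elem_size += sizeof(void *);	/* pointer to per-CPU values */
	else
		s.elem_size += demo_round_up8(value_size);

	s.cost = s.n_buckets * bucket_size + s.elem_size * max_entries;
	if (percpu)
		s.cost += demo_round_up8(value_size) * ncpus * max_entries;
	else
		s.cost += s.elem_size * ncpus;	/* one extra element per CPU */
	return s;
}
```

With a 4-byte key, a 10-byte value, 100 entries, 4 CPUs and assumed sizes of 32 and 16 bytes for the element header and the bucket, this gives 128 buckets, 56-byte elements and a cost of 7872 bytes.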
1.2.2、Looking up a map
A lookup finds the value that corresponds to a given key.
static int map_lookup_elem(union bpf_attr *attr)
{
void __user *ukey = u64_to_ptr(attr->key);
void __user *uvalue = u64_to_ptr(attr->value);
int ufd = attr->map_fd;
struct bpf_map *map;
void *key, *value, *ptr;
u32 value_size;
struct fd f;
int err;
if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
return -EINVAL;
f = fdget(ufd);
map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
err = -ENOMEM;
key = kmalloc(map->key_size, GFP_USER);
if (!key)
goto err_put;
err = -EFAULT;
if (copy_from_user(key, ukey, map->key_size) != 0)
goto free_key;
if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
value_size = round_up(map->value_size, 8) * num_possible_cpus();
else
value_size = map->value_size;
err = -ENOMEM;
value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
if (!value)
goto free_key;
/* (1) special handling for a few map types */
if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH) {
err = bpf_percpu_hash_copy(map, key, value);
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
err = bpf_percpu_array_copy(map, key, value);
} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
err = bpf_stackmap_copy(map, key, value);
/* (2) every other map type goes through map->ops->map_lookup_elem() */
} else {
rcu_read_lock();
ptr = map->ops->map_lookup_elem(map, key);
/* (3.1) copy the result into value */
if (ptr)
memcpy(value, ptr, value_size);
rcu_read_unlock();
err = ptr ? 0 : -ENOENT;
}
if (err)
goto free_value;
err = -EFAULT;
/* (3.2) copy the value back to user space */
if (copy_to_user(uvalue, value, value_size) != 0)
goto free_value;
err = 0;
free_value:
kfree(value);
free_key:
kfree(key);
err_put:
fdput(f);
return err;
}
1、BPF_MAP_TYPE_ARRAY
A map of type BPF_MAP_TYPE_ARRAY ends up in array_map_lookup_elem():
static void *array_map_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
/* (2.1) the key is the index */
u32 index = *(u32 *)key;
if (unlikely(index >= array->map.max_entries))
return NULL;
/* (2.2) use the index to locate the value inside the array->value[] array */
return array->value + array->elem_size * (index & array->index_mask);
}
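As a rough illustration of the lookup above, here is a user-space sketch with a hypothetical `demo_array` structure. It uses a separate `value` pointer where the kernel's bpf_array has a trailing array, but the bounds check and the masked index computation follow the listing.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_array {
	uint32_t max_entries;
	uint32_t index_mask;	/* next power of two >= max_entries, minus 1 */
	uint32_t elem_size;	/* value size rounded up to 8 bytes */
	unsigned char *value;	/* backing storage for all elements */
};

/* Return a pointer to the slot for *key, or NULL when out of range. */
void *demo_array_lookup(struct demo_array *array, const void *key)
{
	uint32_t index = *(const uint32_t *)key;

	if (index >= array->max_entries)
		return NULL;

	/* The mask is redundant after the bounds check in normal execution;
	 * it is there to bound the address under CPU speculation. */
	return array->value +
	       (size_t)array->elem_size * (index & array->index_mask);
}
```

Because the key is the index, no search is needed: element 3 of a map with 8-byte elements is simply 24 bytes past the start of the storage.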
2、BPF_MAP_TYPE_HASH
A map of type BPF_MAP_TYPE_HASH ends up in htab_map_lookup_elem():
static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
{
struct htab_elem *l = __htab_map_lookup_elem(map, key);
if (l)
return l->key + round_up(map->key_size, 8);
return NULL;
}
↓
static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct hlist_head *head;
struct htab_elem *l;
u32 hash, key_size;
/* Must be called with rcu_read_lock. */
WARN_ON_ONCE(!rcu_read_lock_held());
key_size = map->key_size;
/* (2.1) compute the hash from the key */
hash = htab_map_hash(key, key_size);
/* (2.2) use the hash to select the bucket list head */
head = select_bucket(htab, hash);
/* (2.3) search the bucket list for the htab_elem with an equal key; return NULL if none is found */
l = lookup_elem_raw(head, hash, key, key_size);
return l;
}
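The three lookup steps (hash the key, select a bucket, walk the chain) can be sketched in user space as follows. Everything here is hypothetical scaffolding: the key is fixed at 4 bytes, the table has 8 buckets, and FNV-1a is swapped in for the kernel's htab_map_hash(), which is a different hash function.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_N_BUCKETS 8	/* must be a power of two */

struct demo_elem {
	struct demo_elem *next;
	uint32_t hash;
	uint32_t key;		/* fixed 4-byte key to keep the sketch short */
	uint64_t value;
};

struct demo_htab {
	struct demo_elem *buckets[DEMO_N_BUCKETS];
};

/* FNV-1a, used only as a stand-in hash for this sketch. */
uint32_t demo_hash(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t h = 2166136261u;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* Push an element onto the head of its bucket's chain. */
void demo_htab_insert(struct demo_htab *t, struct demo_elem *e)
{
	struct demo_elem **head;

	e->hash = demo_hash(&e->key, sizeof(e->key));
	head = &t->buckets[e->hash & (DEMO_N_BUCKETS - 1)];
	e->next = *head;
	*head = e;
}

/* Hash the key, pick the bucket, then compare hash and key along the chain. */
uint64_t *demo_htab_lookup(struct demo_htab *t, uint32_t key)
{
	uint32_t hash = demo_hash(&key, sizeof(key));
	struct demo_elem *e;

	for (e = t->buckets[hash & (DEMO_N_BUCKETS - 1)]; e; e = e->next)
		if (e->hash == hash && e->key == key)
			return &e->value;
	return NULL;
}
```

Selecting the bucket with `hash & (n_buckets - 1)` is the reason the real allocation code insists that the bucket count be a power of two. The sketch omits the RCU protection and per-bucket locks that the kernel needs for concurrent access.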
1.2.3、BPF_FUNC_map_lookup_elem
User space looks up the value for a key through the bpf() system call. A BPF program also needs to find the address of a value from its key and then use it inside the program; it does so by calling the BPF_FUNC_map_lookup_elem helper function.
Taking perf_event as an example, let us look at how the BPF_FUNC_map_lookup_elem helper function is implemented:
static const struct bpf_verifier_ops perf_event_prog_ops = {
.get_func_proto = tp_prog_func_proto,
.is_valid_access = pe_prog_is_valid_access,
.convert_ctx_access = pe_prog_convert_ctx_access,
};
static struct bpf_prog_type_list perf_event_tl = {
.ops = &perf_event_prog_ops,
.type = BPF_PROG_TYPE_PERF_EVENT,
};
↓
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_perf_event_output:
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
default:
return tracing_func_proto(func_id);
}
}
↓
static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_map_lookup_elem:
return &bpf_map_lookup_elem_proto;
case BPF_FUNC_map_update_elem:
return &bpf_map_update_elem_proto;
case BPF_FUNC_map_delete_elem:
return &bpf_map_delete_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_tail_call:
return &bpf_tail_call_proto;
case BPF_FUNC_get_current_pid_tgid:
return &bpf_get_current_pid_tgid_proto;
case BPF_FUNC_get_current_task:
return &bpf_get_current_task_proto;
case BPF_FUNC_get_current_uid_gid:
return &bpf_get_current_uid_gid_proto;
case BPF_FUNC_get_current_comm:
return &bpf_get_current_comm_proto;
case BPF_FUNC_trace_printk:
return bpf_get_trace_printk_proto();
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_perf_event_read:
return &bpf_perf_event_read_proto;
case BPF_FUNC_probe_write_user:
return bpf_get_probe_write_proto();
case BPF_FUNC_current_task_under_cgroup:
return &bpf_current_task_under_cgroup_proto;
case BPF_FUNC_get_prandom_u32:
return &bpf_get_prandom_u32_proto;
default:
return NULL;
}
}
↓
const struct bpf_func_proto bpf_map_lookup_elem_proto = {
.func = bpf_map_lookup_elem,
.gpl_only = false,
.pkt_access = true,
.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
.arg1_type = ARG_CONST_MAP_PTR,
.arg2_type = ARG_PTR_TO_MAP_KEY,
};
↓
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
{
WARN_ON_ONCE(!rcu_read_lock_held());
return (unsigned long) map->ops->map_lookup_elem(map, key);
}
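Just like the bpf() system call, the helper ends up calling map->ops->map_lookup_elem(). The difference is that a BPF program gets back a pointer to the value, whereas the bpf() system call returns a copy of the value.
The other map helper functions, BPF_FUNC_map_update_elem and BPF_FUNC_map_delete_elem, work on the same principle.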
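The practical consequence of pointer versus copy can be shown with a small user-space model. The names below (`demo_slots`, `demo_lookup`, `demo_helper_lookup`, `demo_syscall_lookup`) are hypothetical; the model only assumes what the listings show, namely that both paths share one lookup backend and that the syscall path copies the result out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

uint64_t demo_slots[4];		/* the map's storage */

/* Shared backend, playing the role of map->ops->map_lookup_elem(). */
void *demo_lookup(uint32_t index)
{
	return index < 4 ? &demo_slots[index] : NULL;
}

/* Helper-style path: the program receives a pointer into the map. */
uint64_t *demo_helper_lookup(uint32_t index)
{
	return demo_lookup(index);
}

/* Syscall-style path: the value is copied into the caller's buffer. */
int demo_syscall_lookup(uint32_t index, uint64_t *out)
{
	void *ptr = demo_lookup(index);

	if (!ptr)
		return -ENOENT;
	memcpy(out, ptr, sizeof(*out));
	return 0;
}
```

A write through the pointer returned by `demo_helper_lookup()` changes the map in place, while the buffer filled by `demo_syscall_lookup()` is a snapshot that does not follow later updates. This is why a BPF program can update a counter without a separate update call, and why user space has to look the value up again to see changes.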
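1.3、obj pin
The kernel ties both bpf_prog and bpf_map to file descriptors. This has advantages, for example user space can apply the generic file operations to them. It also has drawbacks: an fd lives in the space of a single process, so other processes cannot use it, and once the owning process exits the objects can no longer be reached.
The kernel therefore also supports making bpf objects globally visible by binding them to a dedicated file system: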
# ls /sys/fs/bpf/
#
This involves two operations, pin and get.
1.3.1、bpf_obj_pin()
static int bpf_obj_pin(const union bpf_attr *attr)
{
if (CHECK_ATTR(BPF_OBJ))
return -EINVAL;
return bpf_obj_pin_user(attr->bpf_fd, u64_to_ptr(attr->pathname));
}
↓
int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
{
struct filename *pname;
enum bpf_type type;
void *raw;
int ret;
/* (1) fetch the pathname string from user space */
pname = getname(pathname);
if (IS_ERR(pname))
return PTR_ERR(pname);
/* (2) obtain the bpf_map/bpf_prog object from the fd */
raw = bpf_fd_probe_obj(ufd, &type);
if (IS_ERR(raw)) {
ret = PTR_ERR(raw);
goto out;
}
/* (3) create the file node and link it to the bpf object */
ret = bpf_obj_do_pin(pname, raw, type);
if (ret != 0)
bpf_any_put(raw, type);
out:
putname(pname);
return ret;
}
|→
static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
{
void *raw;
/* (2.1) first try to resolve the fd to a map object */
*type = BPF_TYPE_MAP;
raw = bpf_map_get_with_uref(ufd);
if (IS_ERR(raw)) {
/* (2.2) if that fails, try to resolve the fd to a prog object */
*type = BPF_TYPE_PROG;
raw = bpf_prog_get(ufd);
}
return raw;
}
|→
static int bpf_obj_do_pin(const struct filename *pathname, void *raw,
enum bpf_type type)
{
struct dentry *dentry;
struct inode *dir;
struct path path;
umode_t mode;
dev_t devt;
int ret;
/* (3.1) create the dentry */
dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
/* (3.2) the type is encoded in devt */
devt = MKDEV(UNNAMED_MAJOR, type);
ret = security_path_mknod(&path, dentry, mode, devt);
if (ret)
goto out;
dir = d_inode(path.dentry);
if (dir->i_op != &bpf_dir_iops) {
ret = -EPERM;
goto out;
}
/* (3.3) stash the object pointer raw in dentry->d_fsdata, then create the inode */
dentry->d_fsdata = raw;
ret = vfs_mknod(dir, dentry, mode, devt);
dentry->d_fsdata = NULL;
out:
done_path_create(&path, dentry);
return ret;
}
1.3.2、bpf_obj_get()
static int bpf_obj_get(const union bpf_attr *attr)
{
if (CHECK_ATTR(BPF_OBJ) || attr->bpf_fd != 0)
return -EINVAL;
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
}
↓
int bpf_obj_get_user(const char __user *pathname)
{
enum bpf_type type = BPF_TYPE_UNSPEC;
struct filename *pname;
int ret = -ENOENT;
void *raw;
/* (1) fetch the pathname string from user space */
pname = getname(pathname);
if (IS_ERR(pname))
return PTR_ERR(pname);
/* (2) from the path, find the bpf object's raw pointer and type in the corresponding inode */
raw = bpf_obj_do_get(pname, &type);
if (IS_ERR(raw)) {
ret = PTR_ERR(raw);
goto out;
}
/* (3) depending on the object type, allocate an fd for the bpf object in the current process */
if (type == BPF_TYPE_PROG)
ret = bpf_prog_new_fd(raw);
else if (type == BPF_TYPE_MAP)
ret = bpf_map_new_fd(raw);
else
goto out;
if (ret < 0)
bpf_any_put(raw, type);
out:
putname(pname);
return ret;
}
↓
static void *bpf_obj_do_get(const struct filename *pathname,
enum bpf_type *type)
{
struct inode *inode;
struct path path;
void *raw;
int ret;
/* (2.1) resolve the path to a dentry */
ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
if (ret)
return ERR_PTR(ret);
/* (2.2) get the inode from the dentry */
inode = d_backing_inode(path.dentry);
ret = inode_permission(inode, MAY_WRITE);
if (ret)
goto out;
/* (2.3) get the type from the inode */
ret = bpf_inode_type(inode, type);
if (ret)
goto out;
/* (2.4) get the raw pointer from the inode and the type */
raw = bpf_any_get(inode->i_private, *type);
if (!IS_ERR(raw))
touch_atime(&path);
path_put(&path);
return raw;
out:
path_put(&path);
return ERR_PTR(ret);
}
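The pin and get operations together behave like a path-indexed registry that holds its own reference to each object. The sketch below models only that idea in user space. The fixed-size table, the `demo_` names and the plain integer refcount are assumptions for illustration, and none of the VFS machinery (dentries, inodes, permissions) is represented.

```c
#include <stddef.h>
#include <string.h>

enum demo_bpf_type { DEMO_TYPE_UNSPEC, DEMO_TYPE_PROG, DEMO_TYPE_MAP };

struct demo_obj {
	int refcnt;		/* the object is freed when this reaches 0 */
};

struct demo_pin {
	const char *path;	/* NULL marks a free slot */
	enum demo_bpf_type type;
	struct demo_obj *raw;
};

#define DEMO_MAX_PINS 8
struct demo_pin demo_pins[DEMO_MAX_PINS];

/* Pin an object under a path; returns 0, or -1 if the path is taken or
 * the table is full. The pinned entry holds its own reference. */
int demo_obj_pin(const char *path, enum demo_bpf_type type,
		 struct demo_obj *raw)
{
	int i, free_slot = -1;

	for (i = 0; i < DEMO_MAX_PINS; i++) {
		if (!demo_pins[i].path) {
			if (free_slot < 0)
				free_slot = i;
		} else if (strcmp(demo_pins[i].path, path) == 0) {
			return -1;
		}
	}
	if (free_slot < 0)
		return -1;

	raw->refcnt++;
	demo_pins[free_slot].path = path;
	demo_pins[free_slot].type = type;
	demo_pins[free_slot].raw = raw;
	return 0;
}

/* Look an object up by path; the caller receives a new reference. */
struct demo_obj *demo_obj_get(const char *path, enum demo_bpf_type *type)
{
	int i;

	for (i = 0; i < DEMO_MAX_PINS; i++) {
		if (demo_pins[i].path &&
		    strcmp(demo_pins[i].path, path) == 0) {
			*type = demo_pins[i].type;
			demo_pins[i].raw->refcnt++;
			return demo_pins[i].raw;
		}
	}
	return NULL;
}
```

Because the pinned entry keeps a reference, the object survives after its creator drops the last fd, and a different process can later obtain a fresh handle from the path alone.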
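2、Tracing-type BPF programs
After the previous section, the bpf program and its maps have been loaded into the kernel. When does the bpf program actually get to do its job?
The subsystem that uses bpf has to attach the program to a suitable hook. When the code path containing the hook is executed, the hook fires and the BPF program runs.
The subsystems that currently use bpf fall into two broad categories:
- tracing: kprobe, tracepoint, perf_event
- filter: sk_filter, sched_cls, sched_act, xdp, cg_skb
Let us look closely at how the tracing subsystems use bpf. All tracing-type bpf operations go through perf.
2.1、Attaching a bpf program
After the perf_event_open() system call has created a perf_event and returned a file descriptor, the ioctl command PERF_EVENT_IOC_SET_BPF can be used to attach an already loaded bpf program to that perf_event.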
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
struct perf_event *event = file->private_data;
struct perf_event_context *ctx;
long ret;
ctx = perf_event_ctx_lock(event);
ret = _perf_ioctl(event, cmd, arg);
perf_event_ctx_unlock(event, ctx);
return ret;
}
↓
static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
void (*func)(struct perf_event *);
u32 flags = arg;
switch (cmd) {
case PERF_EVENT_IOC_ENABLE:
func = _perf_event_enable;
break;
case PERF_EVENT_IOC_DISABLE:
func = _perf_event_disable;
break;
case PERF_EVENT_IOC_RESET:
func = _perf_event_reset;
break;
case PERF_EVENT_IOC_REFRESH:
return _perf_event_refresh(event, arg);
case PERF_EVENT_IOC_PERIOD:
return perf_event_period(event, (u64 __user *)arg);
case PERF_EVENT_IOC_ID:
{
u64 id = primary_event_id(event);
if (copy_to_user((void __user *)arg, &id, sizeof(id)))
return -EFAULT;
return 0;
}
case PERF_EVENT_IOC_SET_OUTPUT:
{
int ret;
if (arg != -1) {
struct perf_event *output_event;
struct fd output;
ret = perf_fget_light(arg, &output);
if (ret)
return ret;
output_event = output.file->private_data;
ret = perf_event_set_output(event, output_event);
fdput(output);
} else {
ret = perf_event_set_output(event, NULL);
}
return ret;
}
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
case PERF_EVENT_IOC_SET_BPF:
return perf_event_set_bpf_prog(event, arg);
case PERF_EVENT_IOC_PAUSE_OUTPUT: {
struct ring_buffer *rb;
rcu_read_lock();
rb = rcu_dereference(event->rb);
if (!rb || !rb->nr_pages) {
rcu_read_unlock();
return -EINVAL;
}
rb_toggle_paused(rb, !!arg);
rcu_read_unlock();
return 0;
}
default:
return -ENOTTY;
}
if (flags & PERF_IOC_FLAG_GROUP)
perf_event_for_each(event, func);
else
perf_event_for_each_child(event, func);
return 0;
}
↓
static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
{
bool is_kprobe, is_tracepoint;
struct bpf_prog *prog;
/* (1) a perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE must be given a BPF prog of type BPF_PROG_TYPE_PERF_EVENT:
event->prog = prog;
*/
if (event->attr.type == PERF_TYPE_HARDWARE ||
event->attr.type == PERF_TYPE_SOFTWARE)
return perf_event_set_bpf_handler(event, prog_fd);
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -EINVAL;
if (event->tp_event->prog)
return -EEXIST;
is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
if (!is_kprobe && !is_tracepoint)
/* bpf programs can only be attached to u/kprobe or tracepoint */
return -EINVAL;
prog = bpf_prog_get(prog_fd);
if (IS_ERR(prog))
return PTR_ERR(prog);
/* (2) a perf_event backed by TRACE_EVENT_FL_TRACEPOINT must be given a BPF prog of type BPF_PROG_TYPE_TRACEPOINT;
a perf_event backed by TRACE_EVENT_FL_UKPROBE must be given a BPF prog of type BPF_PROG_TYPE_KPROBE:
event->tp_event->prog = prog;
*/
if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
(is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
/* valid fd, but invalid bpf program type */
bpf_prog_put(prog);
return -EINVAL;
}
/* (3) for a tracepoint perf_event, the largest context offset the program accesses (max_ctx_offset) must not exceed the size of the tracepoint's data, otherwise the access is treated as illegal */
if (is_tracepoint) {
int off = trace_event_get_offsets(event->tp_event);
if (prog->aux->max_ctx_offset > off) {
bpf_prog_put(prog);
return -EACCES;
}
}
event->tp_event->prog = prog;
event->tp_event->bpf_prog_owner = event;
return 0;
}
As shown above, the rules for attaching a bpf_prog to a perf_event are:
- A perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE requires a BPF prog of type BPF_PROG_TYPE_PERF_EVENT. event->prog = prog;
- A PERF_TYPE_TRACEPOINT perf_event implemented by TRACE_EVENT_FL_TRACEPOINT requires a BPF prog of type BPF_PROG_TYPE_TRACEPOINT. event->tp_event->prog = prog;
- A PERF_TYPE_TRACEPOINT perf_event implemented by TRACE_EVENT_FL_UKPROBE requires a BPF prog of type BPF_PROG_TYPE_KPROBE. event->tp_event->prog = prog;
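The three rules above amount to a small decision function. The sketch below restates them in user-space C; the `demo_`/`DEMO_` enums and flag values are hypothetical stand-ins for the kernel's constants, and the function only answers "which program type is required", leaving out the actual attachment and the context-size check.

```c
/* Hypothetical stand-ins for the perf event types. */
enum demo_perf_type {
	DEMO_PERF_HARDWARE,
	DEMO_PERF_SOFTWARE,
	DEMO_PERF_TRACEPOINT,
	DEMO_PERF_OTHER
};

/* Hypothetical stand-ins for the BPF program types. */
enum demo_prog_type {
	DEMO_PROG_NONE,		/* no BPF program can be attached */
	DEMO_PROG_PERF_EVENT,
	DEMO_PROG_TRACEPOINT,
	DEMO_PROG_KPROBE
};

/* Hypothetical flag bits describing how the trace event is implemented. */
#define DEMO_FL_TRACEPOINT 0x1u
#define DEMO_FL_UKPROBE    0x2u

/* Return the program type that an event of this kind accepts. */
enum demo_prog_type demo_required_prog_type(enum demo_perf_type type,
					    unsigned int tp_flags)
{
	if (type == DEMO_PERF_HARDWARE || type == DEMO_PERF_SOFTWARE)
		return DEMO_PROG_PERF_EVENT;
	if (type != DEMO_PERF_TRACEPOINT)
		return DEMO_PROG_NONE;
	if (tp_flags & DEMO_FL_UKPROBE)
		return DEMO_PROG_KPROBE;
	if (tp_flags & DEMO_FL_TRACEPOINT)
		return DEMO_PROG_TRACEPOINT;
	return DEMO_PROG_NONE;
}
```

Note that kprobe-backed and tracepoint-backed events both appear to user space as PERF_TYPE_TRACEPOINT; only the flags on the underlying trace event tell them apart.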
2.2、Running a bpf program
The execution paths differ between the kinds of perf_event, so we describe them separately.
1、perf_events of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE.
static void bpf_overflow_handler(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs)
{
/* (1) build the context */
struct bpf_perf_event_data_kern ctx = {
.data = data,
.regs = regs,
};
int ret = 0;
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
goto out;
rcu_read_lock();
/* (2) run the bpf_prog */
ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
rcu_read_unlock();
out:
__this_cpu_dec(bpf_prog_active);
preempt_enable();
if (!ret)
return;
/* (3) perf_event data processing */
event->orig_overflow_handler(event, data, regs);
}
2、PERF_TYPE_TRACEPOINT perf_events implemented by TRACE_EVENT_FL_TRACEPOINT.
static notrace void
perf_trace_##call(void *__data, proto)
{
struct trace_event_call *event_call = __data;
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;
struct trace_event_raw_##call *entry;
struct bpf_prog *prog = event_call->prog;
struct pt_regs *__regs;
u64 __count = 1;
struct task_struct *__task = NULL;
struct hlist_head *head;
int __entry_size;
int __data_size;
int rctx;
__data_size = trace_event_get_offsets_##call(&__data_offsets, args);
head = this_cpu_ptr(event_call->perf_events);
if (!prog && __builtin_constant_p(!__task) && !__task &&
hlist_empty(head))
return;
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),
sizeof(u64));
__entry_size -= sizeof(u32);
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx);
if (!entry)
return;
perf_fetch_caller_regs(__regs);
tstruct
{ assign; }
perf_trace_run_bpf_submit(entry, __entry_size, rctx,
event_call, __count, __regs,
head, __task);
}
↓
void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct trace_event_call *call, u64 count,
struct pt_regs *regs, struct hlist_head *head,
struct task_struct *task)
{
struct bpf_prog *prog = call->prog;
/* (1) run the bpf_prog */
if (prog) {
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(prog, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
return;
}
}
/* (2) perf_event data processing */
perf_tp_event(call->event.type, count, raw_data, size, regs, head,
rctx, task);
}
↓
unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
{
unsigned int ret;
if (in_nmi()) /* not supported yet */
return 1;
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
/*
* since some bpf program is already running on this cpu,
* don't call into another bpf program (same or different)
* and don't send kprobe event into ring-buffer,
* so return zero here
*/
ret = 0;
goto out;
}
rcu_read_lock();
/* (1.1) invoke the bpf_prog */
ret = BPF_PROG_RUN(prog, ctx);
rcu_read_unlock();
out:
__this_cpu_dec(bpf_prog_active);
preempt_enable();
return ret;
}
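The bpf_prog_active counter in trace_call_bpf() is a re-entrancy guard: if a BPF program is already running on the CPU, a nested invocation is refused and reported as 0. A single-threaded user-space sketch of the same guard follows. The global `demo_prog_active` stands in for the per-CPU counter, and the other names are hypothetical; preemption control and RCU are left out.

```c
#include <stddef.h>

int demo_prog_active;	/* stands in for the per-CPU bpf_prog_active */

/* Run prog unless another program is already running; nested calls get 0. */
unsigned int demo_call_prog(unsigned int (*prog)(void *), void *ctx)
{
	unsigned int ret;

	if (++demo_prog_active != 1) {
		/* Something is already running: do not nest. */
		ret = 0;
		goto out;
	}
	ret = prog(ctx);
out:
	demo_prog_active--;
	return ret;
}

/* A program that simply returns a value. */
unsigned int demo_inner(void *ctx)
{
	(void)ctx;
	return 7;
}

/* A program that, while running, triggers another program invocation. */
unsigned int demo_outer(void *ctx)
{
	unsigned int nested = demo_call_prog(demo_inner, ctx);

	/* Report 1 if the nested call was refused, 2 if it actually ran. */
	return nested == 0 ? 1 : 2;
}
```

Calling `demo_call_prog(demo_inner, NULL)` directly yields 7, whereas the nested attempt made inside `demo_outer` is refused, so `demo_call_prog(demo_outer, NULL)` yields 1. The increment and decrement are always paired, so the counter is back at zero afterwards.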
3、PERF_TYPE_TRACEPOINT perf_events implemented by TRACE_EVENT_FL_UKPROBE.
The kprobe implementation:
static void
kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
    struct trace_event_call *call = &tk->tp.call;
    struct bpf_prog *prog = call->prog;
    struct kprobe_trace_entry_head *entry;
    struct hlist_head *head;
    int size, __size, dsize;
    int rctx;
    /* (1) Run the attached bpf_prog; a return value of 0 drops the event */
    if (prog && !trace_call_bpf(prog, regs))
        return;
    head = this_cpu_ptr(call->perf_events);
    if (hlist_empty(head))
        return;
    dsize = __get_data_size(&tk->tp, regs);
    __size = sizeof(*entry) + tk->tp.size + dsize;
    size = ALIGN(__size + sizeof(u32), sizeof(u64));
    size -= sizeof(u32);
    entry = perf_trace_buf_alloc(size, NULL, &rctx);
    if (!entry)
        return;
    entry->ip = (unsigned long)tk->rp.kp.addr;
    memset(&entry[1], 0, dsize);
    store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
    /* (2) Hand the data over to perf_event processing */
    perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
                          head, NULL);
}
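The two early returns in kprobe_perf_func() amount to a small decision rule. The following user-space sketch restates it under simplifying assumptions: `model_should_submit`, `drop_all` and `pass_all` are hypothetical names, the program is called directly instead of through trace_call_bpf(), and a plain count stands in for the per-CPU perf_events list.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int (*prog_fn)(void *regs);

/*
 * Returns true when the handler would go on to build and submit a perf
 * record, false when it would return early.
 */
static bool model_should_submit(prog_fn prog, void *regs, size_t nr_perf_events)
{
    /* A program is attached and returned 0: the event is dropped. */
    if (prog && !prog(regs))
        return false;
    /* No perf_event is listening on this CPU (hlist_empty(head)). */
    if (nr_perf_events == 0)
        return false;
    return true;
}

/* Example programs: one rejects every event, one accepts every event. */
static unsigned int drop_all(void *regs) { (void)regs; return 0; }
static unsigned int pass_all(void *regs) { (void)regs; return 1; }
```

With no program attached (`prog == NULL`) the outcome depends only on whether a perf_event is listening. With a program attached, its return value acts as a filter in front of the perf path.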
Implementation for the kretprobe type:
static void
kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
                    struct pt_regs *regs)
{
    struct trace_event_call *call = &tk->tp.call;
    struct bpf_prog *prog = call->prog;
    struct kretprobe_trace_entry_head *entry;
    struct hlist_head *head;
    int size, __size, dsize;
    int rctx;
    /* (1) Run the attached bpf_prog; a return value of 0 drops the event */
    if (prog && !trace_call_bpf(prog, regs))
        return;
    head = this_cpu_ptr(call->perf_events);
    if (hlist_empty(head))
        return;
    dsize = __get_data_size(&tk->tp, regs);
    __size = sizeof(*entry) + tk->tp.size + dsize;
    size = ALIGN(__size + sizeof(u32), sizeof(u64));
    size -= sizeof(u32);
    entry = perf_trace_buf_alloc(size, NULL, &rctx);
    if (!entry)
        return;
    entry->func = (unsigned long)tk->rp.kp.addr;
    entry->ret_ip = (unsigned long)ri->ret_addr;
    store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
    /* (2) Hand the data over to perf_event processing */
    perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
                          head, NULL);
}
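All three handlers above compute the buffer size the same way: add sizeof(u32), round up to a multiple of sizeof(u64), then subtract sizeof(u32) again. A plausible reading is that perf stores a u32 length field in front of the raw data, so the payload is sized so that length field plus payload ends on a u64 boundary. The sketch below reproduces only the arithmetic; `align_up` and `model_entry_size` are hypothetical helper names, with `align_up` mirroring what the kernel's ALIGN() does for power-of-two alignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/*
 * Size requested from perf_trace_buf_alloc() for a payload of `payload`
 * bytes: the u32 is counted for alignment purposes but is not part of
 * the returned size.
 */
static size_t model_entry_size(size_t payload)
{
    size_t size = align_up(payload + sizeof(uint32_t), sizeof(uint64_t));

    return size - sizeof(uint32_t);
}
```

For example, a 21-byte payload becomes 25 with the u32 added, rounds up to 32, and yields a 28-byte entry; a 20-byte payload already gives 24 with the u32 and stays at 20.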
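3. Filter-type BPF programs
Not analyzed for now.
Original title: BPF內(nèi)核實(shí)現(xiàn)詳解 (A Detailed Look at the BPF Kernel Implementation)
Source: WeChat ID LinuxDev, WeChat official account Linux閱碼場. Please credit the source when reposting.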