
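The BPF System Call and Tracing-Type BPF Programs

Source: Linux閱碼場 | Author: Linux閱碼場 | 2022-03-14 16:42

Weilin (偉林) is a mid-career programmer who has worked in the telecom, mobile phone, security, and chip industries and still works on Linux development. He enjoys sharing Linux knowledge; his blog is on CSDN under the name pwl999. Follows are welcome!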

Contents

1. The bpf() system call

1.1. Loading a BPF program

1.1.1. BPF memory allocation

1.1.2. BPF verifier

1.1.3. BPF JIT/kernel interpreter

1.1.4. fd allocation

1.2. BPF map operations

1.2.1. Creating a map

1.2.2. Looking up a map

1.2.3. BPF_FUNC_map_lookup_elem

1.3. obj pin

1.3.1. bpf_obj_pin()

1.3.2. bpf_obj_get()

2. Tracing-type BPF programs

2.1. Attaching a BPF program

2.2. Running a BPF program

3. Filter-type BPF programs

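BPF stands for Berkeley Packet Filter, so the name itself says that it grew out of packet filtering. If you have no intuition for BPF yet, read the reference documents first: "3.1, Berkeley Packet Filter (BPF) (Kernel Document)" and "3.2, BPF and XDP Reference Guide".

At its core, BPF is a technique for injecting code into the kernel:

  • The kernel implements a cBPF/eBPF virtual machine;

  • User space writes the code to run in C and compiles it to BPF object code with the Clang/LLVM compiler;

  • User space injects the BPF object code into the kernel through the bpf() system call;

  • The kernel uses a JIT (Just-In-Time) compiler to translate the BPF object code into native instructions; if the current architecture has no JIT support, the kernel falls back to an interpreter, which runs less efficiently;

  • The kernel provides a set of hooks in packet filtering, tracing, and similar subsystems where BPF code can run. The following BPF program types are currently supported: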

static int __init register_kprobe_prog_ops(void)
{
  bpf_register_prog_type(&kprobe_tl);
  bpf_register_prog_type(&tracepoint_tl);
  bpf_register_prog_type(&perf_event_tl);
  return 0;
}

static int __init register_sk_filter_ops(void)
{
  bpf_register_prog_type(&sk_filter_type);
  bpf_register_prog_type(&sched_cls_type);
  bpf_register_prog_type(&sched_act_type);
  bpf_register_prog_type(&xdp_type);
  bpf_register_prog_type(&cg_skb_type);

  return 0;
}

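What is the benefit of BPF? It offers a way to change the kernel's processing policy flexibly without modifying kernel code.

This is very useful in packet filtering and system tracing, where rules change frequently. If the policy lived purely in user space, all the data would have to be copied to user space, which is expensive; if the policy lived in the kernel, every change would mean modifying kernel code and rebuilding the kernel, which easily introduces security problems. BPF's niche as a kernel code injection technique is that it strikes a balance between the two.

SystemTap grew by solving this same problem. It injects code into the kernel in the form of kernel modules (ko), which is a bit clumsy but solves real problems.

SystemTap works by translating script statements into C and compiling them into a kernel module. Once the module is loaded, every probed event is attached to the kernel as a Kprobe hook; when an event fires on any processor, the handler on the corresponding hook runs. Finally, when the SystemTap session ends, the hooks are removed from the kernel and the module is unloaded. The whole process is driven by a single command, stap.

Since BPF is a technique for injecting code into the kernel, security is the top concern. We normally guard against others injecting code into the kernel through vulnerabilities, and deliberately opening a door for injection would otherwise be an open invitation. The kernel therefore lays down many rules that restrict BPF code and make sure its bugs cannot affect the kernel: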

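  • A BPF program may contain at most BPF_MAXINSNS (4K) instructions, and the total number of instructions processed along all its paths may not exceed 32K (the 4.9 kernel raised this value to 96K);

  • Loops are forbidden in BPF code, so a bug can never hang the kernel in an infinite loop. The total number of branches to analyze in a BPF program is also capped at 1K;

  • To limit its scope, BPF code cannot access global variables, only local ones. A BPF program has just 512 bytes of stack. It is handed a ctx pointer at entry, and its data accesses are confined to ctx and the local variables on its stack;

  • If a BPF program needs global data, it can only use BPF map objects. A BPF map can be accessed by user space, BPF programs, and kernel code alike; BPF programs access maps through helper functions;

  • Older versions do not support BPF-to-BPF function calls, so every BPF function must be declared always_inline. BPF-to-BPF calls are supported only from Linux 4.16 and LLVM 6.0 onward;

  • Even without function calls, BPF can use the tail call mechanism to jump directly from one BPF program to another. It finds the other program's pointer through a map of type BPF_MAP_TYPE_PROG_ARRAY. The number of such jumps is also limited, to 32;

  • BPF programs can call certain kernel functions (helper functions) to get work done;

  • Some architectures (64-bit x86_64, arm64, ppc64, s390x, mips64, sparc64, and 32-bit arm) already have a BPF JIT, which translates BPF code into native code efficiently and almost one-to-one (the eBPF instruction set was designed to closely resemble modern arm/x86 architectures, with a similar ABI). If the current architecture has no JIT support, the kernel interpreter must be used instead;

  • The kernel can further harden BPF by extra means, mainly: locking the pages of the BPF program image and the JIT image read-only, blinding constants during JIT compilation (constant blinding), and restricting permission to use the bpf() system call.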

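These safety rules are checked mainly when BPF code is loaded, by the BPF verifier, in roughly two steps:

  • Step one walks the program's code paths with a depth-first search (DFS) and checks that the control flow forms a DAG (Directed Acyclic Graph), ensuring there are no loops;

  • Step two analyzes the program instruction by instruction, tracking the effect of each instruction on the registers and the stack, and checks for out-of-bounds behavior in the worst case (out-of-bounds variable access, or too many executed instructions). One optimization speeds this up: pruning. If the state of the current branch at the current instruction is equal to, or a subset of, the state of another branch already analyzed at this instruction, the current branch need not be analyzed, because it is known to satisfy the rules.

The overall BPF development workflow is roughly as shown in the figure below:

[Figure: BPF development workflow, 185efe8a-93ab-11ec-952b-dac502259ad0.png]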

1. The bpf() system call

The core code lives in the bpf() system call, so we start the analysis from its entry point.

SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
  union bpf_attr attr = {};
  int err;

  if (!capable(CAP_SYS_ADMIN) && sysctl_unprivileged_bpf_disabled)
    return -EPERM;

  if (!access_ok(VERIFY_READ, uattr, 1))
    return -EFAULT;

  if (size > PAGE_SIZE)  /* silly large */
    return -E2BIG;

  /* If we're handed a bigger struct than we know of,
   * ensure all the unknown bits are 0 - i.e. new
   * user-space does not rely on any kernel feature
   * extensions we dont know about yet.
   */
  if (size > sizeof(attr)) {
    unsigned char __user *addr;
    unsigned char __user *end;
    unsigned char val;

    addr = (void __user *)uattr + sizeof(attr);
    end  = (void __user *)uattr + size;

    for (; addr < end; addr++) {
      err = get_user(val, addr);
      if (err)
        return err;
      if (val)
        return -E2BIG;
    }
    size = sizeof(attr);
  }

  /* copy attributes from user space, may be less than sizeof(bpf_attr) */
  if (copy_from_user(&attr, uattr, size) != 0)
    return -EFAULT;

  switch (cmd) {
  case BPF_MAP_CREATE:
    err = map_create(&attr);
    break;
  case BPF_MAP_LOOKUP_ELEM:
    err = map_lookup_elem(&attr);
    break;
  case BPF_MAP_UPDATE_ELEM:
    err = map_update_elem(&attr);
    break;
  case BPF_MAP_DELETE_ELEM:
    err = map_delete_elem(&attr);
    break;
  case BPF_MAP_GET_NEXT_KEY:
    err = map_get_next_key(&attr);
    break;
  case BPF_PROG_LOAD:
    err = bpf_prog_load(&attr);
    break;
  case BPF_OBJ_PIN:
    err = bpf_obj_pin(&attr);
    break;
  case BPF_OBJ_GET:
    err = bpf_obj_get(&attr);
    break;

#ifdef CONFIG_CGROUP_BPF
  case BPF_PROG_ATTACH:
    err = bpf_prog_attach(&attr);
    break;
  case BPF_PROG_DETACH:
    err = bpf_prog_detach(&attr);
    break;
#endif

  default:
    err = -EINVAL;
    break;
  }

  return err;
}
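
The size handling at the top of the system call is easy to miss: a caller may pass an attribute structure larger than the one the kernel knows, as long as every extra byte is zero. The following user-space sketch imitates only that check. The names struct known_attr and copy_attr_checked() are hypothetical stand-ins invented for illustration; they are not kernel APIs, and plain memory copies replace get_user()/copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for union bpf_attr: the layout "this kernel" knows. */
struct known_attr {
  unsigned int a;
  unsigned int b;
};

/* Copy a caller-supplied blob of 'size' bytes into 'out'.
 * Returns 0 on success, or -E2BIG when the caller set bytes that lie
 * beyond the fields this side understands.
 */
static int copy_attr_checked(struct known_attr *out, const void *uattr, size_t size)
{
  const unsigned char *bytes = uattr;
  size_t i;

  assert(out != NULL && uattr != NULL);

  /* Fields the caller did not supply stay zero, like "attr = {}" above. */
  memset(out, 0, sizeof(*out));

  if (size > sizeof(*out)) {
    /* A newer caller passed a bigger struct: every unknown byte must be 0. */
    for (i = sizeof(*out); i < size; i++)
      if (bytes[i] != 0)
        return -E2BIG;
    size = sizeof(*out);
  }

  memcpy(out, bytes, size);
  return 0;
}
```

A caller built against a newer header can therefore still talk to this code as long as it leaves the new fields zeroed, and an older caller that passes a shorter structure gets zeros for the fields it does not know about.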

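1.1. Loading a BPF program

The BPF_PROG_LOAD command loads a BPF program into the kernel:

  • copy the program into the kernel;

  • verify its safety;

  • JIT-compile it if possible;

  • then allocate a file descriptor (fd) for it.

Once all of this is done, the BPF program is later attached to the hook where it should run.

1.1.1. BPF memory allocation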

static int bpf_prog_load(union bpf_attr *attr)
{
  enum bpf_prog_type type = attr->prog_type;
  struct bpf_prog *prog;
  int err;
  char license[128];
  bool is_gpl;

  if (CHECK_ATTR(BPF_PROG_LOAD))
    return -EINVAL;

  /* copy eBPF program license from user space */
  /* (1.1) Copy the license string from user space into the kernel,
   *       using the address in attr->license
   */
  if (strncpy_from_user(license, u64_to_ptr(attr->license),
            sizeof(license) - 1) < 0)
    return -EFAULT;
  license[sizeof(license) - 1] = 0;

  /* eBPF programs must be GPL compatible to use GPL-ed functions */
  /* (1.2) Check whether the license is GPL compatible */
  is_gpl = license_is_gpl_compatible(license);

  /* (1.3) Reject the program if its instruction count reaches BPF_MAXINSNS (4K) */
  if (attr->insn_cnt >= BPF_MAXINSNS)
    return -EINVAL;

  /* (1.4) A BPF_PROG_TYPE_KPROBE program must declare a kernel version that
   *       matches the running kernel; otherwise, because of kernel changes,
   *       it could be attached to the wrong address.
   */
  if (type == BPF_PROG_TYPE_KPROBE &&
      attr->kern_version != LINUX_VERSION_CODE)
    return -EINVAL;

  /* (1.5) Loading any program type other than BPF_PROG_TYPE_SOCKET_FILTER
   *       and BPF_PROG_TYPE_CGROUP_SKB requires administrator privileges
   */
  if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
      type != BPF_PROG_TYPE_CGROUP_SKB &&
      !capable(CAP_SYS_ADMIN))
    return -EPERM;

  /* plain bpf_prog allocation */
  /* (2.1) Allocate the bpf_prog and bpf_prog->aux, sized by the instruction count */
  prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
  if (!prog)
    return -ENOMEM;

  /* (2.2) Charge the whole bpf_prog against the current process's memlock limit */
  err = bpf_prog_charge_memlock(prog);
  if (err)
    goto free_prog_nouncharge;

  prog->len = attr->insn_cnt;

  err = -EFAULT;
  /* (2.3) Copy the BPF code from the user-space address attr->insns
   *       to the kernel address prog->insns
   */
  if (copy_from_user(prog->insns, u64_to_ptr(attr->insns),
         prog->len * sizeof(struct bpf_insn)) != 0)
    goto free_prog;

  prog->orig_prog = NULL;
  prog->jited = 0;

  atomic_set(&prog->aux->refcnt, 1);
  prog->gpl_compatible = is_gpl ? 1 : 0;

  /* find program type: socket_filter vs tracing_filter */
  /* (2.4) Look up the bpf_prog_types entry for attr->prog_type and assign
   *       bpf_prog->aux->ops, which is a table of operation callbacks
   */
  err = find_prog_type(type, prog);
  if (err < 0)
    goto free_prog;

  /* run eBPF verifier */
  /* (3) Run the verifier to check that the BPF program is legal */
  err = bpf_check(&prog, attr);
  if (err < 0)
    goto free_used_maps;

  /* eBPF program is ready to be JITed */
  /* (4) Try to JIT-compile the BPF program */
  prog = bpf_prog_select_runtime(prog, &err);
  if (err < 0)
    goto free_used_maps;

  /* (5) Allocate a file descriptor (fd) for the BPF program */
  err = bpf_prog_new_fd(prog);
  if (err < 0)
    /* failed to allocate fd */
    goto free_used_maps;

  return err;

free_used_maps:
  free_used_maps(prog->aux);
free_prog:
  bpf_prog_uncharge_memlock(prog);
free_prog_nouncharge:
  bpf_prog_free(prog);
  return err;
}

An important data structure for BPF here is struct bpf_prog:

struct bpf_prog {
  u16      pages;    /* Number of allocated pages */
  kmemcheck_bitfield_begin(meta);
  u16      jited:1,  /* Is our filter JIT'ed? */
        gpl_compatible:1, /* Is filter GPL compatible? */
        cb_access:1,  /* Is control block accessed? */
        dst_needed:1;  /* Do we need dst entry? */
  kmemcheck_bitfield_end(meta);
  u32      len;    /* Number of filter blocks */
  enum bpf_prog_type  type;    /* Type of BPF program */
  struct bpf_prog_aux  *aux;    /* Auxiliary fields */
  struct sock_fprog_kern  *orig_prog;  /* Original BPF program */
  unsigned int    (*bpf_func)(const struct sk_buff *skb,
              const struct bpf_insn *filter);
  /* Instructions for interpreter */
  union {
    struct sock_filter  insns[0];
    struct bpf_insn    insnsi[0];
  };
};

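The important members are:

  • len: the number of BPF instructions in the program;

  • type: the type of this BPF program (kprobe/tracepoint/perf_event/sk_filter/sched_cls/sched_act/xdp/cg_skb);

  • aux: auxiliary data, used mainly to support verification and conversion by the verifier;

  • orig_prog: the original BPF program (per the comment in the structure);

  • bpf_func: the entry point of the BPF program at run time. If JIT compilation succeeds, it points to the JIT-compiled image of the program; otherwise it points to the generic entry of the kernel interpreter, __bpf_prog_run();

  • insnsi[]: storage for the program's original BPF instructions, copied from user space.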
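
The role of bpf_func can be pictured as a function pointer that is chosen once at load time and then called blindly at run time. The sketch below is a toy model under stated assumptions: struct toy_prog, toy_interpreter(), toy_native_image(), and toy_select_runtime() are hypothetical names, and the "native image" is just a second C function standing in for JIT output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins; not the kernel's types. */
struct toy_insn {
  int imm;
};

struct toy_prog {
  int jited;                  /* 1 when the "native image" is installed */
  unsigned int len;           /* number of instructions */
  unsigned int (*run)(const void *ctx, const struct toy_insn *insns);
  struct toy_insn insns[4];
};

/* Generic fallback: "interprets" the program by returning insns[0].imm. */
static unsigned int toy_interpreter(const void *ctx, const struct toy_insn *insns)
{
  (void)ctx;
  return (unsigned int)insns[0].imm;
}

/* Stand-in for a JIT-compiled image with the same observable behavior. */
static unsigned int toy_native_image(const void *ctx, const struct toy_insn *insns)
{
  (void)ctx;
  return (unsigned int)insns[0].imm;
}

/* Pick the entry point once, at load time. */
static void toy_select_runtime(struct toy_prog *prog, int jit_available)
{
  prog->run = toy_interpreter;
  prog->jited = 0;
  if (jit_available) {
    prog->run = toy_native_image;
    prog->jited = 1;
  }
}

/* Callers never care which implementation is behind the pointer. */
static unsigned int toy_prog_run(const struct toy_prog *prog, const void *ctx)
{
  assert(prog->run != NULL);
  return prog->run(ctx, prog->insns);
}
```

The design point is that the hook which runs the program only dereferences one pointer; whether the JIT succeeded is decided earlier and recorded in the jited flag.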

1.1.2. BPF verifier

The verifier's steps and rules are described in detail in the "eBPF verifier" section of "3.1, Berkeley Packet Filter (BPF) (Kernel Document)".

In addition, the beginning of kernel/bpf/verifier.c carries a detailed comment about the eBPF verifier:

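bpf_check() is a static code analyzer. It walks the eBPF program instruction by instruction and updates the register/stack state, analyzing every path of every conditional branch until it reaches a 'bpf_exit' instruction.

1. The first pass is a depth-first search that checks the program is a DAG (Directed Acyclic Graph). It rejects programs that:
 - have more than BPF_MAXINSNS instructions (BPF_MAXINSNS = 4096)
 - contain a loop (detected via a back-edge)
 - contain unreachable instructions (the graph must not be a forest; one program equals one function)
 - contain out-of-bounds or malformed jumps

2. The second pass descends every possible path starting from the first instruction.
 - Because it analyzes all paths through the program, the length of the analysis is limited to 32K instructions. This limit can be hit even when the program has fewer than 4K instructions, if too many branches change the stack/registers.
 - The number of branches to be analyzed is limited to 1K.

On entry to each instruction every register has a type, and the instruction changes the register types according to its semantics:

- rule 1: if the instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), the type of R5 is copied to R1.

All registers are 64-bit:
 * R0: return register
 * R1-R5: argument-passing registers
 * R6-R9: callee-saved registers
 * R10: frame pointer, read-only

- rule 2: at the start of a BPF program, register R1 holds a pointer to bpf_context and has type PTR_TO_CTX.

- rule 3: the verifier tracks arithmetic on pointers:

    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),

The first instruction copies the type of R10 (FRAME_PTR) into R1. The second, arithmetic instruction is pattern-matched as an attempt to build a pointer to some element on the stack, so after it R1 has type PTR_TO_STACK (and the constant -20 is saved for later stack bounds checking). That is, the register is a pointer to the stack plus a known immediate constant.

- rule 4: most of the time registers have type UNKNOWN_VALUE, meaning the register holds some value that is not a valid pointer (for example, pointer plus pointer becomes UNKNOWN_VALUE).

- rule 5: when the verifier sees a load or store instruction, the base register may have type PTR_TO_MAP_VALUE, PTR_TO_CTX, or FRAME_PTR. These are the three pointer types recognized by check_mem_access().

- rule 6: PTR_TO_MAP_VALUE means the register points to a 'map element value', and the range [ptr, ptr + map value_size) is accessible.

- rule 7: registers used to pass arguments to function calls are checked against the function's argument constraints.

ARG_PTR_TO_MAP_KEY is one such argument constraint. It means the register passed to the function must have type PTR_TO_STACK and will be used inside the function as a 'pointer to map element key'.

For example, the argument constraints of bpf_map_lookup_elem() are:

   .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
   .arg1_type = ARG_CONST_MAP_PTR,
   .arg2_type = ARG_PTR_TO_MAP_KEY,

ret_type says the function returns a 'pointer to map element value, or NULL'. The function expects its first argument to be a const pointer to 'struct bpf_map' and its second argument to be a pointer to the stack, which the helper uses as the pointer to the map element key.

On the kernel side the helper function looks like this:

 u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 {
    struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
    void *key = (void *) (unsigned long) r2;
    void *value;

    here kernel can access 'key' and 'map' pointers safely, knowing that
    [key, key + map->key_size) bytes are valid and were initialized on
    the stack of eBPF program.
 }

The corresponding eBPF program may look like this:

    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
    BPF_LD_MAP_FD(BPF_REG_1, map_fd),      // after this insn R1 type is CONST_PTR_TO_MAP
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),

Here the verifier looks at the prototype of map_lookup_elem() and sees:

- .arg1_type == ARG_CONST_MAP_PTR and R1->type == CONST_PTR_TO_MAP, which is fine. The verifier now knows the size of the map key: R1->map_ptr->key_size.

- Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, which is also fine. The verifier now checks that [R2, R2 + map's key_size) lies within the stack limits and was initialized before the call.

- If so, the verifier allows this BPF_CALL instruction and looks at .ret_type, which is RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets R0->type = PTR_TO_MAP_VALUE_OR_NULL, meaning bpf_map_lookup_elem() returns either a pointer to a map value or NULL. When a PTR_TO_MAP_VALUE_OR_NULL register passes through an 'if (reg != 0) goto +off' instruction, the register holding the pointer changes state to PTR_TO_MAP_VALUE in the true branch and to CONST_IMM in the false branch; see check_cond_jmp_op(). After the call, R0 is set to the function's return type and registers R1-R5 are set to NOT_INIT to indicate that they are no longer readable.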

The original text reads:

/* bpf_check() is a static code analyzer that walks eBPF program
 * instruction by instruction and updates register/stack state.
 * All paths of conditional branches are analyzed until 'bpf_exit' insn.
 *
 * The first pass is depth-first-search to check that the program is a DAG.
 * It rejects the following programs:
 * - larger than BPF_MAXINSNS insns
 * - if loop is present (detected via back-edge)
 * - unreachable insns exist (shouldn't be a forest. program = one function)
 * - out of bounds or malformed jumps
 * The second pass is all possible path descent from the 1st insn.
 * Since it's analyzing all pathes through the program, the length of the
 * analysis is limited to 32k insn, which may be hit even if total number of
 * insn is less then 4K, but there are too many branches that change stack/regs.
 * Number of 'branches to be analyzed' is limited to 1k
 *
 * On entry to each instruction, each register has a type, and the instruction
 * changes the types of the registers depending on instruction semantics.
 * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
 * copied to R1.
 *
 * All registers are 64-bit.
 * R0 - return register
 * R1-R5 argument passing registers
 * R6-R9 callee saved registers
 * R10 - frame pointer read-only
 *
 * At the start of BPF program the register R1 contains a pointer to bpf_context
 * and has type PTR_TO_CTX.
 *
 * Verifier tracks arithmetic operations on pointers in case:
 *    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
 *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
 * 1st insn copies R10 (which has FRAME_PTR) type into R1
 * and 2nd arithmetic instruction is pattern matched to recognize
 * that it wants to construct a pointer to some element within stack.
 * So after 2nd insn, the register R1 has type PTR_TO_STACK
 * (and -20 constant is saved for further stack bounds checking).
 * Meaning that this reg is a pointer to stack plus known immediate constant.
 *
 * Most of the time the registers have UNKNOWN_VALUE type, which
 * means the register has some value, but it's not a valid pointer.
 * (like pointer plus pointer becomes UNKNOWN_VALUE type)
 *
 * When verifier sees load or store instructions the type of base register
 * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, FRAME_PTR. These are three pointer
 * types recognized by check_mem_access() function.
 *
 * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
 * and the range of [ptr, ptr + map's value_size) is accessible.
 *
 * registers used to pass values to function calls are checked against
 * function argument constraints.
 *
 * ARG_PTR_TO_MAP_KEY is one of such argument constraints.
 * It means that the register type passed to this function must be
 * PTR_TO_STACK and it will be used inside the function as
 * 'pointer to map element key'
 *
 * For example the argument constraints for bpf_map_lookup_elem():
 *   .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
 *   .arg1_type = ARG_CONST_MAP_PTR,
 *   .arg2_type = ARG_PTR_TO_MAP_KEY,
 *
 * ret_type says that this function returns 'pointer to map elem value or null'
 * function expects 1st argument to be a const pointer to 'struct bpf_map' and
 * 2nd argument should be a pointer to stack, which will be used inside
 * the helper function as a pointer to map element key.
 *
 * On the kernel side the helper function looks like:
 * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 * {
 *    struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
 *    void *key = (void *) (unsigned long) r2;
 *    void *value;
 *
 *    here kernel can access 'key' and 'map' pointers safely, knowing that
 *    [key, key + map->key_size) bytes are valid and were initialized on
 *    the stack of eBPF program.
 * }
 *
 * Corresponding eBPF program may look like:
 *    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
 *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
 *    BPF_LD_MAP_FD(BPF_REG_1, map_fd),      // after this insn R1 type is CONST_PTR_TO_MAP
 *    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 * here verifier looks at prototype of map_lookup_elem() and sees:
 * .arg1_type == ARG_CONST_MAP_PTR and R1->type == CONST_PTR_TO_MAP, which is ok,
 * Now verifier knows that this map has key of R1->map_ptr->key_size bytes
 *
 * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
 * Now verifier checks that [R2, R2 + map's key_size) are within stack limits
 * and were initialized prior to this call.
 * If it's ok, then verifier allows this BPF_CALL insn and looks at
 * .ret_type which is RET_PTR_TO_MAP_VALUE_OR_NULL, so it sets
 * R0->type = PTR_TO_MAP_VALUE_OR_NULL which means bpf_map_lookup_elem() function
 * returns ether pointer to map value or NULL.
 *
 * When type PTR_TO_MAP_VALUE_OR_NULL passes through 'if (reg != 0) goto +off'
 * insn, the register holding that pointer in the true branch changes state to
 * PTR_TO_MAP_VALUE and the same register changes state to CONST_IMM in the false
 * branch. See check_cond_jmp_op().
 *
 * After the call R0 is set to return type of the function and registers R1-R5
 * are set to NOT_INIT to indicate that they are no longer readable.
 */
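
The register-type rules in the comment can be made concrete with a toy model. The sketch below covers only two of them: a register move copies the source type, and adding an immediate to a frame pointer yields a stack pointer with the offset remembered. All toy_ and TOY_ names are hypothetical, and treating every other addition as UNKNOWN_VALUE is a simplifying assumption of this sketch, not the kernel's full behavior.

```c
#include <assert.h>

#define TOY_NREGS 11  /* R0..R10 */

enum toy_reg_type {
  TOY_NOT_INIT,
  TOY_UNKNOWN_VALUE,
  TOY_PTR_TO_CTX,
  TOY_FRAME_PTR,
  TOY_PTR_TO_STACK
};

struct toy_reg {
  enum toy_reg_type type;
  int imm;              /* stack offset when type is TOY_PTR_TO_STACK */
};

/* Program entry: R1 holds the context pointer, R10 the frame pointer. */
static void toy_init_regs(struct toy_reg *regs)
{
  int i;

  for (i = 0; i < TOY_NREGS; i++) {
    regs[i].type = TOY_NOT_INIT;
    regs[i].imm = 0;
  }
  regs[1].type = TOY_PTR_TO_CTX;
  regs[10].type = TOY_FRAME_PTR;
}

/* dst = src: the type of src is copied to dst. Returns -1 on a violation. */
static int toy_mov_reg(struct toy_reg *regs, int dst, int src)
{
  if (regs[src].type == TOY_NOT_INIT)
    return -1;          /* reading an uninitialized register */
  if (dst == 10)
    return -1;          /* the frame pointer is read-only */
  regs[dst] = regs[src];
  return 0;
}

/* dst += imm: FRAME_PTR plus a constant becomes PTR_TO_STACK. */
static int toy_add_imm(struct toy_reg *regs, int dst, int imm)
{
  if (dst == 10 || regs[dst].type == TOY_NOT_INIT)
    return -1;
  if (regs[dst].type == TOY_FRAME_PTR) {
    regs[dst].type = TOY_PTR_TO_STACK;
    regs[dst].imm = imm;  /* kept for later stack bounds checks */
  } else {
    regs[dst].type = TOY_UNKNOWN_VALUE;  /* simplification in this sketch */
    regs[dst].imm = 0;
  }
  return 0;
}
```

Running toy_mov_reg(regs, 1, 10) followed by toy_add_imm(regs, 1, -20) reproduces the two-instruction example from the comment: R1 ends up as a stack pointer with offset -20.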

The overall code flow of the BPF verifier is as follows:

int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
{
  char __user *log_ubuf = NULL;
  struct bpf_verifier_env *env;
  int ret = -EINVAL;

  if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
    return -E2BIG;

  /* 'struct bpf_verifier_env' can be global, but since it's not small,
   * allocate/free it every time bpf_check() is called
   */
  /* (3.1) Allocate the data structure the verifier needs for its static scan */
  env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
  if (!env)
    return -ENOMEM;

  env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) *
             (*prog)->len);
  ret = -ENOMEM;
  if (!env->insn_aux_data)
    goto err_free_env;
  env->prog = *prog;

  /* grab the mutex to protect few globals used by verifier */
  mutex_lock(&bpf_verifier_lock);

  /* (3.2) If the user supplied attr->log_buf, the user wants the detailed
   *       scan log, which is very useful when loading fails. The log is
   *       first allocated in the kernel and copied to the user on return.
   */
  if (attr->log_level || attr->log_buf || attr->log_size) {
    /* user requested verbose verifier output
     * and supplied buffer to store the verification trace
     */
    log_level = attr->log_level;
    log_ubuf = (char __user *) (unsigned long) attr->log_buf;
    log_size = attr->log_size;
    log_len = 0;

    ret = -EINVAL;
    /* log_* values have to be sane */
    if (log_size < 128 || log_size > UINT_MAX >> 8 ||
        log_level == 0 || log_ubuf == NULL)
      goto err_unlock;

    ret = -ENOMEM;
    log_buf = vmalloc(log_size);
    if (!log_buf)
      goto err_unlock;
  } else {
    log_level = 0;
  }

  /* (3.3) In the instructions that operate on maps, replace each map_fd with
   *       the actual map pointer. This shows that the user-space loader must
   *       first call bpf() to create the maps defined in __section("maps"),
   *       and only then load the other program sections.
   */
  ret = replace_map_fd_with_map_ptr(env);
  if (ret < 0)
    goto skip_full_check;

  env->explored_states = kcalloc(env->prog->len,
               sizeof(struct bpf_verifier_state_list *),
               GFP_USER);
  ret = -ENOMEM;
  if (!env->explored_states)
    goto skip_full_check;

  /* (3.4) step 1: check for loops */
  ret = check_cfg(env);
  if (ret < 0)
    goto skip_full_check;

  env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);

  /* (3.5) step 2: scan the execution of the BPF code in detail, tracking
   *       registers and the stack, and check for any rule violations
   */
  ret = do_check(env);

skip_full_check:
  while (pop_stack(env, NULL) >= 0);
  free_states(env);

  /* (3.6) Turn the dead code found by the scan (code that never runs) into nop instructions */
  if (ret == 0)
    sanitize_dead_code(env);

  /* (3.7) Convert accesses to members of the ctx pointer, according to the program type */
  if (ret == 0)
    /* program is valid, convert *(u32*)(ctx + off) accesses */
    ret = convert_ctx_accesses(env);

  /* (3.8) Fix up calls to kernel helper functions, replacing each function
   *       number with the actual function pointer
   */
  if (ret == 0)
    ret = fixup_bpf_calls(env);

  if (log_level && log_len >= log_size - 1) {
    BUG_ON(log_len >= log_size);
    /* verifier log exceeded user supplied buffer */
    ret = -ENOSPC;
    /* fall through to return what was recorded */
  }

  /* (3.9) Copy the verifier log to user space */
  /* copy verifier log back to user space including trailing zero */
  if (log_level && copy_to_user(log_ubuf, log_buf, log_len + 1) != 0) {
    ret = -EFAULT;
    goto free_log_buf;
  }

  /* (3.10) Save the program's map references into prog->aux->used_maps */
  if (ret == 0 && env->used_map_cnt) {
    /* if program passed verifier, update used_maps in bpf_prog_info */
    env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
                sizeof(env->used_maps[0]),
                GFP_KERNEL);

    if (!env->prog->aux->used_maps) {
      ret = -ENOMEM;
      goto free_log_buf;
    }

    memcpy(env->prog->aux->used_maps, env->used_maps,
           sizeof(env->used_maps[0]) * env->used_map_cnt);
    env->prog->aux->used_map_cnt = env->used_map_cnt;

    /* program is valid. Convert pseudo bpf_ld_imm64 into generic
     * bpf_ld_imm64 instructions
     */
    convert_pseudo_ld_imm64(env);
  }

free_log_buf:
  if (log_level)
    vfree(log_buf);
  if (!env->prog->aux->used_maps)
    /* if we didn't copy map pointers into bpf_prog_info, release
     * them now. Otherwise free_bpf_prog_info() will release them.
     */
    release_maps(env);
  *prog = env->prog;
err_unlock:
  mutex_unlock(&bpf_verifier_lock);
  vfree(env->insn_aux_data);
err_free_env:
  kfree(env);
  return ret;
}

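  • 1. In the instructions that operate on maps, replace each map_fd with the actual map pointer.

This shows that the user-space loader must first call bpf() to create the maps defined in __section("maps"), and only then load the other program sections.

An instruction that satisfies (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) && (insn[0].src_reg == BPF_PSEUDO_MAP_FD) is a map pointer load instruction.

Its original immediate is treated as an fd and used to find the corresponding map pointer.

The 64-bit map pointer is then split into two 32-bit immediates, stored in insn[0].imm and insn[1].imm.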

static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env)
{
  struct bpf_insn *insn = env->prog->insnsi;
  int insn_cnt = env->prog->len;
  int i, j, err;

  /* (3.3.1) Walk all BPF instructions */
  for (i = 0; i < insn_cnt; i++, insn++) {
    if (BPF_CLASS(insn->code) == BPF_LDX &&
        (BPF_MODE(insn->code) != BPF_MEM || insn->imm != 0)) {
      verbose("BPF_LDX uses reserved fields\n");
      return -EINVAL;
    }

    if (BPF_CLASS(insn->code) == BPF_STX &&
        ((BPF_MODE(insn->code) != BPF_MEM &&
          BPF_MODE(insn->code) != BPF_XADD) || insn->imm != 0)) {
      verbose("BPF_STX uses reserved fields\n");
      return -EINVAL;
    }

    /* (3.3.2) An instruction with
     *         (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) &&
     *         (insn[0].src_reg == BPF_PSEUDO_MAP_FD)
     *         is a map pointer load instruction
     */
    if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
      struct bpf_map *map;
      struct fd f;

      if (i == insn_cnt - 1 || insn[1].code != 0 ||
          insn[1].dst_reg != 0 || insn[1].src_reg != 0 ||
          insn[1].off != 0) {
        verbose("invalid bpf_ld_imm64 insn\n");
        return -EINVAL;
      }

      if (insn->src_reg == 0)
        /* valid generic load 64-bit imm */
        goto next_insn;

      if (insn->src_reg != BPF_PSEUDO_MAP_FD) {
        verbose("unrecognized bpf_ld_imm64 insn\n");
        return -EINVAL;
      }

      /* (3.3.3) Use the fd given by the immediate insn[0].imm to get the actual map pointer */
      f = fdget(insn->imm);
      map = __bpf_map_get(f);
      if (IS_ERR(map)) {
        verbose("fd %d is not pointing to valid bpf_map\n",
          insn->imm);
        return PTR_ERR(map);
      }

      /* (3.3.4) Check that the map is compatible with this type of BPF program */
      err = check_map_prog_compatibility(map, env->prog);
      if (err) {
        fdput(f);
        return err;
      }

      /* (3.3.5) Split the 64-bit map pointer into two 32-bit immediates,
       *         stored in insn[0].imm and insn[1].imm
       */
      /* store map pointer inside BPF_LD_IMM64 instruction */
      insn[0].imm = (u32) (unsigned long) map;
      insn[1].imm = ((u64) (unsigned long) map) >> 32;

      /* check whether we recorded this map already */
      for (j = 0; j < env->used_map_cnt; j++)
        if (env->used_maps[j] == map) {
          fdput(f);
          goto next_insn;
        }

      /* (3.3.6) A program may reference at most 64 maps */
      if (env->used_map_cnt >= MAX_USED_MAPS) {
        fdput(f);
        return -E2BIG;
      }

      /* hold the map. If the program is rejected by verifier,
       * the map will be released by release_maps() or it
       * will be used by the valid program until it's unloaded
       * and all maps are released in free_bpf_prog_info()
       */
      map = bpf_map_inc(map, false);
      if (IS_ERR(map)) {
        fdput(f);
        return PTR_ERR(map);
      }
      /* (3.3.7) Record the program's reference to the map */
      env->used_maps[env->used_map_cnt++] = map;

      fdput(f);
next_insn:
      insn++;
      i++;
    }
  }

  /* now all pseudo BPF_LD_IMM64 instructions load valid
   * 'struct bpf_map *' into a register instead of user map_fd.
   * These pointers will be used later by verifier to validate map access.
   */
  return 0;
}
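
The pointer split in step (3.3.5) is plain bit arithmetic and can be tried in user space. In this sketch, struct toy_insn, toy_store_ptr(), and toy_load_ptr() are hypothetical names; the imm field is declared unsigned here (the kernel's field is a signed 32-bit value) purely to keep the example free of implementation-defined conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one half of a BPF_LD_IMM64 instruction pair. */
struct toy_insn {
  uint32_t imm;
};

/* Store a pointer across two consecutive instructions: low half, then high half. */
static void toy_store_ptr(struct toy_insn insn[2], const void *ptr)
{
  uint64_t v = (uint64_t)(uintptr_t)ptr;

  insn[0].imm = (uint32_t)v;          /* low 32 bits */
  insn[1].imm = (uint32_t)(v >> 32);  /* high 32 bits (0 on 32-bit systems) */
}

/* Reassemble the pointer, as whoever executes the instruction pair must. */
static const void *toy_load_ptr(const struct toy_insn insn[2])
{
  uint64_t lo = insn[0].imm;
  uint64_t hi = insn[1].imm;

  assert(sizeof(void *) <= sizeof(uint64_t));
  return (const void *)(uintptr_t)(lo | (hi << 32));
}
```

Two instruction slots are needed because a single immediate is only 32 bits wide, which is also why the kernel code above insists that insn[1] exists and has all its other fields zeroed.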

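  • 2. Step 1: walk the BPF program's code paths with a depth-first search (DFS) and check that the control flow is a DAG (Directed Acyclic Graph), ensuring there are no loops.

For DFS on a DAG, see the article "Graph". The most important concepts are shown in the figure below:

[Figure: key DFS concepts, 18792e40-93ab-11ec-952b-dac502259ad0.png]

After a DFS traversal of a graph, each root node yields a tree (a "DFS tree"), and the trees from multiple roots form a forest (a "DFS forest"). Based on the search result, the edges of the graph fall into four classes:

  • Tree edges: edges that are part of a DFS tree;

  • Back edges: edges from a node to one of its ancestors (these form a cycle);

  • Forward edges: edges from a node to a descendant that is not its direct child (skipping over child nodes);

  • Cross edges: edges between nodes where neither is an ancestor of the other, within one tree or between trees.

The BPF verifier checks whether the program's execution path graph contains any back edges, to ensure the program has no loops.

The code is as follows: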

static int check_cfg(struct bpf_verifier_env *env)
{
  struct bpf_insn *insns = env->prog->insnsi;
  int insn_cnt = env->prog->len;
  int ret = 0;
  int i, t;

  insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
  if (!insn_state)
    return -ENOMEM;

  insn_stack = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
  if (!insn_stack) {
    kfree(insn_state);
    return -ENOMEM;
  }

  insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */
  insn_stack[0] = 0; /* 0 is the first instruction */
  cur_stack = 1;

  /* (3.4.1) Main loop of the depth-first search */
peek_stack:
  if (cur_stack == 0)
    goto check_state;
  t = insn_stack[cur_stack - 1];

  /* (3.4.2) Branch instructions */
  if (BPF_CLASS(insns[t].code) == BPF_JMP) {
    u8 opcode = BPF_OP(insns[t].code);

    /* (3.4.2.1) BPF_EXIT: the path ends here, start backtracking */
    if (opcode == BPF_EXIT) {
      goto mark_explored;

    /* (3.4.2.2) BPF_CALL: keep exploring, and set env->explored_states[t + 1]
     *           to STATE_LIST_MARK so that register and stack state is
     *           tracked again after the call
     */
    } else if (opcode == BPF_CALL) {
      ret = push_insn(t, t + 1, FALLTHROUGH, env);
      if (ret == 1)
        goto peek_stack;
      else if (ret < 0)
        goto err_free;
      if (t + 1 < insn_cnt)
        env->explored_states[t + 1] = STATE_LIST_MARK;

    /* (3.4.2.3) BPF_JA: keep exploring at the jump target, and set
     *           env->explored_states[t + 1] to STATE_LIST_MARK so that
     *           register and stack state is tracked again after the jump
     */
    } else if (opcode == BPF_JA) {
      if (BPF_SRC(insns[t].code) != BPF_K) {
        ret = -EINVAL;
        goto err_free;
      }
      /* unconditional jump with single edge */
      ret = push_insn(t, t + insns[t].off + 1,
          FALLTHROUGH, env);
      if (ret == 1)
        goto peek_stack;
      else if (ret < 0)
        goto err_free;
      /* tell verifier to check for equivalent states
       * after every call and jump
       */
      if (t + 1 < insn_cnt)
        env->explored_states[t + 1] = STATE_LIST_MARK;

    /* (3.4.2.4) Everything else is a conditional jump: explore the
     *           condition-false path first, then the condition-true path,
     *           and set env->explored_states[t] to STATE_LIST_MARK so that
     *           register and stack state is tracked at this instruction
     */
    } else {
      /* conditional jump with two edges */
      env->explored_states[t] = STATE_LIST_MARK;

      /* condition-false path */
      ret = push_insn(t, t + 1, FALLTHROUGH, env);
      if (ret == 1)
        goto peek_stack;
      else if (ret < 0)
        goto err_free;

      /* condition-true path */
      ret = push_insn(t, t + insns[t].off + 1, BRANCH, env);
      if (ret == 1)
        goto peek_stack;
      else if (ret < 0)
        goto err_free;
    }

  /* (3.4.3) Non-branch instructions */
  } else {
    /* all other non-branch instructions with single
     * fall-through edge
     */
    ret = push_insn(t, t + 1, FALLTHROUGH, env);
    /* (3.4.3.1) Meaning of ret:
     *   ret == 1: a new instruction was pushed, keep exploring the path
     *   ret == 0: nothing new along this edge, fall through to
     *             mark_explored to confirm and backtrack
     *   ret < 0:  a "back-edge" loop was detected, or another error occurred
     */
    if (ret == 1)
      goto peek_stack;
    else if (ret < 0)
      goto err_free;
  }

  /* (3.4.4) Confirm and backtrack: mark the state EXPLORED */
mark_explored:
  insn_state[t] = EXPLORED;
  if (cur_stack-- <= 0) {
    verbose("pop stack internal bug\n");
    ret = -EFAULT;
    goto err_free;
  }
  goto peek_stack;

  /* (3.4.5) Make sure there are no unreachable instructions, i.e. ones no path arrives at */
check_state:
  for (i = 0; i < insn_cnt; i++) {
    if (insn_state[i] != EXPLORED) {
      verbose("unreachable insn %d\n", i);
      ret = -EINVAL;
      goto err_free;
    }
  }
  ret = 0; /* cfg looks good */

err_free:
  kfree(insn_state);
  kfree(insn_stack);
  return ret;
}
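
The same idea can be shown on a tiny graph without any kernel types. The sketch below is a minimal iterative DFS in the spirit of check_cfg(), assuming each node has at most two successors (fall-through and branch target). The names toy_node and toy_check_cfg() and the TOY_ states are hypothetical, and the kernel's DISCOVERED | FALLTHROUGH / BRANCH edge bookkeeping is replaced by a simple per-node edge counter.

```c
#include <assert.h>

#define TOY_MAX_NODES 8

enum { TOY_UNSEEN = 0, TOY_DISCOVERED, TOY_EXPLORED };

/* Up to two successors per node; -1 means "no edge". */
struct toy_node {
  int succ[2];
};

/* Returns 0 when every node is reachable from node 0 and no back-edge
 * exists; returns -1 for a loop, an out-of-range jump, or dead code.
 */
static int toy_check_cfg(const struct toy_node *nodes, int n)
{
  int state[TOY_MAX_NODES] = {0};
  int next_edge[TOY_MAX_NODES] = {0};
  int stack[TOY_MAX_NODES];
  int top = 0;
  int i;

  if (n <= 0 || n > TOY_MAX_NODES)
    return -1;

  state[0] = TOY_DISCOVERED;
  stack[top++] = 0;

  while (top > 0) {
    int t = stack[top - 1];
    int pushed = 0;

    while (next_edge[t] < 2) {
      int w = nodes[t].succ[next_edge[t]++];

      if (w == -1)
        continue;                   /* no edge in this slot */
      if (w < 0 || w >= n)
        return -1;                  /* jump out of range */
      if (state[w] == TOY_DISCOVERED)
        return -1;                  /* back-edge: w is still on the DFS stack */
      if (state[w] == TOY_UNSEEN) {
        state[w] = TOY_DISCOVERED;  /* tree edge: descend into w */
        stack[top++] = w;
        pushed = 1;
        break;
      }
      /* TOY_EXPLORED: forward or cross edge, harmless */
    }

    if (!pushed) {
      state[t] = TOY_EXPLORED;      /* all edges handled: backtrack */
      top--;
    }
  }

  for (i = 0; i < n; i++)
    if (state[i] != TOY_EXPLORED)
      return -1;                    /* unreachable node */
  return 0;
}
```

A diamond-shaped graph (0 to 1 and 2, both to 3) passes, because the second arrival at node 3 is a cross edge to an already explored node; a chain that jumps back to node 0 fails, because node 0 is still on the stack when the edge is examined.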

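  • 3. Step 2: scan the execution of the BPF code in detail, tracking registers and the stack, and check for any rule violations.

The algorithm walks the paths from step 1 again while tracking how the registers and the stack change, and decides whether any rule can be violated in the worst case.

When an instruction's explored_states[] entry is set to STATE_LIST_MARK, a separate bpf_verifier_state_list is allocated for that instruction to record its states across the different branches.

There is also a speed optimization here: pruning. If the state cur_state of the current branch at the current instruction is equal to, or a subset of, a state already analyzed on another branch at this instruction (one bpf_verifier_state_list member on the instruction's explored_states[] list), the current branch need not be analyzed, because it is known to satisfy the rules.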
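
Pruning can be pictured as a lookup in a per-instruction list of already verified states. The sketch below is a simplified model, not the kernel's states_equal(): it assumes a state is just an array of register types, and that an old state covers the current one when every register either matches or was never initialized on the old path (so the old path was proven safe without relying on it). All toy_ and TOY_ names are hypothetical.

```c
#include <assert.h>

#define TOY_NREGS 11
#define TOY_MAX_STATES 4

enum toy_type {
  TOY_NOT_INIT,
  TOY_UNKNOWN_VALUE,
  TOY_PTR_TO_CTX,
  TOY_PTR_TO_STACK
};

struct toy_state {
  enum toy_type regs[TOY_NREGS];
};

/* States already verified at one instruction. */
struct toy_explored {
  int count;
  struct toy_state states[TOY_MAX_STATES];
};

/* 1 if a path verified from 'old' also proves 'cur' safe, else 0. */
static int toy_state_covers(const struct toy_state *old, const struct toy_state *cur)
{
  int i;

  for (i = 0; i < TOY_NREGS; i++) {
    if (old->regs[i] == cur->regs[i])
      continue;
    if (old->regs[i] == TOY_NOT_INIT)
      continue;  /* the old path never relied on this register */
    return 0;
  }
  return 1;
}

/* Returns 1 when the current branch can be pruned; otherwise remembers
 * the state (if there is room) and returns 0 so analysis continues.
 */
static int toy_is_state_visited(struct toy_explored *ex, const struct toy_state *cur)
{
  int i;

  for (i = 0; i < ex->count; i++)
    if (toy_state_covers(&ex->states[i], cur))
      return 1;

  if (ex->count < TOY_MAX_STATES)
    ex->states[ex->count++] = *cur;
  return 0;
}
```

The first branch to reach an instruction always pays for a full analysis; later branches that arrive with a compatible state return immediately, which is what keeps the path explosion within the instruction budget in many programs.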

static int do_check(struct bpf_verifier_env *env)
{
  struct bpf_verifier_state *state = &env->cur_state;
  struct bpf_insn *insns = env->prog->insnsi;
  struct bpf_reg_state *regs = state->regs;
  int insn_cnt = env->prog->len;
  int insn_idx, prev_insn_idx = 0;
  int insn_processed = 0;
  bool do_print_state = false;

  init_reg_state(regs);
  insn_idx = 0;
  env->varlen_map_value_access = false;
  for (;;) {
    struct bpf_insn *insn;
    u8 class;
    int err;

    if (insn_idx >= insn_cnt) {
      verbose("invalid insn idx %d insn_cnt %d\n",
        insn_idx, insn_cnt);
      return -EFAULT;
    }

    insn = &insns[insn_idx];
    class = BPF_CLASS(insn->code);

    if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) {
      verbose("BPF program is too large. Proccessed %d insn\n",
        insn_processed);
      return -E2BIG;
    }

    err = is_state_visited(env, insn_idx);
    if (err < 0)
      return err;
    if (err == 1) {
      /* found equivalent state, can prune the search */
      if (log_level) {
        if (do_print_state)
          verbose("\nfrom %d to %d: safe\n",
            prev_insn_idx, insn_idx);
        else
          verbose("%d: safe\n", insn_idx);
      }
      goto process_bpf_exit;
    }

    if (need_resched())
      cond_resched();

    if (log_level && do_print_state) {
      verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
      print_verifier_state(&env->cur_state);
      do_print_state = false;
    }

    if (log_level) {
      verbose("%d: ", insn_idx);
      print_bpf_insn(env, insn);
    }

    err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx);
    if (err)
      return err;

    env->insn_aux_data[insn_idx].seen = true;
    if (class == BPF_ALU || class == BPF_ALU64) {
      err = check_alu_op(env, insn);
      if (err)
        return err;

    } else if (class == BPF_LDX) {
      enum bpf_reg_type *prev_src_type, src_reg_type;

      /* check for reserved fields is already done */

      /* check src operand */
      err = check_reg_arg(regs, insn->src_reg, SRC_OP);
      if (err)
        return err;

      err = check_reg_arg(regs, insn->dst_reg, DST_OP_NO_MARK);
      if (err)
        return err;

      src_reg_type = regs[insn->src_reg].type;

      /* check that memory (src_reg + off) is readable,
       * the state of dst_reg will be updated by this func
       */
      err = check_mem_access(env, insn->src_reg, insn->off,
                 BPF_SIZE(insn->code), BPF_READ,
                 insn->dst_reg);
      if (err)
        return err;

      reset_reg_range_values(regs, insn->dst_reg);
      if (BPF_SIZE(insn->code) != BPF_W &&
          BPF_SIZE(insn->code) != BPF_DW) {
        insn_idx++;
        continue;
      }
      prev_src_type = &env->insn_aux_data[insn_idx].ptr_type;
      if (*prev_src_type == NOT_INIT) {        /* saw a valid insn         * dst_reg = *(u32 *)(src_reg + off)         * save type to validate intersecting paths         */        *prev_src_type = src_reg_type;
      } else if (src_reg_type != *prev_src_type &&           (src_reg_type == PTR_TO_CTX ||            *prev_src_type == PTR_TO_CTX)) {        /* ABuser program is trying to use the same insn         * dst_reg = *(u32*) (src_reg + off)         * with different pointer types:         * src_reg == ctx in one branch and         * src_reg == stack|map in some other branch.         * Reject it.         */        verbose("same insn cannot be used with different pointers
");        return -EINVAL;      }
    } else if (class == BPF_STX) {      enum bpf_reg_type *prev_dst_type, dst_reg_type;
      if (BPF_MODE(insn->code) == BPF_XADD) {        err = check_xadd(env, insn);        if (err)          return err;        insn_idx++;        continue;      }
      /* check src1 operand */      err = check_reg_arg(regs, insn->src_reg, SRC_OP);      if (err)        return err;      /* check src2 operand */      err = check_reg_arg(regs, insn->dst_reg, SRC_OP);      if (err)        return err;
      dst_reg_type = regs[insn->dst_reg].type;
      /* check that memory (dst_reg + off) is writeable */      err = check_mem_access(env, insn->dst_reg, insn->off,                 BPF_SIZE(insn->code), BPF_WRITE,                 insn->src_reg);      if (err)        return err;
      prev_dst_type = &env->insn_aux_data[insn_idx].ptr_type;
      if (*prev_dst_type == NOT_INIT) {        *prev_dst_type = dst_reg_type;      } else if (dst_reg_type != *prev_dst_type &&           (dst_reg_type == PTR_TO_CTX ||            *prev_dst_type == PTR_TO_CTX)) {        verbose("same insn cannot be used with different pointers
");        return -EINVAL;      }
    } else if (class == BPF_ST) {      if (BPF_MODE(insn->code) != BPF_MEM ||          insn->src_reg != BPF_REG_0) {        verbose("BPF_ST uses reserved fields
");        return -EINVAL;      }      /* check src operand */      err = check_reg_arg(regs, insn->dst_reg, SRC_OP);      if (err)        return err;
      if (is_ctx_reg(env, insn->dst_reg)) {        verbose("BPF_ST stores into R%d context is not allowed
",          insn->dst_reg);        return -EACCES;      }
      /* check that memory (dst_reg + off) is writeable */      err = check_mem_access(env, insn->dst_reg, insn->off,                 BPF_SIZE(insn->code), BPF_WRITE,                 -1);      if (err)        return err;
    } else if (class == BPF_JMP) {      u8 opcode = BPF_OP(insn->code);
      if (opcode == BPF_CALL) {        if (BPF_SRC(insn->code) != BPF_K ||            insn->off != 0 ||            insn->src_reg != BPF_REG_0 ||            insn->dst_reg != BPF_REG_0) {          verbose("BPF_CALL uses reserved fields
");          return -EINVAL;        }
        err = check_call(env, insn->imm, insn_idx);        if (err)          return err;
      } else if (opcode == BPF_JA) {        if (BPF_SRC(insn->code) != BPF_K ||            insn->imm != 0 ||            insn->src_reg != BPF_REG_0 ||            insn->dst_reg != BPF_REG_0) {          verbose("BPF_JA uses reserved fields
");          return -EINVAL;        }
        insn_idx += insn->off + 1;        continue;
      } else if (opcode == BPF_EXIT) {        if (BPF_SRC(insn->code) != BPF_K ||            insn->imm != 0 ||            insn->src_reg != BPF_REG_0 ||            insn->dst_reg != BPF_REG_0) {          verbose("BPF_EXIT uses reserved fields
");          return -EINVAL;        }
        /* eBPF calling convetion is such that R0 is used         * to return the value from eBPF program.         * Make sure that it's readable at this time         * of bpf_exit, which means that program wrote         * something into it earlier         */        err = check_reg_arg(regs, BPF_REG_0, SRC_OP);        if (err)          return err;
        if (is_pointer_value(env, BPF_REG_0)) {          verbose("R0 leaks addr as return value
");          return -EACCES;        }
process_bpf_exit:        insn_idx = pop_stack(env, &prev_insn_idx);        if (insn_idx < 0) {          break;        } else {          do_print_state = true;          continue;        }      } else {        err = check_cond_jmp_op(env, insn, &insn_idx);        if (err)          return err;      }    } else if (class == BPF_LD) {      u8 mode = BPF_MODE(insn->code);
      if (mode == BPF_ABS || mode == BPF_IND) {        err = check_ld_abs(env, insn);        if (err)          return err;
      } else if (mode == BPF_IMM) {        err = check_ld_imm(env, insn);        if (err)          return err;
        insn_idx++;        env->insn_aux_data[insn_idx].seen = true;      } else {        verbose("invalid BPF_LD mode
");        return -EINVAL;      }      reset_reg_range_values(regs, insn->dst_reg);    } else {      verbose("unknown insn class %d
", class);      return -EINVAL;    }
    insn_idx++;  }
  verbose("processed %d insns
", insn_processed);  return 0;}

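  • 4. Fix up the calls to kernel helper functions in the BPF instructions, replacing each function ID with a reference to the actual function.

Instructions that satisfy (insn->code == (BPF_JMP | BPF_CALL)) are the ones that call a helper function.

Handling of ordinary helper functions: look up the function pointer that corresponds to the ID in insn->imm, then store the offset between that function pointer and __bpf_call_base back into insn->imm.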

static int fixup_bpf_calls(struct bpf_verifier_env *env)
{
  struct bpf_prog *prog = env->prog;
  struct bpf_insn *insn = prog->insnsi;
  const struct bpf_func_proto *fn;
  const int insn_cnt = prog->len;
  struct bpf_insn insn_buf[16];
  struct bpf_prog *new_prog;
  struct bpf_map *map_ptr;
  int i, cnt, delta = 0;

  /* (3.8.1) iterate over the instructions */
  for (i = 0; i < insn_cnt; i++, insn++) {

    /* (3.8.2) work around a bug affecting ALU div/mod instructions */
    if (insn->code == (BPF_ALU | BPF_MOD | BPF_X) ||
        insn->code == (BPF_ALU | BPF_DIV | BPF_X)) {
      /* due to JIT bugs clear upper 32-bits of src register
       * before div/mod operation
       */
      insn_buf[0] = BPF_MOV32_REG(insn->src_reg, insn->src_reg);
      insn_buf[1] = *insn;
      cnt = 2;
      new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
      if (!new_prog)
        return -ENOMEM;

      delta    += cnt - 1;
      env->prog = prog = new_prog;
      insn      = new_prog->insnsi + i + delta;
      continue;
    }

    /* (3.8.3) instructions with (insn->code == (BPF_JMP | BPF_CALL))
     * are calls to helper functions
     */
    if (insn->code != (BPF_JMP | BPF_CALL))
      continue;

    /* (3.8.3.1) handling of a few special helper functions */
    if (insn->imm == BPF_FUNC_get_route_realm)
      prog->dst_needed = 1;
    if (insn->imm == BPF_FUNC_get_prandom_u32)
      bpf_user_rnd_init_once();
    if (insn->imm == BPF_FUNC_tail_call) {
      /* mark bpf_tail_call as different opcode to avoid
       * conditional branch in the interpeter for every normal
       * call and to prevent accidental JITing by JIT compiler
       * that doesn't support bpf_tail_call yet
       */
      insn->imm = 0;
      insn->code |= BPF_X;

      /* instead of changing every JIT dealing with tail_call
       * emit two extra insns:
       * if (index >= max_entries) goto out;
       * index &= array->index_mask;
       * to avoid out-of-bounds cpu speculation
       */
      map_ptr = env->insn_aux_data[i + delta].map_ptr;
      if (!map_ptr->unpriv_array)
        continue;
      insn_buf[0] = BPF_JMP_IMM(BPF_JGE, BPF_REG_3,
              map_ptr->max_entries, 2);
      insn_buf[1] = BPF_ALU32_IMM(BPF_AND, BPF_REG_3,
                container_of(map_ptr,
                 struct bpf_array,
                 map)->index_mask);
      insn_buf[2] = *insn;
      cnt = 3;
      new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
      if (!new_prog)
        return -ENOMEM;

      delta    += cnt - 1;
      env->prog = prog = new_prog;
      insn      = new_prog->insnsi + i + delta;
      continue;
    }

    /* (3.8.3.2) ordinary helper functions: look up the function
     * pointer for the ID given in insn->imm
     */
    fn = prog->aux->ops->get_func_proto(insn->imm);
    /* all functions that have prototype and verifier allowed
     * programs to call them, must be real in-kernel functions
     */
    if (!fn->func) {
      verbose("kernel subsystem misconfigured func %d\n",
        insn->imm);
      return -EFAULT;
    }

    /* (3.8.3.3) then store the offset between the function pointer
     * and __bpf_call_base in insn->imm
     */
    insn->imm = fn->func - __bpf_call_base;
  }

  return 0;
}
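
The reason for storing an offset instead of a pointer is that insn->imm is only 32 bits wide, while a kernel function address is 64 bits. The round trip can be sketched as below. The helpers `encode_call` and `decode_call` are hypothetical names, and addresses are modelled as plain integers so that the example stays portable; the decode step mirrors the `(u64)__bpf_call_base + imm` expression used by the arm64 JIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Store the distance from base to func in a 32-bit immediate.
 * Returns false if the distance does not fit into 32 bits. */
static bool encode_call(uint64_t base, uint64_t func, int32_t *imm)
{
	int64_t delta = (int64_t)(func - base);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return false;
	*imm = (int32_t)delta;
	return true;
}

/* Recover the full address at run time (or at JIT time). */
static uint64_t decode_call(uint64_t base, int32_t imm)
{
	return base + (uint64_t)(int64_t)imm;   /* sign-extend, then add */
}
```

As long as every helper lives within 2 GiB of the base symbol, the 32-bit field is enough to address it in either direction.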

1.1.3. bpf JIT/kernel interpreter

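Once the verifier has accepted the program, the kernel uses a JIT (Just-In-Time) compiler to translate the BPF bytecode into native instructions. If the current architecture has no JIT support, the kernel falls back to an interpreter that emulates execution, which is slower.

Some architectures (64-bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32-bit arm) already support a BPF JIT, which translates BPF code into native code efficiently and almost one-to-one (the eBPF instruction set was designed with this in mind: it closely resembles recent arm/x86 architectures, and the ABI is similar as well). On an architecture without JIT support, only the kernel interpreter is available.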

struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
{
#ifndef CONFIG_BPF_JIT_ALWAYS_ON
  /* (4.1) entry point of the BPF program when no JIT is available
   * and only the interpreter can be used
   */
  fp->bpf_func = (void *) __bpf_prog_run;
#else
  fp->bpf_func = (void *) __bpf_prog_ret0;
#endif

  /* eBPF JITs can rewrite the program in case constant
   * blinding is active. However, in case of error during
   * blinding, bpf_int_jit_compile() must always return a
   * valid program, which in this case would simply not
   * be JITed, but falls back to the interpreter.
   */
  /* (4.2) try to JIT-compile the BPF program */
  fp = bpf_int_jit_compile(fp);
#ifdef CONFIG_BPF_JIT_ALWAYS_ON
  if (!fp->jited) {
    *err = -ENOTSUPP;
    return fp;
  }
#endif
  bpf_prog_lock_ro(fp);

  /* The tail call compatibility check can only be done at
   * this late stage as we need to determine, if we deal
   * with JITed or non JITed program concatenations and not
   * all eBPF JITs might immediately support all features.
   */
  /* (4.3) run some checks on the BPF_MAP_TYPE_PROG_ARRAY maps used by tail calls */
  *err = bpf_check_tail_call(fp);

  return fp;
}

  • 1. JIT

Taking the arm64 JIT as an example:

struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
{
  struct bpf_prog *tmp, *orig_prog = prog;
  struct bpf_binary_header *header;
  bool tmp_blinded = false;
  struct jit_ctx ctx;
  int image_size;
  u8 *image_ptr;

  if (!bpf_jit_enable)
    return orig_prog;

  /* (4.2.1) blind the constants (constant blinding) */
  tmp = bpf_jit_blind_constants(prog);
  /* If blinding was requested and we failed during blinding,
   * we must fall back to the interpreter.
   */
  if (IS_ERR(tmp))
    return orig_prog;
  if (tmp != prog) {
    tmp_blinded = true;
    prog = tmp;
  }

  memset(&ctx, 0, sizeof(ctx));
  ctx.prog = prog;

  ctx.offset = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
  if (ctx.offset == NULL) {
    prog = orig_prog;
    goto out;
  }

  /* 1. Initial fake pass to compute ctx->idx. */
  /* (4.2.2) translate the instructions without storing the result;
   * only record the total length needed for
   * "prologue + body + epilogue"
   */

  /* Fake pass to fill in ctx->offset. */
  if (build_body(&ctx)) {
    prog = orig_prog;
    goto out_off;
  }

  if (build_prologue(&ctx)) {
    prog = orig_prog;
    goto out_off;
  }

  ctx.epilogue_offset = ctx.idx;
  build_epilogue(&ctx);

  /* Now we know the actual image size. */
  /* (4.2.3) allocate the storage for the translated instructions
   * (ctx.image) based on the computed total length
   */
  image_size = sizeof(u32) * ctx.idx;
  header = bpf_jit_binary_alloc(image_size, &image_ptr,
              sizeof(u32), jit_fill_hole);
  if (header == NULL) {
    prog = orig_prog;
    goto out_off;
  }

  /* 2. Now, the actual pass. */
  /* (4.2.4) run the translation again, this time storing the
   * translated instructions in ctx.image
   */

  ctx.image = (u32 *)image_ptr;
  ctx.idx = 0;

  /* (4.2.4.1) build the head of the image, which sets up the stack
   * the BPF program runs on; 8 instructions
   */
  build_prologue(&ctx);

  /* (4.2.4.2) translate the BPF program into native instructions */
  if (build_body(&ctx)) {
    bpf_jit_binary_free(header);
    prog = orig_prog;
    goto out_off;
  }

  /* (4.2.4.3) build the tail of the image, which does the cleanup;
   * 7 instructions
   */
  build_epilogue(&ctx);

  /* 3. Extra pass to validate JITed code. */
  /* (4.2.5) make sure the translated code contains no AARCH64_BREAK_FAULT */
  if (validate_code(&ctx)) {
    bpf_jit_binary_free(header);
    prog = orig_prog;
    goto out_off;
  }

  /* And we're done. */
  if (bpf_jit_enable > 1)
    bpf_jit_dump(prog->len, image_size, 2, ctx.image);

  /* (4.2.6) flush the icache for the new image */
  bpf_flush_icache(header, ctx.image + ctx.idx);

  /* (4.2.7) make the pages that hold the image read-only */
  set_memory_ro((unsigned long)header, header->pages);

  /* (4.2.8) assign the translated image to prog->bpf_func */
  prog->bpf_func = (void *)ctx.image;
  prog->jited = 1;

out_off:
  kfree(ctx.offset);
out:
  if (tmp_blinded)
    bpf_jit_prog_release_other(prog, prog == orig_prog ?
             tmp : orig_prog);
  return prog;
}

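The core of the JIT translation consists of three parts: prologue + body + epilogue.
prologue: additional instructions that set up the stack the BPF program runs on and save the caller's context;
body: the BPF program itself;
epilogue: restores the context and cleans up after the BPF program has finished;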

  • 1.1. prologue

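Names starting with A64_ are native arm64 registers.

Names starting with BPF_ are registers of the BPF virtual machine.

The whole procedure is rather neat:

First, A64_FP/A64_LR are pushed onto the stack at A64_SP, and the current A64_SP is then saved in A64_FP;

Next, the callee-saved registers r6, r7, r8, r9, fp and tcc are pushed onto the stack at A64_SP, and the current A64_SP is then saved in BPF_FP;

A64_SP is then lowered by STACK_SIZE, which reserves the 512-byte BPF stack (MAX_BPF_STACK) below BPF_FP, followed by a small JIT scratchpad;

As a result, the BPF program uses the 512 bytes of stack that start at BPF_FP, while ordinary kernel functions use the stack that continues downward from A64_SP, so the two do not interfere with each other;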

static int build_prologue(struct jit_ctx *ctx)
{
  const u8 r6 = bpf2a64[BPF_REG_6];
  const u8 r7 = bpf2a64[BPF_REG_7];
  const u8 r8 = bpf2a64[BPF_REG_8];
  const u8 r9 = bpf2a64[BPF_REG_9];
  const u8 fp = bpf2a64[BPF_REG_FP];
  const u8 tcc = bpf2a64[TCALL_CNT];
  const int idx0 = ctx->idx;
  int cur_offset;

  /*
   * BPF prog stack layout
   *
   *                         high
   * original A64_SP =>   0:+-----+ BPF prologue
   *                        |FP/LR|
   * current A64_FP =>  -16:+-----+
   *                        | ... | callee saved registers
   * BPF fp register => -64:+-----+ <= (BPF_FP)
   *                        |     |
   *                        | ... | BPF prog stack
   *                        |     |
   *                        +-----+ <= (BPF_FP - MAX_BPF_STACK)
   *                        |RSVD | JIT scratchpad
   * current A64_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
   *                        |     |
   *                        | ... | Function call stack
   *                        |     |
   *                        +-----+
   *                          low
   *
   */

  /* Save FP and LR registers to stay align with ARM64 AAPCS */
  emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
  emit(A64_MOV(1, A64_FP, A64_SP), ctx);

  /* Save callee-saved registers */
  emit(A64_PUSH(r6, r7, A64_SP), ctx);
  emit(A64_PUSH(r8, r9, A64_SP), ctx);
  emit(A64_PUSH(fp, tcc, A64_SP), ctx);

  /* Set up BPF prog stack base register */
  emit(A64_MOV(1, fp, A64_SP), ctx);

  /* Initialize tail_call_cnt */
  emit(A64_MOVZ(1, tcc, 0, 0), ctx);

  /* Set up function call stack */
  emit(A64_SUB_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);

  cur_offset = ctx->idx - idx0;
  if (cur_offset != PROLOGUE_OFFSET) {
    pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
          cur_offset, PROLOGUE_OFFSET);
    return -1;
  }
  return 0;
}
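
The offsets shown in the stack layout comment can be reproduced with a little arithmetic. The sketch below is only an illustration: `prologue_layout`, `struct layout` and `STACK_SIZE_SKETCH` are hypothetical names, and the 4-byte scratchpad rounded up to 16-byte alignment is an assumption rather than a value taken from the listing above. Each A64_PUSH stores a register pair, so it moves the stack pointer by 16 bytes.

```c
#include <assert.h>

#define MAX_BPF_STACK     512
#define SCRATCH_BYTES     4      /* assumption: size of the JIT scratchpad */
#define ALIGN16(x)        (((x) + 15) & ~15)
#define STACK_SIZE_SKETCH ALIGN16(MAX_BPF_STACK + SCRATCH_BYTES)

/* All offsets are relative to A64_SP on entry to the prologue. */
struct layout {
	int a64_fp;         /* native frame pointer */
	int bpf_fp;         /* top of the BPF program stack */
	int bpf_stack_low;  /* lowest byte of the BPF program stack */
	int a64_sp;         /* stack pointer seen by called kernel functions */
};

static struct layout prologue_layout(void)
{
	struct layout l;
	int sp = 0;

	sp -= 16;                 /* push FP, LR */
	l.a64_fp = sp;            /* mov A64_FP, A64_SP */
	sp -= 16;                 /* push r6, r7 */
	sp -= 16;                 /* push r8, r9 */
	sp -= 16;                 /* push fp, tcc */
	l.bpf_fp = sp;            /* mov BPF_FP, A64_SP */
	l.bpf_stack_low = l.bpf_fp - MAX_BPF_STACK;
	sp -= STACK_SIZE_SKETCH;  /* sub A64_SP, A64_SP, STACK_SIZE */
	l.a64_sp = sp;
	assert(l.a64_sp <= l.bpf_stack_low);  /* the two regions never overlap */
	return l;
}
```

Under these assumptions the function yields -16 for A64_FP and -64 for BPF_FP, matching the diagram, and the native stack pointer ends up below the whole BPF stack.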

  • 1.2. body

Translate the BPF instructions into native arm64 instructions:

static int build_body(struct jit_ctx *ctx)
{
  const struct bpf_prog *prog = ctx->prog;
  int i;

  for (i = 0; i < prog->len; i++) {
    const struct bpf_insn *insn = &prog->insnsi[i];
    int ret;

    ret = build_insn(insn, ctx);
    if (ret > 0) {
      i++;
      if (ctx->image == NULL)
        ctx->offset[i] = ctx->idx;
      continue;
    }
    if (ctx->image == NULL)
      ctx->offset[i] = ctx->idx;
    if (ret)
      return ret;
  }

  return 0;
}

/* JITs an eBPF instruction. * Returns: * 0  - successfully JITed an 8-byte eBPF instruction. * >0 - successfully JITed a 16-byte eBPF instruction. * <0 - failed to JIT. */static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx){  const u8 code = insn->code;  const u8 dst = bpf2a64[insn->dst_reg];  const u8 src = bpf2a64[insn->src_reg];  const u8 tmp = bpf2a64[TMP_REG_1];  const u8 tmp2 = bpf2a64[TMP_REG_2];  const s16 off = insn->off;  const s32 imm = insn->imm;  const int i = insn - ctx->prog->insnsi;  const bool is64 = BPF_CLASS(code) == BPF_ALU64;  u8 jmp_cond;  s32 jmp_offset;
#define check_imm(bits, imm) do {          if ((((imm) > 0) && ((imm) >> (bits))) ||          (((imm) < 0) && (~(imm) >> (bits)))) {        pr_info("[%2d] imm=%d(0x%x) out of range
",        i, imm, imm);            return -EINVAL;            }              } while (0)#define check_imm19(imm) check_imm(19, imm)#define check_imm26(imm) check_imm(26, imm)
  switch (code) {  /* dst = src */  case BPF_ALU | BPF_MOV | BPF_X:  case BPF_ALU64 | BPF_MOV | BPF_X:    emit(A64_MOV(is64, dst, src), ctx);    break;  /* dst = dst OP src */  case BPF_ALU | BPF_ADD | BPF_X:  case BPF_ALU64 | BPF_ADD | BPF_X:    emit(A64_ADD(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_SUB | BPF_X:  case BPF_ALU64 | BPF_SUB | BPF_X:    emit(A64_SUB(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_AND | BPF_X:  case BPF_ALU64 | BPF_AND | BPF_X:    emit(A64_AND(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_OR | BPF_X:  case BPF_ALU64 | BPF_OR | BPF_X:    emit(A64_ORR(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_XOR | BPF_X:  case BPF_ALU64 | BPF_XOR | BPF_X:    emit(A64_EOR(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_MUL | BPF_X:  case BPF_ALU64 | BPF_MUL | BPF_X:    emit(A64_MUL(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_DIV | BPF_X:  case BPF_ALU64 | BPF_DIV | BPF_X:  case BPF_ALU | BPF_MOD | BPF_X:  case BPF_ALU64 | BPF_MOD | BPF_X:  {    const u8 r0 = bpf2a64[BPF_REG_0];
    /* if (src == 0) return 0 */    jmp_offset = 3; /* skip ahead to else path */    check_imm19(jmp_offset);    emit(A64_CBNZ(is64, src, jmp_offset), ctx);    emit(A64_MOVZ(1, r0, 0, 0), ctx);    jmp_offset = epilogue_offset(ctx);    check_imm26(jmp_offset);    emit(A64_B(jmp_offset), ctx);    /* else */    switch (BPF_OP(code)) {    case BPF_DIV:      emit(A64_UDIV(is64, dst, dst, src), ctx);      break;    case BPF_MOD:      emit(A64_UDIV(is64, tmp, dst, src), ctx);      emit(A64_MUL(is64, tmp, tmp, src), ctx);      emit(A64_SUB(is64, dst, dst, tmp), ctx);      break;    }    break;  }  case BPF_ALU | BPF_LSH | BPF_X:  case BPF_ALU64 | BPF_LSH | BPF_X:    emit(A64_LSLV(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_RSH | BPF_X:  case BPF_ALU64 | BPF_RSH | BPF_X:    emit(A64_LSRV(is64, dst, dst, src), ctx);    break;  case BPF_ALU | BPF_ARSH | BPF_X:  case BPF_ALU64 | BPF_ARSH | BPF_X:    emit(A64_ASRV(is64, dst, dst, src), ctx);    break;  /* dst = -dst */  case BPF_ALU | BPF_NEG:  case BPF_ALU64 | BPF_NEG:    emit(A64_NEG(is64, dst, dst), ctx);    break;  /* dst = BSWAP##imm(dst) */  case BPF_ALU | BPF_END | BPF_FROM_LE:  case BPF_ALU | BPF_END | BPF_FROM_BE:#ifdef CONFIG_CPU_BIG_ENDIAN    if (BPF_SRC(code) == BPF_FROM_BE)      goto emit_bswap_uxt;#else /* !CONFIG_CPU_BIG_ENDIAN */    if (BPF_SRC(code) == BPF_FROM_LE)      goto emit_bswap_uxt;#endif    switch (imm) {    case 16:      emit(A64_REV16(is64, dst, dst), ctx);      /* zero-extend 16 bits into 64 bits */      emit(A64_UXTH(is64, dst, dst), ctx);      break;    case 32:      emit(A64_REV32(is64, dst, dst), ctx);      /* upper 32 bits already cleared */      break;    case 64:      emit(A64_REV64(dst, dst), ctx);      break;    }    break;emit_bswap_uxt:    switch (imm) {    case 16:      /* zero-extend 16 bits into 64 bits */      emit(A64_UXTH(is64, dst, dst), ctx);      break;    case 32:      /* zero-extend 32 bits into 64 bits */      emit(A64_UXTW(is64, dst, dst), ctx);      break;    
case 64:      /* nop */      break;    }    break;  /* dst = imm */  case BPF_ALU | BPF_MOV | BPF_K:  case BPF_ALU64 | BPF_MOV | BPF_K:    emit_a64_mov_i(is64, dst, imm, ctx);    break;  /* dst = dst OP imm */  case BPF_ALU | BPF_ADD | BPF_K:  case BPF_ALU64 | BPF_ADD | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_ADD(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_SUB | BPF_K:  case BPF_ALU64 | BPF_SUB | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_SUB(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_AND | BPF_K:  case BPF_ALU64 | BPF_AND | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_AND(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_OR | BPF_K:  case BPF_ALU64 | BPF_OR | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_ORR(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_XOR | BPF_K:  case BPF_ALU64 | BPF_XOR | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_EOR(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_MUL | BPF_K:  case BPF_ALU64 | BPF_MUL | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_MUL(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_DIV | BPF_K:  case BPF_ALU64 | BPF_DIV | BPF_K:    emit_a64_mov_i(is64, tmp, imm, ctx);    emit(A64_UDIV(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_MOD | BPF_K:  case BPF_ALU64 | BPF_MOD | BPF_K:    emit_a64_mov_i(is64, tmp2, imm, ctx);    emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);    emit(A64_MUL(is64, tmp, tmp, tmp2), ctx);    emit(A64_SUB(is64, dst, dst, tmp), ctx);    break;  case BPF_ALU | BPF_LSH | BPF_K:  case BPF_ALU64 | BPF_LSH | BPF_K:    emit(A64_LSL(is64, dst, dst, imm), ctx);    break;  case BPF_ALU | BPF_RSH | BPF_K:  case BPF_ALU64 | BPF_RSH | BPF_K:    emit(A64_LSR(is64, dst, dst, imm), ctx);    break;  case BPF_ALU | BPF_ARSH | BPF_K:  case BPF_ALU64 | BPF_ARSH | BPF_K:    emit(A64_ASR(is64, dst, dst, imm), ctx);    break;
  /* JUMP off */  case BPF_JMP | BPF_JA:    jmp_offset = bpf2a64_offset(i + off, i, ctx);    check_imm26(jmp_offset);    emit(A64_B(jmp_offset), ctx);    break;  /* IF (dst COND src) JUMP off */  case BPF_JMP | BPF_JEQ | BPF_X:  case BPF_JMP | BPF_JGT | BPF_X:  case BPF_JMP | BPF_JGE | BPF_X:  case BPF_JMP | BPF_JNE | BPF_X:  case BPF_JMP | BPF_JSGT | BPF_X:  case BPF_JMP | BPF_JSGE | BPF_X:    emit(A64_CMP(1, dst, src), ctx);emit_cond_jmp:    jmp_offset = bpf2a64_offset(i + off, i, ctx);    check_imm19(jmp_offset);    switch (BPF_OP(code)) {    case BPF_JEQ:      jmp_cond = A64_COND_EQ;      break;    case BPF_JGT:      jmp_cond = A64_COND_HI;      break;    case BPF_JGE:      jmp_cond = A64_COND_CS;      break;    case BPF_JSET:    case BPF_JNE:      jmp_cond = A64_COND_NE;      break;    case BPF_JSGT:      jmp_cond = A64_COND_GT;      break;    case BPF_JSGE:      jmp_cond = A64_COND_GE;      break;    default:      return -EFAULT;    }    emit(A64_B_(jmp_cond, jmp_offset), ctx);    break;  case BPF_JMP | BPF_JSET | BPF_X:    emit(A64_TST(1, dst, src), ctx);    goto emit_cond_jmp;  /* IF (dst COND imm) JUMP off */  case BPF_JMP | BPF_JEQ | BPF_K:  case BPF_JMP | BPF_JGT | BPF_K:  case BPF_JMP | BPF_JGE | BPF_K:  case BPF_JMP | BPF_JNE | BPF_K:  case BPF_JMP | BPF_JSGT | BPF_K:  case BPF_JMP | BPF_JSGE | BPF_K:    emit_a64_mov_i(1, tmp, imm, ctx);    emit(A64_CMP(1, dst, tmp), ctx);    goto emit_cond_jmp;  case BPF_JMP | BPF_JSET | BPF_K:    emit_a64_mov_i(1, tmp, imm, ctx);    emit(A64_TST(1, dst, tmp), ctx);    goto emit_cond_jmp;  /* function call */  case BPF_JMP | BPF_CALL:  {    const u8 r0 = bpf2a64[BPF_REG_0];    const u64 func = (u64)__bpf_call_base + imm;
    emit_a64_mov_i64(tmp, func, ctx);    emit(A64_BLR(tmp), ctx);    emit(A64_MOV(1, r0, A64_R(0)), ctx);    break;  }  /* tail call */  case BPF_JMP | BPF_CALL | BPF_X:    if (emit_bpf_tail_call(ctx))      return -EFAULT;    break;  /* function return */  case BPF_JMP | BPF_EXIT:    /* Optimization: when last instruction is EXIT,       simply fallthrough to epilogue. */    if (i == ctx->prog->len - 1)      break;    jmp_offset = epilogue_offset(ctx);    check_imm26(jmp_offset);    emit(A64_B(jmp_offset), ctx);    break;
  /* dst = imm64 */  case BPF_LD | BPF_IMM | BPF_DW:  {    const struct bpf_insn insn1 = insn[1];    u64 imm64;
    if (insn1.code != 0 || insn1.src_reg != 0 ||        insn1.dst_reg != 0 || insn1.off != 0) {      /* Note: verifier in BPF core must catch invalid       * instructions.       */      pr_err_once("Invalid BPF_LD_IMM64 instruction
");      return -EINVAL;    }
    imm64 = (u64)insn1.imm << 32 | (u32)imm;    emit_a64_mov_i64(dst, imm64, ctx);
    return 1;  }
  /* LDX: dst = *(size *)(src + off) */  case BPF_LDX | BPF_MEM | BPF_W:  case BPF_LDX | BPF_MEM | BPF_H:  case BPF_LDX | BPF_MEM | BPF_B:  case BPF_LDX | BPF_MEM | BPF_DW:    emit_a64_mov_i(1, tmp, off, ctx);    switch (BPF_SIZE(code)) {    case BPF_W:      emit(A64_LDR32(dst, src, tmp), ctx);      break;    case BPF_H:      emit(A64_LDRH(dst, src, tmp), ctx);      break;    case BPF_B:      emit(A64_LDRB(dst, src, tmp), ctx);      break;    case BPF_DW:      emit(A64_LDR64(dst, src, tmp), ctx);      break;    }    break;
  /* ST: *(size *)(dst + off) = imm */  case BPF_ST | BPF_MEM | BPF_W:  case BPF_ST | BPF_MEM | BPF_H:  case BPF_ST | BPF_MEM | BPF_B:  case BPF_ST | BPF_MEM | BPF_DW:    /* Load imm to a register then store it */    emit_a64_mov_i(1, tmp2, off, ctx);    emit_a64_mov_i(1, tmp, imm, ctx);    switch (BPF_SIZE(code)) {    case BPF_W:      emit(A64_STR32(tmp, dst, tmp2), ctx);      break;    case BPF_H:      emit(A64_STRH(tmp, dst, tmp2), ctx);      break;    case BPF_B:      emit(A64_STRB(tmp, dst, tmp2), ctx);      break;    case BPF_DW:      emit(A64_STR64(tmp, dst, tmp2), ctx);      break;    }    break;
  /* STX: *(size *)(dst + off) = src */  case BPF_STX | BPF_MEM | BPF_W:  case BPF_STX | BPF_MEM | BPF_H:  case BPF_STX | BPF_MEM | BPF_B:  case BPF_STX | BPF_MEM | BPF_DW:    emit_a64_mov_i(1, tmp, off, ctx);    switch (BPF_SIZE(code)) {    case BPF_W:      emit(A64_STR32(src, dst, tmp), ctx);      break;    case BPF_H:      emit(A64_STRH(src, dst, tmp), ctx);      break;    case BPF_B:      emit(A64_STRB(src, dst, tmp), ctx);      break;    case BPF_DW:      emit(A64_STR64(src, dst, tmp), ctx);      break;    }    break;  /* STX XADD: lock *(u32 *)(dst + off) += src */  case BPF_STX | BPF_XADD | BPF_W:  /* STX XADD: lock *(u64 *)(dst + off) += src */  case BPF_STX | BPF_XADD | BPF_DW:    goto notyet;
  /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */  case BPF_LD | BPF_ABS | BPF_W:  case BPF_LD | BPF_ABS | BPF_H:  case BPF_LD | BPF_ABS | BPF_B:  /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */  case BPF_LD | BPF_IND | BPF_W:  case BPF_LD | BPF_IND | BPF_H:  case BPF_LD | BPF_IND | BPF_B:  {    const u8 r0 = bpf2a64[BPF_REG_0]; /* r0 = return value */    const u8 r6 = bpf2a64[BPF_REG_6]; /* r6 = pointer to sk_buff */    const u8 fp = bpf2a64[BPF_REG_FP];    const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct sk_buff *skb */    const u8 r2 = bpf2a64[BPF_REG_2]; /* r2: int k */    const u8 r3 = bpf2a64[BPF_REG_3]; /* r3: unsigned int size */    const u8 r4 = bpf2a64[BPF_REG_4]; /* r4: void *buffer */    const u8 r5 = bpf2a64[BPF_REG_5]; /* r5: void *(*func)(...) */    int size;
    emit(A64_MOV(1, r1, r6), ctx);    emit_a64_mov_i(0, r2, imm, ctx);    if (BPF_MODE(code) == BPF_IND)      emit(A64_ADD(0, r2, r2, src), ctx);    switch (BPF_SIZE(code)) {    case BPF_W:      size = 4;      break;    case BPF_H:      size = 2;      break;    case BPF_B:      size = 1;      break;    default:      return -EINVAL;    }    emit_a64_mov_i64(r3, size, ctx);    emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx);    emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx);    emit(A64_BLR(r5), ctx);    emit(A64_MOV(1, r0, A64_R(0)), ctx);
    jmp_offset = epilogue_offset(ctx);    check_imm19(jmp_offset);    emit(A64_CBZ(1, r0, jmp_offset), ctx);    emit(A64_MOV(1, r5, r0), ctx);    switch (BPF_SIZE(code)) {    case BPF_W:      emit(A64_LDR32(r0, r5, A64_ZR), ctx);#ifndef CONFIG_CPU_BIG_ENDIAN      emit(A64_REV32(0, r0, r0), ctx);#endif      break;    case BPF_H:      emit(A64_LDRH(r0, r5, A64_ZR), ctx);#ifndef CONFIG_CPU_BIG_ENDIAN      emit(A64_REV16(0, r0, r0), ctx);#endif      break;    case BPF_B:      emit(A64_LDRB(r0, r5, A64_ZR), ctx);      break;    }    break;  }notyet:    pr_info_once("*** NOT YET: opcode %02x ***
", code);    return -EFAULT;
  default:    pr_err_once("unknown opcode %02x
", code);    return -EINVAL;  }
  return 0;}
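
Two of the translations in build_insn() are worth spelling out in plain C. The sketch below is an illustration only, with hypothetical function names: `mod_via_udiv` shows how BPF_MOD is built from the UDIV, MUL and SUB sequence that the JIT emits, and `ld_imm64` shows how the 16-byte BPF_LD | BPF_IMM | BPF_DW instruction combines the imm fields of two consecutive instruction slots.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder from divide, multiply and subtract, in the same order as the
 * emitted instructions. The zero-divisor case is excluded here because the
 * register form of the JIT code branches to the epilogue with r0 = 0 first. */
static uint64_t mod_via_udiv(uint64_t dst, uint64_t src)
{
	uint64_t tmp;

	assert(src != 0);
	tmp = dst / src;     /* UDIV tmp, dst, src */
	tmp = tmp * src;     /* MUL  tmp, tmp, src */
	return dst - tmp;    /* SUB  dst, dst, tmp */
}

/* Combine the low half (first slot) and high half (second slot). The low
 * half is zero-extended so a negative imm cannot leak into the high bits. */
static uint64_t ld_imm64(int32_t lo_imm, int32_t hi_imm)
{
	return ((uint64_t)(uint32_t)hi_imm << 32) | (uint32_t)lo_imm;
}
```

The cast of the low half to an unsigned 32-bit value matters: without it, a low word with its top bit set would sign-extend and overwrite the high word.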

static inline void emit(const u32 insn, struct jit_ctx *ctx)
{
  if (ctx->image != NULL)
    ctx->image[ctx->idx] = cpu_to_le32(insn);

  ctx->idx++;
}
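
Because emit() only writes when ctx->image is set but always advances ctx->idx, the same builder code can be run twice: once to measure and once to write. A minimal user-space sketch of that two-pass pattern follows. The names `struct ctx`, `build` and `jit_two_pass` are hypothetical, the "program" is just a run of arm64 NOP encodings, and the byte-order conversion done by cpu_to_le32() is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct ctx { uint32_t *image; int idx; };

/* Counts always, stores only when an image buffer exists. */
static void emit(uint32_t insn, struct ctx *c)
{
	if (c->image != NULL)
		c->image[c->idx] = insn;
	c->idx++;
}

/* Stand-in for prologue + body + epilogue: emits n NOPs. */
static void build(struct ctx *c, int n)
{
	for (int i = 0; i < n; i++)
		emit(0xd503201fu /* arm64 NOP */, c);
}

/* Pass 1 sizes the output, pass 2 fills it. The caller frees the result. */
static uint32_t *jit_two_pass(int n, int *count)
{
	struct ctx c = { NULL, 0 };

	assert(n > 0);
	build(&c, n);                 /* pass 1: count only */
	*count = c.idx;

	c.image = calloc((size_t)c.idx, sizeof(uint32_t));
	if (c.image == NULL)
		return NULL;
	c.idx = 0;
	build(&c, n);                 /* pass 2: write for real */
	return c.image;
}
```

The design avoids guessing a buffer size or reallocating while emitting: the first pass yields the exact length, which is also when the real JIT records per-instruction offsets for jump targets.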

  • 1.3. epilogue

It does the reverse of the prologue: it restores the saved registers and tears down the stack:

static void build_epilogue(struct jit_ctx *ctx)
{
  const u8 r0 = bpf2a64[BPF_REG_0];
  const u8 r6 = bpf2a64[BPF_REG_6];
  const u8 r7 = bpf2a64[BPF_REG_7];
  const u8 r8 = bpf2a64[BPF_REG_8];
  const u8 r9 = bpf2a64[BPF_REG_9];
  const u8 fp = bpf2a64[BPF_REG_FP];

  /* We're done with BPF stack */
  emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);

  /* Restore fs (x25) and x26 */
  emit(A64_POP(fp, A64_R(26), A64_SP), ctx);

  /* Restore callee-saved register */
  emit(A64_POP(r8, r9, A64_SP), ctx);
  emit(A64_POP(r6, r7, A64_SP), ctx);

  /* Restore FP/LR registers */
  emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);

  /* Set return value */
  emit(A64_MOV(1, A64_R(0), r0), ctx);

  emit(A64_RET(A64_LR), ctx);
}

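  • 2. interpreter

When JIT is not supported, the kernel has to use an interpreter that decodes the BPF instructions in prog->insnsi[] and emulates their execution:

The local variable "u64 stack[MAX_BPF_STACK / sizeof(u64)]" emulates the BPF stack space;

The local variable "u64 regs[MAX_BPF_REG]" emulates the BPF registers;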
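
The idea of emulating registers and stack with two local arrays can be shown with a toy interpreter. Everything in this sketch is hypothetical: the opcodes are not real BPF encodings, stack accesses are addressed by 8-byte slot rather than by byte, and a portable switch statement stands in for the computed-goto jump table that __bpf_prog_run() uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mini instruction set, not the real BPF opcodes. */
enum op { OP_MOV_K, OP_ADD_K, OP_ADD_X, OP_STX, OP_LDX, OP_EXIT };
struct insn { enum op op; int dst, src; int32_t imm; };

#define NREGS  11
#define NSLOTS (512 / sizeof(uint64_t))   /* 512-byte stack as 8-byte slots */

/* imm is a negative slot offset from the top of the stack: -1 .. -NSLOTS. */
static uint64_t *slot(uint64_t *stack, int32_t imm)
{
	assert(imm < 0 && imm >= -(int32_t)NSLOTS);
	return &stack[(int32_t)NSLOTS + imm];
}

/* Runs until OP_EXIT and returns the value of register 0. */
static uint64_t run(const struct insn *insn)
{
	uint64_t stack[NSLOTS] = { 0 };   /* emulated BPF stack */
	uint64_t regs[NREGS] = { 0 };     /* emulated BPF registers */

	for (;; insn++) {
		switch (insn->op) {
		case OP_MOV_K:
			regs[insn->dst] = (uint64_t)(int64_t)insn->imm;
			break;
		case OP_ADD_K:
			regs[insn->dst] += (uint64_t)(int64_t)insn->imm;
			break;
		case OP_ADD_X:
			regs[insn->dst] += regs[insn->src];
			break;
		case OP_STX:
			*slot(stack, insn->imm) = regs[insn->src];
			break;
		case OP_LDX:
			regs[insn->dst] = *slot(stack, insn->imm);
			break;
		case OP_EXIT:
			return regs[0];
		}
	}
}
```

A program that loads 40 into r1, adds 2, spills r1 to the top stack slot, reloads it into r0 and exits returns 42, all without touching anything outside the two arrays.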

/** *  __bpf_prog_run - run eBPF program on a given context *  @ctx: is the data we are operating on *  @insn: is the array of eBPF instructions * * Decode and execute eBPF instructions. */static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn){  u64 stack[MAX_BPF_STACK / sizeof(u64)];  u64 regs[MAX_BPF_REG], tmp;  static const void *jumptable[256] = {    [0 ... 255] = &&default_label,    /* Now overwrite non-defaults ... */    /* 32 bit ALU operations */    [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,    [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,    [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,    [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,    [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,    [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,    [BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,    [BPF_ALU | BPF_OR | BPF_K]  = &&ALU_OR_K,    [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,    [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,    [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,    [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,    [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,    [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,    [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,    [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,    [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,    [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,    [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,    [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,    [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,    [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,    [BPF_ALU | BPF_NEG] = &&ALU_NEG,    [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,    [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,    /* 64 bit ALU operations */    [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,    [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,    [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,    [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,    [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,    [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,    [BPF_ALU64 | BPF_OR | BPF_X] = 
&&ALU64_OR_X,    [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,    [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,    [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,    [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,    [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,    [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,    [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,    [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,    [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,    [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,    [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,    [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,    [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,    [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,    [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,    [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,    [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,    [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,    /* Call instruction */    [BPF_JMP | BPF_CALL] = &&JMP_CALL,    [BPF_JMP | BPF_CALL | BPF_X] = &&JMP_TAIL_CALL,    /* Jumps */    [BPF_JMP | BPF_JA] = &&JMP_JA,    [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,    [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,    [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,    [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,    [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,    [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,    [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,    [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,    [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,    [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,    [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,    [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,    [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,    [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,    /* Program return */    [BPF_JMP | BPF_EXIT] = &&JMP_EXIT,    /* Store instructions */    [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,    [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,    [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,    [BPF_STX | BPF_MEM | BPF_DW] = 
&&STX_MEM_DW,    [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,    [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,    [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,    [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,    [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,    [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,    /* Load instructions */    [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,    [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,    [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,    [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,    [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,    [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,    [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,    [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,    [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,    [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,    [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,  };  u32 tail_call_cnt = 0;  void *ptr;  int off;
#define CONT   ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })
  FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];  ARG1 = (u64) (unsigned long) ctx;
select_insn:  goto *jumptable[insn->code];
  /* ALU */#define ALU(OPCODE, OP)        ALU64_##OPCODE##_X:        DST = DST OP SRC;      CONT;        ALU_##OPCODE##_X:        DST = (u32) DST OP (u32) SRC;      CONT;        ALU64_##OPCODE##_K:        DST = DST OP IMM;        CONT;        ALU_##OPCODE##_K:        DST = (u32) DST OP (u32) IMM;      CONT;
  ALU(ADD,  +)  ALU(SUB,  -)  ALU(AND,  &)  ALU(OR,   |)  ALU(LSH, <<)  ALU(RSH, >>)  ALU(XOR,  ^)  ALU(MUL,  *)#undef ALU  ALU_NEG:    DST = (u32) -DST;    CONT;  ALU64_NEG:    DST = -DST;    CONT;  ALU_MOV_X:    DST = (u32) SRC;    CONT;  ALU_MOV_K:    DST = (u32) IMM;    CONT;  ALU64_MOV_X:    DST = SRC;    CONT;  ALU64_MOV_K:    DST = IMM;    CONT;  LD_IMM_DW:    DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;    insn++;    CONT;  ALU64_ARSH_X:    (*(s64 *) &DST) >>= SRC;    CONT;  ALU64_ARSH_K:    (*(s64 *) &DST) >>= IMM;    CONT;  ALU64_MOD_X:    if (unlikely(SRC == 0))      return 0;    div64_u64_rem(DST, SRC, &tmp);    DST = tmp;    CONT;  ALU_MOD_X:    if (unlikely((u32)SRC == 0))      return 0;    tmp = (u32) DST;    DST = do_div(tmp, (u32) SRC);    CONT;  ALU64_MOD_K:    div64_u64_rem(DST, IMM, &tmp);    DST = tmp;    CONT;  ALU_MOD_K:    tmp = (u32) DST;    DST = do_div(tmp, (u32) IMM);    CONT;  ALU64_DIV_X:    if (unlikely(SRC == 0))      return 0;    DST = div64_u64(DST, SRC);    CONT;  ALU_DIV_X:    if (unlikely((u32)SRC == 0))      return 0;    tmp = (u32) DST;    do_div(tmp, (u32) SRC);    DST = (u32) tmp;    CONT;  ALU64_DIV_K:    DST = div64_u64(DST, IMM);    CONT;  ALU_DIV_K:    tmp = (u32) DST;    do_div(tmp, (u32) IMM);    DST = (u32) tmp;    CONT;  ALU_END_TO_BE:    switch (IMM) {    case 16:      DST = (__force u16) cpu_to_be16(DST);      break;    case 32:      DST = (__force u32) cpu_to_be32(DST);      break;    case 64:      DST = (__force u64) cpu_to_be64(DST);      break;    }    CONT;  ALU_END_TO_LE:    switch (IMM) {    case 16:      DST = (__force u16) cpu_to_le16(DST);      break;    case 32:      DST = (__force u32) cpu_to_le32(DST);      break;    case 64:      DST = (__force u64) cpu_to_le64(DST);      break;    }    CONT;
  /* CALL */  JMP_CALL:    /* Function call scratches BPF_R1-BPF_R5 registers,     * preserves BPF_R6-BPF_R9, and stores return value     * into BPF_R0.     */    BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,                   BPF_R4, BPF_R5);    CONT;
  JMP_TAIL_CALL: {    struct bpf_map *map = (struct bpf_map *) (unsigned long) BPF_R2;    struct bpf_array *array = container_of(map, struct bpf_array, map);    struct bpf_prog *prog;    u32 index = BPF_R3;
    if (unlikely(index >= array->map.max_entries))      goto out;    if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))      goto out;
    tail_call_cnt++;
    prog = READ_ONCE(array->ptrs[index]);    if (!prog)      goto out;
    /* ARG1 at this point is guaranteed to point to CTX from     * the verifier side due to the fact that the tail call is     * handeled like a helper, that is, bpf_tail_call_proto,     * where arg1_type is ARG_PTR_TO_CTX.     */    insn = prog->insnsi;    goto select_insn;out:    CONT;  }  /* JMP */  JMP_JA:    insn += insn->off;    CONT;  JMP_JEQ_X:    if (DST == SRC) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JEQ_K:    if (DST == IMM) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JNE_X:    if (DST != SRC) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JNE_K:    if (DST != IMM) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JGT_X:    if (DST > SRC) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JGT_K:    if (DST > IMM) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JGE_X:    if (DST >= SRC) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JGE_K:    if (DST >= IMM) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSGT_X:    if (((s64) DST) > ((s64) SRC)) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSGT_K:    if (((s64) DST) > ((s64) IMM)) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSGE_X:    if (((s64) DST) >= ((s64) SRC)) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSGE_K:    if (((s64) DST) >= ((s64) IMM)) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSET_X:    if (DST & SRC) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_JSET_K:    if (DST & IMM) {      insn += insn->off;      CONT_JMP;    }    CONT;  JMP_EXIT:    return BPF_R0;
  /* STX and ST and LDX*/#define LDST(SIZEOP, SIZE)              STX_MEM_##SIZEOP:                *(SIZE *)(unsigned long) (DST + insn->off) = SRC;      CONT;                ST_MEM_##SIZEOP:                *(SIZE *)(unsigned long) (DST + insn->off) = IMM;      CONT;                LDX_MEM_##SIZEOP:                DST = *(SIZE *)(unsigned long) (SRC + insn->off);      CONT;
  LDST(B,   u8)  LDST(H,  u16)  LDST(W,  u32)  LDST(DW, u64)#undef LDST  STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */    atomic_add((u32) SRC, (atomic_t *)(unsigned long)         (DST + insn->off));    CONT;  STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */    atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)           (DST + insn->off));    CONT;  LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */    off = IMM;load_word:    /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are     * only appearing in the programs where ctx ==     * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]     * == BPF_R6, bpf_convert_filter() saves it in BPF_R6,     * internal BPF verifier will check that BPF_R6 ==     * ctx.     *     * BPF_ABS and BPF_IND are wrappers of function calls,     * so they scratch BPF_R1-BPF_R5 registers, preserve     * BPF_R6-BPF_R9, and store return value into BPF_R0.     *     * Implicit input:     *   ctx == skb == BPF_R6 == CTX     *     * Explicit input:     *   SRC == any register     *   IMM == 32-bit immediate     *     * Output:     *   BPF_R0 - 8/16/32-bit skb data converted to cpu endianness     */
    ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);    if (likely(ptr != NULL)) {      BPF_R0 = get_unaligned_be32(ptr);      CONT;    }
    return 0;  LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */    off = IMM;load_half:    ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);    if (likely(ptr != NULL)) {      BPF_R0 = get_unaligned_be16(ptr);      CONT;    }
    return 0;  LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */    off = IMM;load_byte:    ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);    if (likely(ptr != NULL)) {      BPF_R0 = *(u8 *)ptr;      CONT;    }
    return 0;  LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */    off = IMM + SRC;    goto load_word;  LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */    off = IMM + SRC;    goto load_half;  LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */    off = IMM + SRC;    goto load_byte;
  default_label:
    /* If we ever reach this, we have a bug somewhere. */
    WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
    return 0;
}

3. BPF_PROG_RUN()

Whether the program has been converted into a JIT image or is run by the interpreter, a BPF program is ultimately invoked through the BPF_PROG_RUN() macro:

ret = BPF_PROG_RUN(prog, ctx);

#define BPF_PROG_RUN(filter, ctx)  (*filter->bpf_func)(ctx, filter->insnsi)
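To make the dispatch idea more concrete, here is a minimal user-space sketch, not the kernel code. It assumes a hypothetical toy instruction set (TOY_MOV_K, TOY_ADD_K, TOY_JEQ_K, TOY_EXIT) and hypothetical names (toy_insn, toy_prog, TOY_PROG_RUN). It uses a plain switch rather than the computed-goto jumptable of __bpf_prog_run(), and it does no verification of register indexes or jump targets, which the real verifier is responsible for.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy opcodes; real eBPF opcodes are encoded differently. */
enum { TOY_MOV_K, TOY_ADD_K, TOY_JEQ_K, TOY_EXIT };

struct toy_insn {
    uint8_t code;   /* opcode */
    uint8_t dst;    /* destination register index (0..3) */
    int16_t off;    /* jump offset, relative to the next instruction */
    int32_t imm;    /* immediate operand */
};

struct toy_prog {
    /* Entry point: the interpreter here; a JIT would store native code. */
    unsigned int (*run_func)(const void *ctx, const struct toy_insn *insn);
    const struct toy_insn *insns;
};

/* Same shape as BPF_PROG_RUN(): call through the stored function pointer. */
#define TOY_PROG_RUN(prog, ctx) ((*(prog)->run_func)((ctx), (prog)->insns))

static unsigned int toy_interpret(const void *ctx, const struct toy_insn *insn)
{
    uint64_t regs[4] = {0};

    (void)ctx; /* this toy program does not read its context */
    for (;;) {
        switch (insn->code) {
        case TOY_MOV_K:
            regs[insn->dst] = (uint64_t)(int64_t)insn->imm;
            break;
        case TOY_ADD_K:
            regs[insn->dst] += (uint64_t)(int64_t)insn->imm;
            break;
        case TOY_JEQ_K:
            if (regs[insn->dst] == (uint64_t)(int64_t)insn->imm)
                insn += insn->off;
            break;
        case TOY_EXIT:
            return (unsigned int)regs[0];
        default:
            return 0; /* unknown opcode: bail out, like default_label */
        }
        insn++; /* plays the role of CONT */
    }
}

/* r0 = 40; r0 += 2; if (r0 == 42) skip one insn; r0 = 0; exit */
static unsigned int toy_demo(void)
{
    static const struct toy_insn insns[] = {
        { TOY_MOV_K, 0, 0, 40 },
        { TOY_ADD_K, 0, 0, 2 },
        { TOY_JEQ_K, 0, 1, 42 },
        { TOY_MOV_K, 0, 0, 0 },
        { TOY_EXIT,  0, 0, 0 },
    };
    struct toy_prog prog = { toy_interpret, insns };

    return TOY_PROG_RUN(&prog, NULL);
}
```

Calling toy_demo() runs the five instructions through the function pointer; because the conditional jump skips the instruction that clears r0, the program exits with the value accumulated in r0.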

1.1.4. fd allocation

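For a BPF program that has been loaded into kernel space, the kernel finally allocates a file descriptor (fd) for it and stores prog in the corresponding file->private_data, so that the program can be referenced later.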

int bpf_prog_new_fd(struct bpf_prog *prog)
{
  return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
        O_RDWR | O_CLOEXEC);
}

int anon_inode_getfd(const char *name, const struct file_operations *fops,         void *priv, int flags){  int error, fd;  struct file *file;
  error = get_unused_fd_flags(flags);  if (error < 0)    return error;  fd = error;
  file = anon_inode_getfile(name, fops, priv, flags);  if (IS_ERR(file)) {    error = PTR_ERR(file);    goto err_put_unused_fd;  }  fd_install(fd, file);
  return fd;
err_put_unused_fd:  put_unused_fd(fd);  return error;}

struct file *anon_inode_getfile(const char *name,        const struct file_operations *fops,        void *priv, int flags){  struct qstr this;  struct path path;  struct file *file;
  if (IS_ERR(anon_inode_inode))    return ERR_PTR(-ENODEV);
  if (fops->owner && !try_module_get(fops->owner))    return ERR_PTR(-ENOENT);
  /*   * Link the inode to a directory entry by creating a unique name   * using the inode sequence number.   */  file = ERR_PTR(-ENOMEM);  this.name = name;  this.len = strlen(name);  this.hash = 0;  path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);  if (!path.dentry)    goto err_module;
  path.mnt = mntget(anon_inode_mnt);  /*   * We know the anon_inode inode count is always greater than zero,   * so ihold() is safe.   */  ihold(anon_inode_inode);
  d_instantiate(path.dentry, anon_inode_inode);
  file = alloc_file(&path, OPEN_FMODE(flags), fops);  if (IS_ERR(file))    goto err_dput;  file->f_mapping = anon_inode_inode->i_mapping;
  file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);  file->private_data = priv;
  return file;
err_dput:  path_put(&path);err_module:  module_put(fops->owner);  return file;}

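1.2. bpf map operations

BPF maps are used in several scenarios:

  • Interaction between a BPF program and user space: when the BPF program finishes running, it stores its results in a map for user space to read;

  • Interaction within a BPF program: BPF code may need something like a global variable, but for safety reasons BPF programs are not allowed to access global variables, so a map serves as the global variable instead;

  • BPF tail call: a tail call is a jump from one BPF program to another. The BPF program first finds the pointer to the other program through a map of type BPF_MAP_TYPE_PROG_ARRAY, and then calls the tail_call() helper function to perform the tail call;

  • Interaction between a BPF program and the kernel: a map can also act as the intermediary when exchanging data with kernel code outside the BPF program.

The map types currently supported: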

static int __init register_array_map(void)
{
  bpf_register_map_type(&array_type);
  bpf_register_map_type(&percpu_array_type);
  return 0;
}

static int __init register_cgroup_array_map(void)
{
  bpf_register_map_type(&cgroup_array_type);
  return 0;
}

static int __init register_htab_map(void)
{
  bpf_register_map_type(&htab_type);
  bpf_register_map_type(&htab_percpu_type);
  return 0;
}

static int __init register_perf_event_array_map(void)
{
  bpf_register_map_type(&perf_event_array_type);
  return 0;
}

static int __init register_prog_array_map(void)
{
  bpf_register_map_type(&prog_array_type);
  return 0;
}

static int __init register_stack_map(void)
{
  bpf_register_map_type(&stack_map_type);
  return 0;
}

Whatever the map type, a map is always used through key-value pairs.

1.2.1. Creating a map

If the user-space BPF C program defines maps, they end up compiled into __section("maps").
When the user-space loader loads the BPF program, it first walks the members of __section("maps") and calls the bpf() system call to create each map object.

static int map_create(union bpf_attr *attr)
{
  struct bpf_map *map;
  int err;

  err = CHECK_ATTR(BPF_MAP_CREATE);
  if (err)
    return -EINVAL;

  /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
  /* (1) allocate space according to the map type */
  map = find_and_alloc_map(attr);
  if (IS_ERR(map))
    return PTR_ERR(map);

  atomic_set(&map->refcnt, 1);
  atomic_set(&map->usercnt, 1);

  /* (2) charge the map's memory against the process's locked-memory limit */
  err = bpf_map_charge_memlock(map);
  if (err)
    goto free_map_nouncharge;

  /* (3) allocate a file descriptor for the map */
  err = bpf_map_new_fd(map);
  if (err < 0)
    /* failed to allocate fd */
    goto free_map;

  return err;

free_map:
  bpf_map_uncharge_memlock(map);
free_map_nouncharge:
  map->ops->map_free(map);
  return err;
}
|→
static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
{
  struct bpf_map_type_list *tl;
  struct bpf_map *map;

  list_for_each_entry(tl, &bpf_map_types, list_node) {
    if (tl->type == attr->map_type) {
      /* (1.1) find the matching tl by type and allocate the map */
      map = tl->ops->map_alloc(attr);
      if (IS_ERR(map))
        return map;
      map->ops = tl->ops;
      map->map_type = attr->map_type;
      return map;
    }
  }
  return ERR_PTR(-EINVAL);
}
|→
int bpf_map_new_fd(struct bpf_map *map)
{
  /* (3.1) allocate an fd for the map and store the map pointer in file->private_data */
  return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
        O_RDWR | O_CLOEXEC);
}

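  • 1. BPF_MAP_TYPE_ARRAY

Taking a map of type BPF_MAP_TYPE_ARRAY as an example, let's look at how a map is allocated.

The attr members passed in from user space have the following meanings:

attr->map_type: the type of the map;

attr->key_size: the size of the key;

attr->value_size: the size of the value;

attr->max_entries: the number of entries (key-value pairs) to store.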

static const struct bpf_map_ops array_ops = {
  .map_alloc = array_map_alloc,
  .map_free = array_map_free,
  .map_get_next_key = array_map_get_next_key,
  .map_lookup_elem = array_map_lookup_elem,
  .map_update_elem = array_map_update_elem,
  .map_delete_elem = array_map_delete_elem,
};

static struct bpf_map_type_list array_type __read_mostly = {
  .ops = &array_ops,
  .type = BPF_MAP_TYPE_ARRAY,
};

static struct bpf_map *array_map_alloc(union bpf_attr *attr)
{
  bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
  u32 elem_size, index_mask, max_entries;
  bool unpriv = !capable(CAP_SYS_ADMIN);
  struct bpf_array *array;
  u64 array_size, mask64;

  /* check sanity of attributes */
  if (attr->max_entries == 0 || attr->key_size != 4 ||
      attr->value_size == 0 || attr->map_flags)
    return ERR_PTR(-EINVAL);

  if (attr->value_size >= 1 << (KMALLOC_SHIFT_MAX - 1))
    /* if value_size is bigger, the user space won't be able to
     * access the elements.
     */
    return ERR_PTR(-E2BIG);

  /* (1.1.1) compute the value size; the key size need not be computed
   * or stored, because the key here is simply the index
   */
  elem_size = round_up(attr->value_size, 8);

  max_entries = attr->max_entries;

  /* On 32 bit archs roundup_pow_of_two() with max_entries that has
   * upper most bit set in u32 space is undefined behavior due to
   * resulting 1U << 32, so do it manually here in u64 space.
   */
  mask64 = fls_long(max_entries - 1);
  mask64 = 1ULL << mask64;
  mask64 -= 1;

  index_mask = mask64;
  if (unpriv) {
    /* round up array size to nearest power of 2,
     * since cpu will speculate within index_mask limits
     */
    max_entries = index_mask + 1;
    /* Check for overflows. */
    if (max_entries < attr->max_entries)
      return ERR_PTR(-E2BIG);
  }

  /* (1.1.2) compute the total size of bpf_array plus the value array;
   * bpf_array embeds the generic map structure bpf_map
   */
  array_size = sizeof(*array);
  if (percpu)
    array_size += (u64) max_entries * sizeof(void *);
  else
    array_size += (u64) max_entries * elem_size;

  /* make sure there is no u32 overflow later in round_up() */
  if (array_size >= U32_MAX - PAGE_SIZE)
    return ERR_PTR(-ENOMEM);

  /* allocate all map elements and zero-initialize them */
  /* (1.1.3) allocate the bpf_array according to the total size */
  array = bpf_map_area_alloc(array_size);
  if (!array)
    return ERR_PTR(-ENOMEM);
  array->index_mask = index_mask;
  array->map.unpriv_array = unpriv;

  /* copy mandatory map attributes */
  /* (1.1.4) copy attr into array->map */
  array->map.map_type = attr->map_type;
  array->map.key_size = attr->key_size;
  array->map.value_size = attr->value_size;
  array->map.max_entries = attr->max_entries;
  array->elem_size = elem_size;

  if (!percpu)
    goto out;

  array_size += (u64) attr->max_entries * elem_size * num_possible_cpus();

  if (array_size >= U32_MAX - PAGE_SIZE ||
      elem_size > PCPU_MIN_UNIT_SIZE || bpf_array_alloc_percpu(array)) {
    bpf_map_area_free(array);
    return ERR_PTR(-ENOMEM);
  }
out:
  array->map.pages = round_up(array_size, PAGE_SIZE) >> PAGE_SHIFT;

  return &array->map;
}
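The two size calculations in array_map_alloc() are easy to check by hand. The following user-space sketch reproduces them under stated assumptions: fls_u32() is a hypothetical stand-in for the kernel's fls_long(), and array_index_mask() and array_elem_size() are illustrative names that do not exist in the kernel.

```c
#include <stdint.h>

/* 1-based position of the highest set bit; 0 when x == 0 (like fls_long()). */
static unsigned int fls_u32(uint32_t x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* index_mask as derived in array_map_alloc(): (next power of two) - 1.
 * The shift is done in 64 bits so that a result of 1 << 32 is well defined.
 */
static uint32_t array_index_mask(uint32_t max_entries)
{
    uint64_t mask64 = 1ULL << fls_u32(max_entries - 1);

    return (uint32_t)(mask64 - 1);
}

/* Size of one value slot: value_size rounded up to a multiple of 8 bytes. */
static uint32_t array_elem_size(uint32_t value_size)
{
    return (value_size + 7u) & ~7u;
}
```

For example, a map with 5 entries gets index_mask 7, so for an unprivileged user the array is sized for 8 entries and any masked index stays inside the allocation.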

  • 2. BPF_MAP_TYPE_HASH

Taking a map of type BPF_MAP_TYPE_HASH as an example, let's look at how a map is allocated:

static const struct bpf_map_ops htab_ops = {
  .map_alloc = htab_map_alloc,
  .map_free = htab_map_free,
  .map_get_next_key = htab_map_get_next_key,
  .map_lookup_elem = htab_map_lookup_elem,
  .map_update_elem = htab_map_update_elem,
  .map_delete_elem = htab_map_delete_elem,
};

static struct bpf_map_type_list htab_type __read_mostly = {
  .ops = &htab_ops,
  .type = BPF_MAP_TYPE_HASH,
};

static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
{
  bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_HASH;
  struct bpf_htab *htab;
  int err, i;
  u64 cost;

  if (attr->map_flags & ~BPF_F_NO_PREALLOC)
    /* reserved bits should not be used */
    return ERR_PTR(-EINVAL);

  /* (1.1.1) the hash stores its elements in linked lists, so the
   * bpf_htab structure has a fixed size and is allocated first
   */
  htab = kzalloc(sizeof(*htab), GFP_USER);
  if (!htab)
    return ERR_PTR(-ENOMEM);

  /* mandatory map attributes */
  htab->map.map_type = attr->map_type;
  htab->map.key_size = attr->key_size;
  htab->map.value_size = attr->value_size;
  htab->map.max_entries = attr->max_entries;
  htab->map.map_flags = attr->map_flags;

  /* check sanity of attributes.
   * value_size == 0 may be allowed in the future to use map as a set
   */
  err = -EINVAL;
  if (htab->map.max_entries == 0 || htab->map.key_size == 0 ||
      htab->map.value_size == 0)
    goto free_htab;

  /* hash table size must be power of 2 */
  /* (1.1.2) the number of list-head buckets is max_entries rounded up
   * to a power of two
   */
  htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);

  err = -E2BIG;
  if (htab->map.key_size > MAX_BPF_STACK)
    /* eBPF programs initialize keys on stack, so they cannot be
     * larger than max stack size
     */
    goto free_htab;

  if (htab->map.value_size >= (1 << (KMALLOC_SHIFT_MAX - 1)) -
      MAX_BPF_STACK - sizeof(struct htab_elem))
    /* if value_size is bigger, the user space won't be able to
     * access the elements via bpf syscall. This check also makes
     * sure that the elem_size doesn't overflow and it's
     * kmalloc-able later in htab_map_update_elem()
     */
    goto free_htab;

  if (percpu && round_up(htab->map.value_size, 8) > PCPU_MIN_UNIT_SIZE)
    /* make sure the size for pcpu_alloc() is reasonable */
    goto free_htab;

  /* (1.1.3) size of one hash element = htab_elem + key_size + value_size */
  htab->elem_size = sizeof(struct htab_elem) +
        round_up(htab->map.key_size, 8);
  if (percpu)
    htab->elem_size += sizeof(void *);
  else
    htab->elem_size += round_up(htab->map.value_size, 8);

  /* prevent zero size kmalloc and check for u32 overflow */
  if (htab->n_buckets == 0 ||
      htab->n_buckets > U32_MAX / sizeof(struct bucket))
    goto free_htab;

  /* (1.1.4) total memory cost = bucket_size * n_buckets +
   * elem_size * max_entries + extra_element_size, where
   * extra_element_size = elem_size * num_possible_cpus()
   */
  cost = (u64) htab->n_buckets * sizeof(struct bucket) +
         (u64) htab->elem_size * htab->map.max_entries;

  if (percpu)
    cost += (u64) round_up(htab->map.value_size, 8) *
      num_possible_cpus() * htab->map.max_entries;
  else
    cost += (u64) htab->elem_size * num_possible_cpus();

  if (cost >= U32_MAX - PAGE_SIZE)
    /* make sure page count doesn't overflow */
    goto free_htab;

  htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;

  /* if map size is larger than memlock limit, reject it early */
  err = bpf_map_precharge_memlock(htab->map.pages);
  if (err)
    goto free_htab;

  err = -ENOMEM;
  /* (1.1.5) allocate the buckets */
  htab->buckets = bpf_map_area_alloc(htab->n_buckets *
             sizeof(struct bucket));
  if (!htab->buckets)
    goto free_htab;

  for (i = 0; i < htab->n_buckets; i++) {
    INIT_HLIST_HEAD(&htab->buckets[i].head);
    raw_spin_lock_init(&htab->buckets[i].lock);
  }

  /* (1.1.6) allocate the extra elems */
  if (!percpu) {
    err = alloc_extra_elems(htab);
    if (err)
      goto free_buckets;
  }

  /* (1.1.7) allocate the elems and spread them evenly over the
   * per-CPU lists of htab->freelist
   */
  if (!(attr->map_flags & BPF_F_NO_PREALLOC)) {
    err = prealloc_elems_and_freelist(htab);
    if (err)
      goto free_extra_elems;
  }

  return &htab->map;

free_extra_elems:
  free_percpu(htab->extra_elems);
free_buckets:
  bpf_map_area_free(htab->buckets);
free_htab:
  kfree(htab);
  return ERR_PTR(err);
}
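The sizing arithmetic in htab_map_alloc() for the non-percpu case can be reproduced in a few lines. This is a rough sketch under assumptions: elem_hdr and bucket_size are hypothetical parameters standing in for sizeof(struct htab_elem) and sizeof(struct bucket), whose real values depend on the kernel build, and htab_estimate() and struct htab_sizes are illustrative names.

```c
#include <stdint.h>

struct htab_sizes {
    uint64_t n_buckets;  /* max_entries rounded up to a power of two */
    uint64_t elem_size;  /* header + padded key + padded value */
    uint64_t cost;       /* total bytes charged for the map */
};

/* Smallest power of two >= n, for n >= 1. */
static uint64_t roundup_pow2(uint64_t n)
{
    uint64_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

static uint64_t round_up8(uint64_t x)
{
    return (x + 7) & ~7ULL;
}

/* Non-percpu hash map: mirrors steps (1.1.2) to (1.1.4) above. */
static struct htab_sizes htab_estimate(uint32_t key_size, uint32_t value_size,
                                       uint32_t max_entries, uint32_t ncpus,
                                       uint32_t elem_hdr, uint32_t bucket_size)
{
    struct htab_sizes s;

    s.n_buckets = roundup_pow2(max_entries);
    s.elem_size = elem_hdr + round_up8(key_size) + round_up8(value_size);
    s.cost = s.n_buckets * bucket_size       /* bucket array */
           + s.elem_size * max_entries       /* preallocated elements */
           + s.elem_size * ncpus;            /* one extra element per CPU */
    return s;
}
```

With a 4-byte key, a 12-byte value, 5 entries, 2 CPUs, and the assumed sizes elem_hdr = 48 and bucket_size = 16, this gives 8 buckets, 72 bytes per element, and a total cost of 632 bytes.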

1.2.2. Looking up a map

A lookup finds the value that corresponds to a given key.

static int map_lookup_elem(union bpf_attr *attr)
{
  void __user *ukey = u64_to_ptr(attr->key);
  void __user *uvalue = u64_to_ptr(attr->value);
  int ufd = attr->map_fd;
  struct bpf_map *map;
  void *key, *value, *ptr;
  u32 value_size;
  struct fd f;
  int err;

  if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
    return -EINVAL;

  f = fdget(ufd);
  map = __bpf_map_get(f);
  if (IS_ERR(map))
    return PTR_ERR(map);

  err = -ENOMEM;
  key = kmalloc(map->key_size, GFP_USER);
  if (!key)
    goto err_put;

  err = -EFAULT;
  if (copy_from_user(key, ukey, map->key_size) != 0)
    goto free_key;

  if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
      map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
    value_size = round_up(map->value_size, 8) * num_possible_cpus();
  else
    value_size = map->value_size;

  err = -ENOMEM;
  value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
  if (!value)
    goto free_key;

  /* (1) special handling for a few map types */
  if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH) {
    err = bpf_percpu_hash_copy(map, key, value);
  } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
    err = bpf_percpu_array_copy(map, key, value);
  } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
    err = bpf_stackmap_copy(map, key, value);
  /* (2) every other map type calls map->ops->map_lookup_elem() */
  } else {
    rcu_read_lock();
    ptr = map->ops->map_lookup_elem(map, key);
    /* (3.1) copy the result into value */
    if (ptr)
      memcpy(value, ptr, value_size);
    rcu_read_unlock();
    err = ptr ? 0 : -ENOENT;
  }

  if (err)
    goto free_value;

  err = -EFAULT;
  /* (3.2) copy the value back to user space */
  if (copy_to_user(uvalue, value, value_size) != 0)
    goto free_value;

  err = 0;

free_value:
  kfree(value);
free_key:
  kfree(key);
err_put:
  fdput(f);
  return err;
}

  • 1. BPF_MAP_TYPE_ARRAY

For a map of type BPF_MAP_TYPE_ARRAY, the lookup ends up in array_map_lookup_elem():

static void *array_map_lookup_elem(struct bpf_map *map, void *key)
{
  struct bpf_array *array = container_of(map, struct bpf_array, map);
  /* (2.1) the key is the index */
  u32 index = *(u32 *)key;

  if (unlikely(index >= array->map.max_entries))
    return NULL;

  /* (2.2) use the index to find the value pointer in the array->value[] array */
  return array->value + array->elem_size * (index & array->index_mask);
}
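The array lookup is just bounds-checked pointer arithmetic. A minimal user-space sketch, assuming a hypothetical struct toy_array that keeps only the fields the lookup needs (the real struct bpf_array embeds struct bpf_map and has more members):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct toy_array {
    uint32_t max_entries;  /* number of valid indexes */
    uint32_t elem_size;    /* value_size rounded up to 8 bytes */
    uint32_t index_mask;   /* (power of two >= max_entries) - 1 */
    unsigned char *value;  /* backing storage for all slots */
};

/* Returns a pointer to the slot for *key, or NULL if out of range. */
static void *toy_array_lookup(struct toy_array *array, const void *key)
{
    uint32_t index;

    memcpy(&index, key, sizeof(index)); /* the key is a 4-byte index */
    if (index >= array->max_entries)
        return NULL;
    return array->value + (size_t)array->elem_size * (index & array->index_mask);
}

/* Returns 1 if slot 2 resolves to offset 16 and index 3 is rejected. */
static int toy_array_demo(void)
{
    static unsigned char storage[4 * 8];
    struct toy_array array = { 3, 8, 3, storage };
    uint32_t in_range = 2, out_of_range = 3;

    return toy_array_lookup(&array, &in_range) == storage + 16 &&
           toy_array_lookup(&array, &out_of_range) == NULL;
}
```

The mask is redundant after the bounds check in ordinary execution; according to the comment in array_map_alloc(), it is there because the CPU may speculate past the check.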

  • 2. BPF_MAP_TYPE_HASH

For a map of type BPF_MAP_TYPE_HASH, the lookup ends up in htab_map_lookup_elem():

static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
{
  struct htab_elem *l = __htab_map_lookup_elem(map, key);

  if (l)
    return l->key + round_up(map->key_size, 8);

  return NULL;
}

static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
{
  struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
  struct hlist_head *head;
  struct htab_elem *l;
  u32 hash, key_size;

  /* Must be called with rcu_read_lock. */
  WARN_ON_ONCE(!rcu_read_lock_held());

  key_size = map->key_size;

  /* (2.1) compute the hash from the key */
  hash = htab_map_hash(key, key_size);

  /* (2.2) use the hash to find the list-head bucket */
  head = select_bucket(htab, hash);

  /* (2.3) search the bucket list for an htab_elem with an equal key;
   * return NULL if none is found
   */
  l = lookup_elem_raw(head, hash, key, key_size);

  return l;
}
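The three lookup steps (hash the key, pick a bucket, walk the chain) can be sketched in user space as follows. Everything here is illustrative: the toy_ names are hypothetical, the key is fixed at 4 bytes, FNV-1a stands in for the kernel's htab_map_hash(), and there is no locking or RCU.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_elem {
    struct toy_elem *next;  /* chain within one bucket */
    uint32_t hash;          /* cached hash of the key */
    uint32_t key;
    uint64_t value;
};

struct toy_htab {
    uint32_t n_buckets;         /* must be a power of two */
    struct toy_elem **buckets;  /* array of chain heads */
};

/* FNV-1a over the key bytes; a stand-in for htab_map_hash(). */
static uint32_t toy_hash(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Power-of-two bucket count lets us mask instead of taking a modulo. */
static struct toy_elem **toy_select_bucket(struct toy_htab *h, uint32_t hash)
{
    return &h->buckets[hash & (h->n_buckets - 1)];
}

static void toy_insert(struct toy_htab *h, struct toy_elem *e)
{
    struct toy_elem **head;

    e->hash = toy_hash(&e->key, sizeof(e->key));
    head = toy_select_bucket(h, e->hash);
    e->next = *head;
    *head = e;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
static uint64_t *toy_lookup(struct toy_htab *h, uint32_t key)
{
    uint32_t hash = toy_hash(&key, sizeof(key));
    struct toy_elem *e;

    for (e = *toy_select_bucket(h, hash); e; e = e->next)
        if (e->hash == hash && e->key == key)
            return &e->value;
    return NULL;
}

/* Returns 1 if a present key is found and an absent key is not. */
static int toy_htab_demo(void)
{
    struct toy_elem *buckets[4] = { NULL, NULL, NULL, NULL };
    struct toy_htab h = { 4, buckets };
    struct toy_elem a = { NULL, 0, 1, 10 };
    struct toy_elem b = { NULL, 0, 2, 20 };
    struct toy_elem c = { NULL, 0, 17, 170 };
    uint64_t *v;

    toy_insert(&h, &a);
    toy_insert(&h, &b);
    toy_insert(&h, &c);
    v = toy_lookup(&h, 2);
    return v != NULL && *v == 20 && toy_lookup(&h, 99) == NULL;
}
```

Comparing the cached hash before the key is a cheap filter for chains that hold elements with different hashes in the same bucket.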

1.2.3. BPF_FUNC_map_lookup_elem

User space looks up the value for a key through the bpf() system call. A BPF program also needs to find the address of a value from its key and then use it inside the program; it does so by calling the BPF_FUNC_map_lookup_elem helper function.

Taking perf_event as an example, let's look at how the BPF_FUNC_map_lookup_elem helper function is implemented:

static const struct bpf_verifier_ops perf_event_prog_ops = {  .get_func_proto    = tp_prog_func_proto,  .is_valid_access  = pe_prog_is_valid_access,  .convert_ctx_access  = pe_prog_convert_ctx_access,};
static struct bpf_prog_type_list perf_event_tl = {  .ops  = &perf_event_prog_ops,  .type  = BPF_PROG_TYPE_PERF_EVENT,};

static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id){  switch (func_id) {  case BPF_FUNC_perf_event_output:    return &bpf_perf_event_output_proto_tp;  case BPF_FUNC_get_stackid:    return &bpf_get_stackid_proto_tp;  default:    return tracing_func_proto(func_id);  }}

static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id){  switch (func_id) {  case BPF_FUNC_map_lookup_elem:    return &bpf_map_lookup_elem_proto;  case BPF_FUNC_map_update_elem:    return &bpf_map_update_elem_proto;  case BPF_FUNC_map_delete_elem:    return &bpf_map_delete_elem_proto;  case BPF_FUNC_probe_read:    return &bpf_probe_read_proto;  case BPF_FUNC_ktime_get_ns:    return &bpf_ktime_get_ns_proto;  case BPF_FUNC_tail_call:    return &bpf_tail_call_proto;  case BPF_FUNC_get_current_pid_tgid:    return &bpf_get_current_pid_tgid_proto;  case BPF_FUNC_get_current_task:    return &bpf_get_current_task_proto;  case BPF_FUNC_get_current_uid_gid:    return &bpf_get_current_uid_gid_proto;  case BPF_FUNC_get_current_comm:    return &bpf_get_current_comm_proto;  case BPF_FUNC_trace_printk:    return bpf_get_trace_printk_proto();  case BPF_FUNC_get_smp_processor_id:    return &bpf_get_smp_processor_id_proto;  case BPF_FUNC_perf_event_read:    return &bpf_perf_event_read_proto;  case BPF_FUNC_probe_write_user:    return bpf_get_probe_write_proto();  case BPF_FUNC_current_task_under_cgroup:    return &bpf_current_task_under_cgroup_proto;  case BPF_FUNC_get_prandom_u32:    return &bpf_get_prandom_u32_proto;  default:    return NULL;  }}

const struct bpf_func_proto bpf_map_lookup_elem_proto = {  .func    = bpf_map_lookup_elem,  .gpl_only  = false,  .pkt_access  = true,  .ret_type  = RET_PTR_TO_MAP_VALUE_OR_NULL,  .arg1_type  = ARG_CONST_MAP_PTR,  .arg2_type  = ARG_PTR_TO_MAP_KEY,};

BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key){  WARN_ON_ONCE(!rcu_read_lock_held());  return (unsigned long) map->ops->map_lookup_elem(map, key);}

As with the bpf() system call, the function finally called is map->ops->map_lookup_elem(). The difference is that the BPF program gets back a pointer to the value, whereas the bpf() system call returns a copy of the value itself.
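The practical consequence of pointer versus copy can be shown with a toy model. In this sketch the names (toy_values, toy_helper_lookup, toy_syscall_lookup) are hypothetical and the "map" is just a static array; only the calling convention mirrors the two paths described above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint64_t toy_values[4]; /* the map's own storage */

/* Helper-style lookup: hands back a pointer into the map's storage. */
static uint64_t *toy_helper_lookup(uint32_t key)
{
    return key < 4 ? &toy_values[key] : NULL;
}

/* Syscall-style lookup: copies the value out into the caller's buffer. */
static int toy_syscall_lookup(uint32_t key, uint64_t *out)
{
    uint64_t *p = toy_helper_lookup(key);

    if (!p)
        return -1;
    memcpy(out, p, sizeof(*out));
    return 0;
}

/* Returns 1 if pointer writes reach the map and copy writes do not. */
static int toy_pointer_vs_copy_demo(void)
{
    uint64_t copy = 0;
    uint64_t *p = toy_helper_lookup(1);

    *p = 5;                       /* write through the pointer: map changes */
    toy_syscall_lookup(1, &copy); /* copy now holds 5 */
    copy = 99;                    /* only the private copy changes */
    *p += 1;                      /* map value becomes 6 */
    return toy_values[1] == 6 && copy == 99;
}
```

This is why a BPF program can update a counter in place through the returned pointer, while user space has to issue a separate update call to change a value.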

The map helper functions BPF_FUNC_map_update_elem and BPF_FUNC_map_delete_elem are also available and work on the same principle.

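1.3. obj pin

The system binds both bpf_prog and bpf_map to file descriptors. This has advantages: for example, user space can apply the generic file operations to them. It also has drawbacks: an fd lives in a single process's space, so other processes cannot access it, and once the owning process exits these objects become unreachable.

The system therefore also supports making bpf objects globally visible, by binding them to a dedicated file system: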

# ls /sys/fs/bpf/
#

This involves two operations: pin and get.

1.3.1. bpf_obj_pin()

static int bpf_obj_pin(const union bpf_attr *attr)
{
  if (CHECK_ATTR(BPF_OBJ))
    return -EINVAL;

  return bpf_obj_pin_user(attr->bpf_fd, u64_to_ptr(attr->pathname));
}

int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
{
  struct filename *pname;
  enum bpf_type type;
  void *raw;
  int ret;

  /* (1) get the path from the user-supplied string */
  pname = getname(pathname);
  if (IS_ERR(pname))
    return PTR_ERR(pname);

  /* (2) get the bpf_map/bpf_prog object from the fd */
  raw = bpf_fd_probe_obj(ufd, &type);
  if (IS_ERR(raw)) {
    ret = PTR_ERR(raw);
    goto out;
  }

  /* (3) create the file node and link it to the bpf object */
  ret = bpf_obj_do_pin(pname, raw, type);
  if (ret != 0)
    bpf_any_put(raw, type);
out:
  putname(pname);
  return ret;
}
|→
static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
{
  void *raw;

  /* (2.1) try to get a map object from the fd */
  *type = BPF_TYPE_MAP;
  raw = bpf_map_get_with_uref(ufd);
  if (IS_ERR(raw)) {
    /* (2.2) if that fails, try to get a prog object from the fd */
    *type = BPF_TYPE_PROG;
    raw = bpf_prog_get(ufd);
  }

  return raw;
}
|→
static int bpf_obj_do_pin(const struct filename *pathname, void *raw,
        enum bpf_type type)
{
  struct dentry *dentry;
  struct inode *dir;
  struct path path;
  umode_t mode;
  dev_t devt;
  int ret;

  /* (3.1) create the dentry */
  dentry = kern_path_create(AT_FDCWD, pathname->name, &path, 0);
  if (IS_ERR(dentry))
    return PTR_ERR(dentry);

  mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());

  /* (3.2) the type is stored in devt */
  devt = MKDEV(UNNAMED_MAJOR, type);

  ret = security_path_mknod(&path, dentry, mode, devt);
  if (ret)
    goto out;

  dir = d_inode(path.dentry);
  if (dir->i_op != &bpf_dir_iops) {
    ret = -EPERM;
    goto out;
  }

  /* (3.3) stash the object pointer raw in dentry->d_fsdata, then create the inode */
  dentry->d_fsdata = raw;
  ret = vfs_mknod(dir, dentry, mode, devt);
  dentry->d_fsdata = NULL;
out:
  done_path_create(&path, dentry);
  return ret;
}

1.3.2. bpf_obj_get()

static int bpf_obj_get(const union bpf_attr *attr)
{
  if (CHECK_ATTR(BPF_OBJ) || attr->bpf_fd != 0)
    return -EINVAL;

  return bpf_obj_get_user(u64_to_ptr(attr->pathname));
}

int bpf_obj_get_user(const char __user *pathname)
{
  enum bpf_type type = BPF_TYPE_UNSPEC;
  struct filename *pname;
  int ret = -ENOENT;
  void *raw;

  /* (1) get the path from the user-supplied string */
  pname = getname(pathname);
  if (IS_ERR(pname))
    return PTR_ERR(pname);

  /* (2) from the path, find the bpf object's raw pointer and type in the corresponding inode */
  raw = bpf_obj_do_get(pname, &type);
  if (IS_ERR(raw)) {
    ret = PTR_ERR(raw);
    goto out;
  }

  /* (3) depending on the object type, allocate an fd for the bpf object in this process */
  if (type == BPF_TYPE_PROG)
    ret = bpf_prog_new_fd(raw);
  else if (type == BPF_TYPE_MAP)
    ret = bpf_map_new_fd(raw);
  else
    goto out;

  if (ret < 0)
    bpf_any_put(raw, type);
out:
  putname(pname);
  return ret;
}

static void *bpf_obj_do_get(const struct filename *pathname,
          enum bpf_type *type)
{
  struct inode *inode;
  struct path path;
  void *raw;
  int ret;

  /* (2.1) Resolve the path to a dentry */
  ret = kern_path(pathname->name, LOOKUP_FOLLOW, &path);
  if (ret)
    return ERR_PTR(ret);

  /* (2.2) Get the inode from the dentry and check write permission */
  inode = d_backing_inode(path.dentry);
  ret = inode_permission(inode, MAY_WRITE);
  if (ret)
    goto out;

  /* (2.3) Get the object type from the inode */
  ret = bpf_inode_type(inode, type);
  if (ret)
    goto out;

  /* (2.4) Use the inode and type to take a reference on the raw pointer */
  raw = bpf_any_get(inode->i_private, *type);
  if (!IS_ERR(raw))
    touch_atime(&path);

  path_put(&path);
  return raw;
out:
  path_put(&path);
  return ERR_PTR(ret);
}
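Seen from user space, the whole bpf_obj_get() path is reached through a single bpf() call with the BPF_OBJ_GET command. Below is a minimal sketch, assuming a Linux system with the uapi header linux/bpf.h; the helper names fill_obj_get_attr() and obj_get() are hypothetical, and any path passed in is assumed to live on a mounted bpf filesystem.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc does not export a bpf() wrapper, so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Hypothetical helper: prepare the attr for BPF_OBJ_GET.
 * Zeroing the union keeps bpf_fd at 0, which the kernel-side
 * bpf_obj_get() shown above requires (otherwise -EINVAL). */
void fill_obj_get_attr(union bpf_attr *attr, const char *path)
{
    assert(attr != NULL && path != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->pathname = (uint64_t)(uintptr_t)path;
}

/* Hypothetical helper: returns a new fd for the pinned program or map,
 * or -1 with errno set. */
int obj_get(const char *path)
{
    union bpf_attr attr;

    fill_obj_get_attr(&attr, path);
    return (int)sys_bpf(BPF_OBJ_GET, &attr);
}
```

A caller would typically invoke obj_get() with the path that was used when pinning and then treat the returned fd like any other program or map fd. Note that the inode_permission(inode, MAY_WRITE) check in step (2.2) means the caller needs write permission on the pinned file.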

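2. Tracing-type BPF programs

After the steps in the previous section, the BPF program and its maps have been loaded into the kernel. When does the BPF program actually take effect?

The subsystem that uses BPF has to attach the program to a suitable hook. When the code path containing the hook executes, the hook fires and the BPF program runs.

The subsystems that currently use BPF fall into two broad classes:

  • tracing: kprobe, tracepoint, perf_event

  • filter: sk_filter, sched_cls, sched_act, xdp, cg_skb

Let us look closely at how the tracing subsystems use BPF. All tracing-type BPF operations go through perf.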

2.1、Binding a BPF program

Once the perf_event_open() system call has created a perf_event and returned a file descriptor, the ioctl command PERF_EVENT_IOC_SET_BPF binds an already loaded BPF program to that perf_event.
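The user-space side of this binding can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's own code: it assumes a Linux system with linux/perf_event.h, a prog_fd obtained earlier from the bpf() system call, and a tracepoint id that the caller has read from tracefs. The helper names fill_tracepoint_attr() and attach_prog_to_tracepoint() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc has no perf_event_open() wrapper, so go through syscall(2). */
static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
{
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: describe a PERF_TYPE_TRACEPOINT event.
 * tp_id is assumed to be the numeric id of the tracepoint. */
void fill_tracepoint_attr(struct perf_event_attr *attr, unsigned long long tp_id)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_TRACEPOINT;
    attr->config = tp_id;
    attr->sample_period = 1;
    attr->wakeup_events = 1;
}

/* Hypothetical helper: open the event on one CPU for all tasks, bind the
 * BPF program, and enable the event. Returns the event fd or -1. */
int attach_prog_to_tracepoint(int prog_fd, unsigned long long tp_id, int cpu)
{
    struct perf_event_attr attr;
    int efd, saved;

    fill_tracepoint_attr(&attr, tp_id);
    efd = sys_perf_event_open(&attr, -1 /* all tasks */, cpu, -1, 0);
    if (efd < 0)
        return -1;

    if (ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0 ||
        ioctl(efd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
        saved = errno;      /* keep the ioctl error across close() */
        close(efd);
        errno = saved;
        return -1;
    }
    return efd;
}
```

The program stays attached for as long as the event fd remains open, so the caller should keep efd around and close it only when tracing should stop.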

static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
  struct perf_event *event = file->private_data;
  struct perf_event_context *ctx;
  long ret;

  ctx = perf_event_ctx_lock(event);
  ret = _perf_ioctl(event, cmd, arg);
  perf_event_ctx_unlock(event, ctx);

  return ret;
}

static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
{
  void (*func)(struct perf_event *);
  u32 flags = arg;

  switch (cmd) {
  case PERF_EVENT_IOC_ENABLE:
    func = _perf_event_enable;
    break;
  case PERF_EVENT_IOC_DISABLE:
    func = _perf_event_disable;
    break;
  case PERF_EVENT_IOC_RESET:
    func = _perf_event_reset;
    break;

  case PERF_EVENT_IOC_REFRESH:
    return _perf_event_refresh(event, arg);

  case PERF_EVENT_IOC_PERIOD:
    return perf_event_period(event, (u64 __user *)arg);

  case PERF_EVENT_IOC_ID:
  {
    u64 id = primary_event_id(event);

    if (copy_to_user((void __user *)arg, &id, sizeof(id)))
      return -EFAULT;
    return 0;
  }

  case PERF_EVENT_IOC_SET_OUTPUT:
  {
    int ret;
    if (arg != -1) {
      struct perf_event *output_event;
      struct fd output;
      ret = perf_fget_light(arg, &output);
      if (ret)
        return ret;
      output_event = output.file->private_data;
      ret = perf_event_set_output(event, output_event);
      fdput(output);
    } else {
      ret = perf_event_set_output(event, NULL);
    }
    return ret;
  }

  case PERF_EVENT_IOC_SET_FILTER:
    return perf_event_set_filter(event, (void __user *)arg);

  case PERF_EVENT_IOC_SET_BPF:
    return perf_event_set_bpf_prog(event, arg);

  case PERF_EVENT_IOC_PAUSE_OUTPUT: {
    struct ring_buffer *rb;

    rcu_read_lock();
    rb = rcu_dereference(event->rb);
    if (!rb || !rb->nr_pages) {
      rcu_read_unlock();
      return -EINVAL;
    }
    rb_toggle_paused(rb, !!arg);
    rcu_read_unlock();
    return 0;
  }
  default:
    return -ENOTTY;
  }

  if (flags & PERF_IOC_FLAG_GROUP)
    perf_event_for_each(event, func);
  else
    perf_event_for_each_child(event, func);

  return 0;
}

static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
{
  bool is_kprobe, is_tracepoint;
  struct bpf_prog *prog;

  /* (1) A perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE
   *     must be bound to a BPF prog of type BPF_PROG_TYPE_PERF_EVENT:
   *       event->prog = prog;
   */
  if (event->attr.type == PERF_TYPE_HARDWARE ||
      event->attr.type == PERF_TYPE_SOFTWARE)
    return perf_event_set_bpf_handler(event, prog_fd);

  if (event->attr.type != PERF_TYPE_TRACEPOINT)
    return -EINVAL;

  if (event->tp_event->prog)
    return -EEXIST;

  is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
  is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
  if (!is_kprobe && !is_tracepoint)
    /* bpf programs can only be attached to u/kprobe or tracepoint */
    return -EINVAL;

  prog = bpf_prog_get(prog_fd);
  if (IS_ERR(prog))
    return PTR_ERR(prog);

  /* (2) A TRACE_EVENT_FL_TRACEPOINT perf_event must be bound to a BPF prog
   *     of type BPF_PROG_TYPE_TRACEPOINT; a TRACE_EVENT_FL_UKPROBE perf_event
   *     must be bound to a BPF prog of type BPF_PROG_TYPE_KPROBE:
   *       event->tp_event->prog = prog;
   */
  if ((is_kprobe && prog->type != BPF_PROG_TYPE_KPROBE) ||
      (is_tracepoint && prog->type != BPF_PROG_TYPE_TRACEPOINT)) {
    /* valid fd, but invalid bpf program type */
    bpf_prog_put(prog);
    return -EINVAL;
  }

  /* (3) For a tracepoint perf_event, the highest context offset the program
   *     accesses (max_ctx_offset) must not exceed the size of the tracepoint
   *     record; otherwise the access is treated as illegal.
   */
  if (is_tracepoint) {
    int off = trace_event_get_offsets(event->tp_event);

    if (prog->aux->max_ctx_offset > off) {
      bpf_prog_put(prog);
      return -EACCES;
    }
  }
  event->tp_event->prog = prog;
  event->tp_event->bpf_prog_owner = event;

  return 0;
}

As shown above, the rules for binding a bpf_prog to a perf_event are:

  • A perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE must be bound to a BPF prog of type BPF_PROG_TYPE_PERF_EVENT. The program is stored in event->prog.

  • A PERF_TYPE_TRACEPOINT perf_event implemented with TRACE_EVENT_FL_TRACEPOINT must be bound to a BPF prog of type BPF_PROG_TYPE_TRACEPOINT. The program is stored in event->tp_event->prog.

  • A PERF_TYPE_TRACEPOINT perf_event implemented with TRACE_EVENT_FL_UKPROBE must be bound to a BPF prog of type BPF_PROG_TYPE_KPROBE. The program is stored in event->tp_event->prog.
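The three rules can be condensed into a small decision function. The sketch below is a simplified user-space model, not kernel code: the enums, flag values, and the function check_attach() are hypothetical stand-ins, the context-offset check is left out, and the hardware/software branch is assumed to reject a mismatched program type with -EINVAL (the helper that handles that branch is not shown in the listing above).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's event types, flags and prog types. */
enum ev_type { EV_HARDWARE, EV_SOFTWARE, EV_TRACEPOINT, EV_OTHER };
enum prog_type { PROG_KPROBE, PROG_TRACEPOINT, PROG_PERF_EVENT };
#define FL_TRACEPOINT 0x1u
#define FL_UKPROBE    0x2u

/* Returns 0 if the binding is allowed, otherwise a negative errno in the
 * style of perf_event_set_bpf_prog(). has_prog models tp_event->prog != NULL
 * and is only consulted on the tracepoint path in this model. */
int check_attach(enum ev_type ev, unsigned int flags,
                 enum prog_type prog, bool has_prog)
{
    bool is_kprobe, is_tracepoint;

    assert((flags & ~(FL_TRACEPOINT | FL_UKPROBE)) == 0); /* known flags only */

    if (ev == EV_HARDWARE || ev == EV_SOFTWARE)
        return prog == PROG_PERF_EVENT ? 0 : -EINVAL;

    if (ev != EV_TRACEPOINT)
        return -EINVAL;
    if (has_prog)
        return -EEXIST;         /* only one program per trace event */

    is_kprobe = flags & FL_UKPROBE;
    is_tracepoint = flags & FL_TRACEPOINT;
    if (!is_kprobe && !is_tracepoint)
        return -EINVAL;

    if ((is_kprobe && prog != PROG_KPROBE) ||
        (is_tracepoint && prog != PROG_TRACEPOINT))
        return -EINVAL;
    return 0;
}
```

For example, check_attach(EV_TRACEPOINT, FL_UKPROBE, PROG_TRACEPOINT, false) yields -EINVAL, because a kprobe-backed event only accepts a kprobe-type program.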

2.2、Executing a BPF program

Because the different kinds of perf_event take different execution paths, we describe them separately.

  • 1. perf_event of type PERF_TYPE_HARDWARE or PERF_TYPE_SOFTWARE.

static void bpf_overflow_handler(struct perf_event *event,
         struct perf_sample_data *data,
         struct pt_regs *regs)
{
  /* (1) Build the context */
  struct bpf_perf_event_data_kern ctx = {
    .data = data,
    .regs = regs,
  };
  int ret = 0;

  preempt_disable();
  if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
    goto out;
  rcu_read_lock();

  /* (2) Run the bpf_prog */
  ret = BPF_PROG_RUN(event->prog, (void *)&ctx);
  rcu_read_unlock();
out:
  __this_cpu_dec(bpf_prog_active);
  preempt_enable();
  if (!ret)
    return;

  /* (3) Normal perf_event data processing (skipped when the program returns 0) */
  event->orig_overflow_handler(event, data, regs);
}

  • 2. PERF_TYPE_TRACEPOINT perf_event implemented with TRACE_EVENT_FL_TRACEPOINT.

/* Body of the DECLARE_EVENT_CLASS() macro; line-continuation backslashes omitted */
static notrace void
perf_trace_##call(void *__data, proto)
{
  struct trace_event_call *event_call = __data;
  struct trace_event_data_offsets_##call __maybe_unused __data_offsets;
  struct trace_event_raw_##call *entry;
  struct bpf_prog *prog = event_call->prog;
  struct pt_regs *__regs;
  u64 __count = 1;
  struct task_struct *__task = NULL;
  struct hlist_head *head;
  int __entry_size;
  int __data_size;
  int rctx;

  __data_size = trace_event_get_offsets_##call(&__data_offsets, args);

  head = this_cpu_ptr(event_call->perf_events);
  if (!prog && __builtin_constant_p(!__task) && !__task &&
        hlist_empty(head))
    return;

  __entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),
           sizeof(u64));
  __entry_size -= sizeof(u32);

  entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx);
  if (!entry)
    return;

  perf_fetch_caller_regs(__regs);

  tstruct

  { assign; }

  perf_trace_run_bpf_submit(entry, __entry_size, rctx,
          event_call, __count, __regs,
          head, __task);
}

void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
             struct trace_event_call *call, u64 count,
             struct pt_regs *regs, struct hlist_head *head,
             struct task_struct *task)
{
  struct bpf_prog *prog = call->prog;

  /* (1) Run the bpf_prog */
  if (prog) {
    *(struct pt_regs **)raw_data = regs;
    if (!trace_call_bpf(prog, raw_data) || hlist_empty(head)) {
      perf_swevent_put_recursion_context(rctx);
      return;
    }
  }

  /* (2) Normal perf_event data processing */
  perf_tp_event(call->event.type, count, raw_data, size, regs, head,
          rctx, task);
}

unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
{
  unsigned int ret;

  if (in_nmi()) /* not supported yet */
    return 1;

  preempt_disable();

  if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
    /*
     * since some bpf program is already running on this cpu,
     * don't call into another bpf program (same or different)
     * and don't send kprobe event into ring-buffer,
     * so return zero here
     */
    ret = 0;
    goto out;
  }

  rcu_read_lock();

  /* (1.1) Invoke the bpf_prog */
  ret = BPF_PROG_RUN(prog, ctx);
  rcu_read_unlock();

 out:
  __this_cpu_dec(bpf_prog_active);
  preempt_enable();

  return ret;
}
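The bpf_prog_active counter in trace_call_bpf() acts as a recursion guard: while one BPF program runs on a CPU, a nested invocation on that CPU is refused and reports 0, which the callers treat as "drop this event". The following is a single-CPU user-space model of that guard only. All names are hypothetical, and preemption control, RCU, and the in_nmi() check are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Models the per-CPU bpf_prog_active counter for a single CPU. */
int model_prog_active;

/* Records what a nested invocation returned (99 = not yet run). */
unsigned int inner_result = 99;

typedef unsigned int (*prog_fn)(void *ctx);

/* Hypothetical model of the guard logic in trace_call_bpf(). */
unsigned int model_trace_call_bpf(prog_fn prog, void *ctx)
{
    unsigned int ret;

    assert(prog != NULL);
    if (++model_prog_active != 1) {
        ret = 0;            /* a program is already running: refuse */
        goto out;
    }
    ret = prog(ctx);
out:
    model_prog_active--;
    return ret;
}

/* A program that simply lets the event through. */
unsigned int leaf_prog(void *ctx)
{
    (void)ctx;
    return 1;
}

/* A program whose execution triggers another traced event. */
unsigned int nested_prog(void *ctx)
{
    inner_result = model_trace_call_bpf(leaf_prog, ctx);
    return 1;
}
```

Calling model_trace_call_bpf(nested_prog, NULL) returns 1 for the outer program, while the inner call is refused and leaves inner_result at 0. The counter is incremented and decremented on both paths, so it returns to 0 afterwards.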

  • 3. PERF_TYPE_TRACEPOINT perf_event implemented with TRACE_EVENT_FL_UKPROBE.

Implementation for the kprobe type:

static void
kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
  struct trace_event_call *call = &tk->tp.call;
  struct bpf_prog *prog = call->prog;
  struct kprobe_trace_entry_head *entry;
  struct hlist_head *head;
  int size, __size, dsize;
  int rctx;

  /* (1) Run the bpf_prog; a return value of 0 drops the event */
  if (prog && !trace_call_bpf(prog, regs))
    return;

  head = this_cpu_ptr(call->perf_events);
  if (hlist_empty(head))
    return;

  dsize = __get_data_size(&tk->tp, regs);
  __size = sizeof(*entry) + tk->tp.size + dsize;
  size = ALIGN(__size + sizeof(u32), sizeof(u64));
  size -= sizeof(u32);

  entry = perf_trace_buf_alloc(size, NULL, &rctx);
  if (!entry)
    return;

  entry->ip = (unsigned long)tk->rp.kp.addr;
  memset(&entry[1], 0, dsize);
  store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);

  /* (2) Normal perf_event data processing */
  perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
            head, NULL);
}

Implementation for the kretprobe type:

static void
kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
        struct pt_regs *regs)
{
  struct trace_event_call *call = &tk->tp.call;
  struct bpf_prog *prog = call->prog;
  struct kretprobe_trace_entry_head *entry;
  struct hlist_head *head;
  int size, __size, dsize;
  int rctx;

  /* (1) Run the bpf_prog; a return value of 0 drops the event */
  if (prog && !trace_call_bpf(prog, regs))
    return;

  head = this_cpu_ptr(call->perf_events);
  if (hlist_empty(head))
    return;

  dsize = __get_data_size(&tk->tp, regs);
  __size = sizeof(*entry) + tk->tp.size + dsize;
  size = ALIGN(__size + sizeof(u32), sizeof(u64));
  size -= sizeof(u32);

  entry = perf_trace_buf_alloc(size, NULL, &rctx);
  if (!entry)
    return;

  entry->func = (unsigned long)tk->rp.kp.addr;
  entry->ret_ip = (unsigned long)ri->ret_addr;
  store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);

  /* (2) Normal perf_event data processing */
  perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
            head, NULL);
}
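Both probe handlers, like the tracepoint macro earlier, size the record with ALIGN(__size + sizeof(u32), sizeof(u64)) followed by size -= sizeof(u32). The likely intent, stated here as an assumption, is that the record plus a 32-bit size field placed in front of it ends on an 8-byte boundary. Below is a worked user-space sketch of that arithmetic; align_up() and perf_entry_size() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a, where a is a power of two.
 * Same arithmetic as the kernel's ALIGN() macro. */
static size_t align_up(size_t x, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Hypothetical helper reproducing the two sizing lines from the handlers:
 * reserve room for a u32, round up to 8 bytes, then take the u32 back out. */
size_t perf_entry_size(size_t raw_size)
{
    size_t size = align_up(raw_size + sizeof(uint32_t), sizeof(uint64_t));

    return size - sizeof(uint32_t);
}
```

For a 21-byte record the steps are 21 + 4 = 25, rounded up to 32, minus 4, giving 28. In every case the result plus 4 is a multiple of 8, which is the property the two lines are there to guarantee.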

3. Filter-type BPF programs

Not analyzed for now.


Original title: BPF Kernel Implementation in Detail

Source: WeChat official account Linux閱碼場 (WeChat ID: LinuxDev). Please credit the source when reposting.

Reviewing editor: 湯梓紅
