Knowing a Thing or Two About Cilium

What is Cilium?

Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes.

At the foundation of Cilium is a new Linux kernel technology called BPF, which enables the dynamic insertion of powerful security visibility and control logic within Linux itself. Because BPF runs inside the Linux kernel, Cilium security policies can be applied and updated without any changes to the application code or container configuration.

Why Cilium?

[Figure 1]

eBPF Architecture

[Figure 2]

XDP

Cilium makes heavy use of network-related BPF hooks such as XDP and TC to implement high-performance RX and TX paths.

XDP stands for eXpress Data Path and sits at the lowest layer of the Linux kernel's network stack. It exists only on the RX path and allows packets to be processed at the earliest possible point after they enter the stack, inside the network device driver; in certain modes, processing completes before the OS has even allocated a socket buffer (skb).

Let's try writing a small XDP program, xdp-example.c:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) \
    __attribute__((section(NAME), used))
#endif

/* Attached at the XDP hook: drop every packet that reaches this interface. */
__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return XDP_DROP;
}

char __license[] __section("license") = "GPL";

Let's see what the compilation actually does, phase by phase:

root@ubuntu:~# clang -ccc-print-phases -O2 -Wall -target bpf -c xdp-example.c -o xdp-example.o
            +- 0: input, "xdp-example.c", c      // input file
         +- 1: preprocessor, {0}, cpp-output     // preprocessing
      +- 2: compiler, {1}, ir                    // compilation to LLVM IR
   +- 3: backend, {2}, assembler                 // backend: IR to BPF assembly
4: assembler, {3}, object                        // assemble into the object file
root@ubuntu:~# file xdp-example.o
xdp-example.o: ELF 64-bit LSB relocatable, eBPF, version 1 (SYSV), not stripped
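
The emitted BPF instructions can be inspected with llvm-objdump (a sketch; the output format varies across LLVM versions):

root@ubuntu:~# llvm-objdump -d xdp-example.o

This should show just two instructions in section prog: r0 = 1 (the value of XDP_DROP) followed by exit.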

Loading the XDP program

First, list the host's network interfaces:

root@ubuntu:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:82:8b:d7 brd ff:ff:ff:ff:ff:ff
    altname enp3s0
    inet 192.168.19.84/16 brd 192.168.255.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe82:8bd7/64 scope link
       valid_lft forever preferred_lft forever

Ping the VM before loading the program:

caoyifan@MacBookPro ~ % ping 192.168.19.84
PING 192.168.19.84 (192.168.19.84): 56 data bytes
64 bytes from 192.168.19.84: icmp_seq=0 ttl=60 time=98.922 ms
64 bytes from 192.168.19.84: icmp_seq=1 ttl=60 time=104.328 ms
64 bytes from 192.168.19.84: icmp_seq=2 ttl=60 time=101.547 ms
64 bytes from 192.168.19.84: icmp_seq=3 ttl=60 time=107.351 ms
64 bytes from 192.168.19.84: icmp_seq=4 ttl=60 time=62.270 ms
64 bytes from 192.168.19.84: icmp_seq=5 ttl=60 time=84.121 ms
^C
--- 192.168.19.84 ping statistics ---
6 packets transmitted, 6 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 62.270/93.090/107.351/15.629 ms

Attach the XDP program to the NIC:

root@ubuntu:~# ip link set dev ens160 xdp obj xdp-example.o
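
The plain xdp flag tries native (driver) mode first and falls back to generic mode if the driver lacks XDP support. The mode can also be forced explicitly; a sketch, assuming a reasonably recent iproute2:

# native/driver mode: runs inside the driver, before skb allocation
ip link set dev ens160 xdpdrv obj xdp-example.o
# generic mode: runs after skb allocation; needs no driver support, but is slower
ip link set dev ens160 xdpgeneric obj xdp-example.o
# offload mode: runs on the NIC hardware itself (requires a supporting NIC)
ip link set dev ens160 xdpoffload obj xdp-example.o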

Check whether it is attached to ens160:

[Figure 3]
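
One way to check is ip link: when a program is attached, the device line gains an xdp (or xdpgeneric) flag together with a prog/xdp id line. The exact output format varies with the iproute2 version:

root@ubuntu:~# ip link show dev ens160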

Ping the VM again after loading the program:

caoyifan@MacBookPro ~ % ping 192.168.19.84                    
PING 192.168.19.84 (192.168.19.84): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
^C
--- 192.168.19.84 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss

Detach the XDP program:

root@ubuntu:~# ip link set dev ens160 xdp off
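
xdp-example.c drops every packet, which is why the host became unreachable. A more surgical variant parses the headers and drops only ICMP, passing everything else up the stack. A sketch, compiled the same way (note that the verifier requires explicit bounds checks before every packet access):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>

#ifndef __section
# define __section(NAME) \
    __attribute__((section(NAME), used))
#endif

/* Byte-order helper: ETH_P_IP is a host-order constant. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define bpf_htons(x) __builtin_bswap16(x)
#else
# define bpf_htons(x) (x)
#endif

__section("prog")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds-check every access, or the verifier rejects the program. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    return ip->protocol == IPPROTO_ICMP ? XDP_DROP : XDP_PASS;
}

char __license[] __section("license") = "GPL";

Attached with the same ip link command, ping should time out while TCP traffic such as SSH keeps working.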

Kernel Tracing

[Figure 4]

A Brief History of Dynamic Tracing

Strictly speaking, dynamic tracing in Linux is an advanced debugging technique. It enables deep analysis in both kernel space and user space, letting developers and system administrators locate and diagnose problems quickly and conveniently.

The table below shows the rough timeline of Linux tracing technologies:

Year    Technology
2004    kprobes/kretprobes
2005    SystemTap
2008    ftrace
2009    perf_events
2009    tracepoints
2012    uprobes
2015    eBPF
tracepoints

Tracepoints are hooks scattered throughout the kernel source. They fire when a specific piece of code is executed, and this property can be exploited by all kinds of trace/debug tools.

perf records the events produced by tracepoints and generates reports. By analyzing these reports, a performance engineer can understand what the kernel is doing while a program runs and diagnose performance symptoms accurately.
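
For example, once perf is built (next section), it can simply count how often a tracepoint fires while a command runs, using the -e event selector (a small sketch):

# count network-transmit tracepoint hits during a single ping
perf stat -e net:net_dev_xmit ping -c 1 114.114.114.114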

Installing perf

First, download the source package matching your kernel from https://mirrors.edge.kernel.org/pub/linux/kernel/

root@ubuntu:~# uname -r
5.8.0-050800-generic
root@ubuntu:~# cd linux-5.8/tools/perf/
root@ubuntu:~/linux-5.8/tools/perf# make
root@ubuntu:~/linux-5.8/tools/perf# make install
root@ubuntu:~/linux-5.8/tools/perf# ./perf list tracepoint

List of pre-defined events (to be used in -e):

alarmtimer:alarmtimer_cancel [Tracepoint event]
alarmtimer:alarmtimer_fired [Tracepoint event]
alarmtimer:alarmtimer_start [Tracepoint event]
alarmtimer:alarmtimer_suspend [Tracepoint event]
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]
block:block_bio_remap [Tracepoint event]
block:block_dirty_buffer [Tracepoint event]
block:block_getrq [Tracepoint event]
block:block_plug [Tracepoint event]
block:block_rq_complete [Tracepoint event]
block:block_rq_insert [Tracepoint event]
block:block_rq_issue [Tracepoint event]
block:block_rq_remap [Tracepoint event]
block:block_rq_requeue [Tracepoint event]
block:block_sleeprq [Tracepoint event]
block:block_split [Tracepoint event]
block:block_touch_buffer [Tracepoint event]
block:block_unplug [Tracepoint event]

User space: the system calls made by ping

root@ubuntu:~/linux-5.8/tools/perf# strace -fF -e trace=network ping 114.114.114.114 -c 1
strace: deprecated option -F ignored
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6) = 4
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("114.114.114.114")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(51536), sin_addr=inet_addr("192.168.19.84")}, [16]) = 0
setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
setsockopt(3, SOL_IP, IP_RECVTTL, [1], 4) = 0
setsockopt(3, SOL_IP, IP_RETOPTS, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [324], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
sendto(3, "\10\0\177\7\0\0\0\1\345\0053b\0\0\0\0\227\274\n\0\0\0\0\0\20\21\22\23\24\25\26\27"..., 64, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("114.114.114.114")}, 16) = 64
recvmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("114.114.114.114")}, msg_namelen=128->16, msg_iov=[{iov_base="\0\0\207\3\0\4\0\1\345\0053b\0\0\0\0\227\274\n\0\0\0\0\0\20\21\22\23\24\25\26\27"..., iov_len=192}], msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SCM_TIMESTAMP, cmsg_data={tv_sec=1647511013, tv_usec=720786}}, {cmsg_len=20, cmsg_level=SOL_IP, cmsg_type=IP_TTL, cmsg_data=[66]}], msg_controllen=56, msg_flags=0}, 0) = 64
64 bytes from 114.114.114.114: icmp_seq=1 ttl=66 time=17.1 ms

--- 114.114.114.114 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 17.147/17.147/17.147/0.000 ms
+++ exited with 0 +++
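
Note the first socket() call: modern ping uses an unprivileged ICMP datagram socket rather than a raw socket. Creating one looks like this (a sketch; it succeeds only if net.ipv4.ping_group_range covers the caller's group):

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    /* Unprivileged ICMP "ping" socket, as seen in the strace output above.
     * Fails with EACCES unless net.ipv4.ping_group_range includes our GID. */
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
    if (fd < 0)
        perror("socket");
    else
        printf("icmp datagram socket fd = %d\n", fd);
    return 0;
}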

Kernel space: the tracepoint events fired while ping runs

root@ubuntu:~/linux-5.8/tools/perf# ./perf trace --event 'net:*' ping 114.114.114.114 -c 1 > /dev/null 
0.000 ping/3954500 net:net_dev_queue(skbaddr: 0xffff9204a4a42600, len: 98, name: "ens160")
0.044 ping/3954500 net:net_dev_start_xmit(name: "ens160", queue_mapping: 3, skbaddr: 0xffff9204a4a42600, protocol: 2048, len: 98, network_offset: 14, transport_offset_valid: 1, transport_offset: 34)
0.087 ping/3954500 net:net_dev_xmit(skbaddr: 0xffff9204a4a42600, len: 98, name: "ens160")

Beyond tracing individual events, perf top shows which kernel and user-space functions are consuming CPU:

root@ubuntu:~/linux-5.8/tools/perf# ./perf top
PerfTop: 16433 irqs/sec kernel:66.4% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cpu-clock:pppH], (all, 4 CPUs)
-------------------------------------------------------------------------------------------------------------------------------------------------------

9.79% [kernel] [k] finish_task_switch
4.87% [kernel] [k] __lock_text_start
1.81% [kernel] [k] do_syscall_64
1.25% [kernel] [k] do_user_addr_fault
1.13% [kernel] [k] clear_page_orig
0.80% [kernel] [k] __softirqentry_text_start
0.72% [kernel] [k] exit_to_usermode_loop
0.51% perf [.] dso__find_symbol
0.48% [kernel] [k] rmqueue_pcplist.constprop.0
0.45% [kernel] [k] strchr
0.44% perf [.] hists__findnew_entry
0.42% [kernel] [k] copy_page_regs
0.40% perf [.] hist_entry__sort
0.39% [kernel] [k] memset_orig
0.38% [kernel] [k] string_escape_mem
0.36% perf [.] __symbols__insert
0.35% perf [.] rb_next
0.34% [kernel] [k] zap_pte_range
0.33% perf [.] sort__dso_cmp
0.33% [kernel] [k] _raw_spin_lock
0.32% ld-2.31.so [.] 0x000000000000e304
0.30% perf [.] perf_hpp__is_dynamic_entry
0.30% [kernel] [k] __schedule
0.29% dockerd [.] crypto/sha256.block
0.29% perf [.] evsel__parse_sample
0.29% perf [.] hpp__sort_overhead
0.28% [kernel] [k] copy_user_generic_unrolled
0.28% [kernel] [k] filemap_map_pages
0.27% dockerd [.] runtime.mallocgc
0.26% [kernel] [k] __handle_mm_fault
0.26% kube-apiserver [.] 0x0000000001068802
0.26% [kernel] [k] __d_lookup_rcu
0.26% [kernel] [k] memcg_kmem_get_cache
0.25% [kernel] [k] __d_lookup
0.23% dockerd [.] runtime.scanobject
0.23% perf [.] sort__sym_cmp
0.22% [kernel] [k] kmem_cache_alloc
0.21% [kernel] [k] rcu_all_qs
0.21% [kernel] [k] handle_mm_fault
0.19% [kernel] [k] __run_timers.part.0
0.19% [kernel] [k] free_unref_page_list
uprobes

A uprobe is a user-space probe, the counterpart of a kprobe, which is a kernel-space probe. A uprobe requires you to specify the probe's location inside an executable; the mechanism for inserting the probe is similar to kprobes.

Prepare a test program, demo.c:

#include <stdlib.h>
#include <stdio.h>

int count = 0;

void print_info()
{
    printf("current count = %d\n", count);
    count++;
}

int main(int argc, char *argv[])
{
    while (1) {
        print_info();
        system("sleep 1");
    }
    return 0;
}

Compile and run:

root@ubuntu:~/uprobe# gcc demo.c -o demo
root@ubuntu:~/uprobe# ./demo
current count = 0
current count = 1
current count = 2
current count = 3
current count = 4
current count = 5
current count = 6
current count = 7

If the program misbehaves at this point, we can use a uprobe to inspect it.

First, use objdump to find the symbol's offset in the binary:

root@ubuntu:~/uprobe# objdump -t demo | grep print
0000000000000000 F *UND* 0000000000000000 printf@@GLIBC_2.2.5
0000000000001169 g F .text 0000000000000033 print_info

Register the probe point in uprobe_events:

root@ubuntu:~/uprobe# cat /sys/kernel/debug/tracing/uprobe_events
root@ubuntu:~/uprobe# echo 'p:print_info /root/uprobe/demo:0x1169 %ip %ax' > /sys/kernel/debug/tracing/uprobe_events
root@ubuntu:~/uprobe# cat /sys/kernel/debug/tracing/uprobe_events
p:uprobes/print_info /root/uprobe/demo:0x0000000000001169 arg1=%ip arg2=%ax

Enable tracing:

root@ubuntu:~/uprobe# echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable

Inspect the trace log:

root@ubuntu:~/uprobe# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 10/10   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
#              | |       |   ||||      |         |
<...>-4007899 [003] d... 1239242.207371: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239243.212293: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239244.217997: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [000] d... 1239245.235024: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239246.240172: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239247.247020: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239248.255470: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [003] d... 1239249.260487: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [001] d... 1239250.266369: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0
<...>-4007899 [001] d... 1239251.271204: print_info: (0x55e58e23e169) arg1=0x55e58e23e169 arg2=0x0

Disable and clean up:

root@ubuntu:~/uprobe# echo 0 > /sys/kernel/debug/tracing/events/uprobes/enable
root@ubuntu:~/uprobe# > /sys/kernel/debug/tracing/trace
kprobes

Kprobes are a kernel instrumentation mechanism for dynamically tracing kernel behavior and collecting debugging and performance information.

Register a probe on do_sys_open, the kernel function behind open(2); the r: prefix makes it a return probe:

root@ubuntu:~# echo 'r:myprobe do_sys_open' > /sys/kernel/debug/tracing/kprobe_events 
root@ubuntu:~# cat /sys/kernel/debug/tracing/kprobe_events
r4:kprobes/myprobe do_sys_open

Enable tracing:

root@ubuntu:~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable

Inspect the recorded calls:

root@ubuntu:~# cat /sys/kernel/debug/tracing/trace
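
The kprobe-events syntax also supports fetchargs; for a return probe, $retval captures the function's return value. A sketch:

# record do_sys_open's return value (the new file descriptor, or a negative errno)
echo 'r:myprobe do_sys_open $retval' > /sys/kernel/debug/tracing/kprobe_events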

Network Tracing

  • BPF XDP
  • BPF TC hooks
  • BPF Cgroups
  • BPF sockmap and sockops
  • BPF system calls

[Figure 5]

Cilium vs Kube-router

[Figure 6]

VxLan

VxLan (Virtual eXtensible Local Area Network) is a virtualized tunneling technology: an overlay technique that builds a virtual layer-2 network on top of a layer-3 network.

Put simply, VxLan tunnels over the underlying physical network (the underlay) to construct a logical overlay network on top of UDP, decoupling the logical network from the physical one and enabling flexible topologies. It has almost no impact on the existing network architecture: a new network layer can be deployed without changing the original network at all. This property is exactly why many CNI plugins choose VxLan as their communication network.

VxLan supports not only point-to-point but also point-to-multipoint communication: a VxLan device can learn the IP addresses of remote endpoints the way a bridge learns MAC addresses, and static forwarding entries can also be configured directly, as sketched below.
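
Both behaviors can be reproduced with plain iproute2 commands. A sketch (the device names, VNI, and addresses are illustrative):

# create a VxLan device with VNI 42 on top of ens160 (4789 is the IANA VxLan port)
ip link add vxlan0 type vxlan id 42 dstport 4789 dev ens160
ip addr add 10.0.0.1/24 dev vxlan0
ip link set vxlan0 up
# static forwarding: flood traffic for unknown destinations to a known remote VTEP
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.168.19.85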

Common VxLan Terminology

  • VTEP (VxLan Tunnel Endpoint)

    The edge device of a VxLan network, responsible for processing VxLan packets (encapsulation and decapsulation). A VTEP can be a network device (e.g. a switch) or a host (e.g. a hypervisor in a virtualization cluster).

  • VNI (VxLan Network Identifier)

    The VNI identifies each VxLan segment. It is a 24-bit integer, giving 16,777,216 possible values. Typically each VNI corresponds to one tenant, so a public cloud built on VxLan can in theory support tens of millions of tenants.

  • Tunnel (VxLan tunnel)

    A tunnel is a logical concept with no corresponding physical entity in the VxLan model. It can be viewed as a virtual channel: the two communicating VxLan endpoints consider themselves to be talking directly and are unaware of the underlying network. Taken as a whole, each VxLan network behaves like a dedicated communication channel, i.e. a tunnel, set up for the virtual machines that use it.

Cilium Components

Cilium agent
  • Runs on every node
  • Deployed as a DaemonSet
  • Interacts with the CRI and Kubernetes through the CNI plugin
  • Handles address allocation via IPAM
  • Generates eBPF programs, compiles them to bytecode, and attaches them to the kernel

[Figure 7]

Cilium operator

[Figure 8]

The Cilium Control Plane

The flow of creating a Pod:

  • kubectl sends the request to the API Server
  • The API Server writes the Pod information to etcd
  • The Scheduler watches the API Server and picks a suitable node
  • The kubelet calls CRI-Containerd to create the container
  • To set up the container network, the CNI plugin is invoked, i.e. the Cilium agent
  • The Cilium agent creates the network, ultimately invoking the bpf() system call (a minimal illustration follows this list)
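
On that last point: loading an eBPF program boils down to the bpf(2) system call. A bare-bones illustration of BPF_PROG_LOAD with no helper library (a sketch; real agents go through higher-level loaders such as libbpf rather than raw syscalls, and loading may require root depending on kernel settings):

#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Two instructions: r0 = 0; exit  (i.e. "return 0"). */
    struct bpf_insn prog[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };

    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));      /* unused fields must be zero */
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (unsigned long)prog;
    attr.insn_cnt  = 2;
    attr.license   = (unsigned long)"GPL";

    /* There is no glibc wrapper for bpf(2); invoke it via syscall(2). */
    int fd = syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    if (fd < 0)
        perror("BPF_PROG_LOAD");
    else
        printf("program loaded, fd = %d\n", fd);
    return 0;
}

Compiled with gcc and run as root, it should print the file descriptor of the freshly verified program.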

[Figure 9]

Cilium Data Plane: ipvs/iptables

What a packet passes through on its way from the NIC to a Pod:

[Figure 10]

  • Through eth0: the packet lands in the NIC's ring buffer, which the kernel drains via NAPI polling
  • Through XDP, where the packet can be PASSed, DROPped, etc., provided the NIC supports XDP
  • The kernel allocates an skb, the structure that represents network data inside the kernel
  • Through GRO, which coalesces incoming packets to improve network throughput
  • Through TC ingress: rate limiting, traffic shaping, policy enforcement, and so on
  • Through Netfilter
  • Through TC egress: queueing and scheduling of outbound traffic
  • Through GSO, which segments large packets back into smaller ones
  • Local traffic takes step 17 in the figure; remote traffic takes step 18
Cilium Data Plane: eBPF

[Figure 11]

Cilium Data Plane: Service Forwarding
  • North-south traffic: XDP or TC
  • East-west traffic: BPF socket hooks
Cilium Data Forwarding and TC Hooks

[Figure 12]

Cilium creates three virtual interfaces in the host network namespace: cilium_host, cilium_net, and cilium_vxlan. On startup, the Cilium agent creates a veth pair, cilium_host <-> cilium_net, assigns the first IP address of the node's CIDR to cilium_host, and uses that address as the gateway for the CIDR. The CNI plugin then generates BPF rules, compiles them, and injects them into the kernel to wire up connectivity between the veth pair endpoints.
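
Roughly what the agent does at startup can be sketched by hand (illustrative addresses only; the real agent drives this programmatically):

# veth pair linking the host stack to the Cilium datapath
ip link add cilium_host type veth peer name cilium_net
# the first address of the node CIDR becomes the gateway on cilium_host
ip addr add 10.0.0.1/32 dev cilium_host
ip link set cilium_host up
ip link set cilium_net up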

Cilium Networking Mode: VxLan

[Figure 13]

Cilium Networking Mode: BGP Router

[Figure 14]
