Analysis of the k8s QoS implementation
2024-03-24 21:37:27

Kubernetes version: 1.24

1. cgroup parameters

This section introduces the cgroup parameters that k8s uses, focusing on cgroup v1.
cpu.shares
Sets the relative CPU weight of the processes in the cgroup. When the system is not busy, the processes may use CPU freely and are not limited by this value; when the system is busy, it guarantees the minimum share of CPU the processes can get.
The value is relative, and the default is 1024 regardless of whether the machine is single-core or multi-core. The share of CPU a cgroup can use is: this cgroup's cpu.shares / the sum of all cgroups' cpu.shares. For example, on a single-core system, if cgroups A and B both keep the default 1024, the processes in A and in B can each use 50% of the CPU. If a cgroup C with a value of 2048 is added, A gets 25%, B gets 25%, and C gets 50%. The same applies on multi-core systems.
The most important point is that cpu.shares guarantees the minimum CPU available to a cgroup: if cgroup A is entitled to 50% of the CPU, then no matter how busy the system is, the processes in A are guaranteed 50%; when the system is idle, they may use up to 100%.

In k8s, resources.requests.cpu specifies the minimum guaranteed CPU, and the function MilliCPUToShares converts it into cpu.shares. The conversion rule is:
cpu.shares = (resources.requests.cpu * 1024) / 1000
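A minimal sketch of this conversion, mirroring what MilliCPUToShares in pkg/kubelet/cm/helpers_linux.go does; the constant names here are my own, and the 2-share floor is the minimum the kubelet applies:

package main

import "fmt"

const (
    sharesPerCPU  = 1024 // cpu.shares corresponding to one full CPU
    milliCPUToCPU = 1000 // millicores per CPU
    minShares     = 2    // lower bound the kubelet uses for cpu.shares
)

// milliCPUToShares converts resources.requests.cpu (in millicores) to cpu.shares.
func milliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        // No request: fall back to the minimum so the cgroup still gets some CPU.
        return minShares
    }
    shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
    if shares < minShares {
        return minShares
    }
    return uint64(shares)
}

func main() {
    fmt.Println(milliCPUToShares(500))  // 512, a 500m request
    fmt.Println(milliCPUToShares(7000)) // 7168, e.g. 7 allocatable CPUs
}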

cpu.cfs_period_us
Sets the interval at which the cgroup's CPU bandwidth is redistributed, i.e. the length of one scheduling period. The unit is microseconds and the valid range is 1ms to 1s (1000-1000000).

cpu.cfs_quota_us
Sets how much CPU time the cgroup may consume within one period, i.e. the CPU budget per period. A value of -1 means no limit; the minimum value is 1ms.

In k8s, resources.limits.cpu specifies the maximum CPU, and the function MilliCPUToQuota converts it into cpu.cfs_quota_us; cpu.cfs_period_us can be set via a kubelet flag and defaults to 100ms. The conversion rule is:
cpu.cfs_quota_us = (resources.limits.cpu * cpu.cfs_period_us) / 1000
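A matching sketch for the quota conversion; the helper and constant names are illustrative, but the 1ms floor corresponds to the minimum quota mentioned above:

package main

import "fmt"

const (
    quotaPeriod    = 100000 // default cpu.cfs_period_us (100ms)
    minQuotaPeriod = 1000   // minimum cpu.cfs_quota_us (1ms)
)

// milliCPUToQuota converts resources.limits.cpu (in millicores) to cpu.cfs_quota_us
// for a given cpu.cfs_period_us.
func milliCPUToQuota(milliCPU, period int64) int64 {
    if milliCPU == 0 {
        return 0 // no limit declared; the quota is left unset (-1) in the cgroup
    }
    quota := (milliCPU * period) / 1000
    if quota < minQuotaPeriod {
        quota = minQuotaPeriod
    }
    return quota
}

func main() {
    fmt.Println(milliCPUToQuota(500, quotaPeriod))  // 50000, half a period
    fmt.Println(milliCPUToQuota(1000, quotaPeriod)) // 100000, one full period
}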

memory.limit_in_bytes
Sets the maximum amount of memory the processes in the cgroup may use. Without a suffix the value is in bytes; suffixes such as K/M/G are also accepted. A value of -1 means no limit.

In k8s, resources.requests.memory specifies the minimum guaranteed memory, but cgroup v1 has no way to set a memory minimum, so this value is not applied there; cgroup v2 supports it, and it can be set when the MemoryQoS feature gate is enabled.
resources.limits.memory specifies the maximum memory and is converted into memory.limit_in_bytes.

2. Pod QoS classes

Based on the values of resources.requests and resources.limits, a pod falls into one of three classes (a simplified classification sketch follows the list):
a. Guaranteed: every container in the pod specifies both requests and limits, they are non-zero, and requests equal limits
b. Burstable: at least one container in the pod specifies a request or a limit, but the pod does not meet the Guaranteed criteria
c. BestEffort: no container in the pod specifies any request or limit
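A simplified classification sketch; it only looks at cpu and memory and ignores details the real GetPodQOS handles (init containers, pod overhead, per-resource comparison), so treat it as an approximation of the rules above:

package main

import "fmt"

// res is a simplified view of a container's resources.requests / resources.limits,
// in millicores and bytes; zero means "not specified".
type res struct {
    reqCPU, limCPU int64
    reqMem, limMem int64
}

// podQOS classifies a pod from its containers' requests and limits.
func podQOS(containers []res) string {
    guaranteed := true
    hasAny := false
    for _, c := range containers {
        if c.reqCPU != 0 || c.reqMem != 0 || c.limCPU != 0 || c.limMem != 0 {
            hasAny = true
        }
        // Guaranteed requires every container to set both cpu and memory limits,
        // with requests either omitted (they default to the limits) or equal to them.
        if c.limCPU == 0 || c.limMem == 0 ||
            (c.reqCPU != 0 && c.reqCPU != c.limCPU) ||
            (c.reqMem != 0 && c.reqMem != c.limMem) {
            guaranteed = false
        }
    }
    switch {
    case !hasAny:
        return "BestEffort"
    case guaranteed:
        return "Guaranteed"
    default:
        return "Burstable"
    }
}

func main() {
    fmt.Println(podQOS([]res{{reqCPU: 500, limCPU: 500, reqMem: 128 << 20, limMem: 128 << 20}})) // Guaranteed
    fmt.Println(podQOS([]res{{reqCPU: 500, limCPU: 1000}}))                                      // Burstable
    fmt.Println(podQOS([]res{{}}))                                                               // BestEffort
}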

How the QoS classes are implemented underneath:
a. Processes of different QoS classes get different oom_score_adj values, which feed into the final oom_score; the higher the oom_score, the earlier the process is killed when an OOM occurs.
Guaranteed containers get oom_score_adj -997, Burstable containers get a value in the range 3-999, and BestEffort containers get 1000 (a sketch of the calculation follows this list).
See pkg/kubelet/qos/policy.go:GetContainerOOMScoreAdjust for details.
b. QoS is implemented with cgroups, and the classes show up as different levels of the cgroup hierarchy: Guaranteed pod cgroups sit directly under ROOT/kubepods,
Burstable pod cgroups under ROOT/kubepods/kubepods-burstable, and BestEffort pod cgroups under ROOT/kubepods/kubepods-besteffort.
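A sketch of the oom_score_adj calculation, following the logic of GetContainerOOMScoreAdjust; the qosClass string parameter is a simplification of the real signature, which takes the pod and container objects:

package main

import "fmt"

const (
    guaranteedOOMScoreAdj = -997
    bestEffortOOMScoreAdj = 1000
)

// containerOOMScoreAdj mirrors the logic described above: Guaranteed and BestEffort
// get fixed values, Burstable scales with the container's memory request relative
// to node capacity and is clamped to [3, 999].
func containerOOMScoreAdj(qosClass string, memoryRequest, memoryCapacity int64) int64 {
    switch qosClass {
    case "Guaranteed":
        return guaranteedOOMScoreAdj
    case "BestEffort":
        return bestEffortOOMScoreAdj
    }
    // Burstable: the larger the memory request, the lower (better) the score.
    adj := 1000 - (1000*memoryRequest)/memoryCapacity
    if adj < 1000+guaranteedOOMScoreAdj { // never below 3, so Guaranteed is still killed last
        return 1000 + guaranteedOOMScoreAdj
    }
    if adj == bestEffortOOMScoreAdj { // never equal to BestEffort's 1000
        return adj - 1
    }
    return adj
}

func main() {
    capacity := int64(4 << 30) // assume a 4Gi node for illustration
    fmt.Println(containerOOMScoreAdj("Guaranteed", 128<<20, capacity)) // -997
    fmt.Println(containerOOMScoreAdj("Burstable", 128<<20, capacity))  // 969
    fmt.Println(containerOOMScoreAdj("BestEffort", 0, capacity))       // 1000
}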

Notes:
The QoS class cannot be set in the pod YAML; it is computed automatically, see pkg/apis/core/helper/qos/qos.go:GetPodQOS.

If only a limit is specified and no request, the request defaults to the value of the limit.
If both are specified, the request must not be greater than the limit.

kube-scheduler only considers requests when scheduling; limits are ignored.

3. cgroup drivers

Two cgroup drivers are supported: cgroupfs and systemd. The former writes the cgroup files directly; the latter operates indirectly through systemd's interfaces.
With the systemd driver the cgroup directory names carry a .slice suffix; see pkg/kubelet/cm/cgroup_manager_linux.go:ToSystemd.
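A sketch of the name conversion ToSystemd performs; the inline dash expansion here stands in for the systemd ExpandSlice helper the real code calls, so the exact escaping rules are an approximation:

package main

import (
    "fmt"
    "strings"
)

// toSystemd turns a cgroupfs-style name such as ["kubepods", "burstable", "pod<uid>"]
// into a systemd slice path: components are accumulated with "-" and every level
// gets a ".slice" suffix; systemd expands the dashes back into nested directories.
func toSystemd(components []string) string {
    if len(components) == 0 {
        return "/"
    }
    var path []string
    prefix := ""
    for _, c := range components {
        // systemd treats "-" as a hierarchy separator, so dashes inside a component
        // (e.g. a pod UID) are escaped to "_", which is why pod UIDs show up with
        // underscores in the slice names.
        c = strings.ReplaceAll(c, "-", "_")
        if prefix == "" {
            prefix = c
        } else {
            prefix = prefix + "-" + c
        }
        path = append(path, prefix+".slice")
    }
    return "/" + strings.Join(path, "/")
}

func main() {
    fmt.Println(toSystemd([]string{"kubepods", "burstable", "pod18ec1047-8414-4905-8747-ccb1dd50e0bc"}))
    // /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice
}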

4. kubelet flags related to cgroups

a. --cgroups-per-qos: when enabled, the kubelet creates a cgroup hierarchy for the QoS classes and the pods; defaults to true
b. --cgroup-root: the root cgroup, default /, i.e. /sys/fs/cgroup/; when --cgroups-per-qos is also enabled, kubepods is appended automatically, giving /kubepods
c. --enforce-node-allocatable: which reservations the kubelet enforces via cgroups; valid values are none, pods, system-reserved and kube-reserved, default pods.
If system-reserved is listed, --system-reserved-cgroup must also be set;
if kube-reserved is listed, --kube-reserved-cgroup must also be set
d. --system-reserved: resources reserved for system processes, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
e. --kube-reserved: resources reserved for k8s component processes, e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi
f. --system-reserved-cgroup: absolute cgroup path for system processes; the values from --system-reserved are written into this cgroup to cap what system processes can use.
For example, if /sys is specified and the systemd driver is used, the user must create /sys/fs/cgroup/sys.slice in advance
g. --kube-reserved-cgroup: absolute cgroup path for k8s component processes; the values from --kube-reserved are written into this cgroup to cap what the k8s components can use.
For example, if /kube is specified and the systemd driver is used, the user must create /sys/fs/cgroup/kube.slice in advance
h. --system-cgroups: absolute cgroup path for system processes, best placed under the --system-reserved-cgroup hierarchy, e.g. /sys.slice/system; the path is created automatically.
See pkg/kubelet/cm/container_manager_linux.go:ensureSystemCgroups. This function tries to move every process that is neither a kernel thread nor process 1 into this cgroup,
but on systems running systemd every process is either a kernel thread or a descendant of PID 1, so even with this flag set no process is actually moved into this cgroup
i. --kubelet-cgroups: absolute cgroup path for the kubelet process, best placed under the --kube-reserved-cgroup hierarchy, e.g. /kube.slice/kubelet;
the path is created automatically, see pkg/kubelet/cm/container_manager_linux.go:ensureProcessInContainerWithOOMScore. This function also sets the kubelet's
oom_score_adj to -999
j. --qos-reserved: the percentage of requested resources to reserve for higher-priority pods; currently only memory is supported. For example, with memory=100% and 1G of allocatable memory, creating a Guaranteed pod with a 100M limit
reserves 100M for it, so memory.limit_in_bytes of the Burstable and BestEffort cgroups is set to 900M; creating another Burstable pod with a 200M limit then lowers the BestEffort cgroup's memory.limit_in_bytes to 700M (a sketch of this arithmetic follows the list)
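A small sketch of the --qos-reserved arithmetic from item j (it mirrors setMemoryReserve, shown in section 7); the function name is mine and the numbers reproduce the example above:

package main

import "fmt"

// qosMemoryLimits reproduces the reservation arithmetic: the Burstable limit reserves
// the Guaranteed pods' memory requests, and the BestEffort limit additionally reserves
// the Burstable pods' memory requests.
func qosMemoryLimits(allocatable, guaranteedReq, burstableReq, percentReserve int64) (burstableLimit, bestEffortLimit int64) {
    burstableLimit = allocatable - guaranteedReq*percentReserve/100
    bestEffortLimit = burstableLimit - burstableReq*percentReserve/100
    return
}

func main() {
    // The example above: 1G allocatable, a Guaranteed pod using 100M, then a
    // Burstable pod using 200M, with --qos-reserved=memory=100%. Values in MB.
    b, be := qosMemoryLimits(1000, 100, 200, 100)
    fmt.Println(b, be) // 900 700
}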

5. The k8s cgroup hierarchy

After kubelet starts, it creates a kubepods directory under the --cgroup-root directory, e.g. /sys/fs/cgroup/cpu/kubepods, and writes the node's allocatable resources into the corresponding cgroup files there, such as cpu.shares. Every pod created later is placed under kubepods, which is how the resources of all pods are capped as a whole. Under kubepods the pods are split by QoS class: Guaranteed pods get their cgroups directly under kubepods; for Burstable pods a kubepods-burstable.slice directory is created under kubepods and their pod cgroups live there; for BestEffort pods a kubepods-besteffort.slice directory is created under kubepods and their pod cgroups live there.

Let's create one pod of each QoS class and look at the resulting cgroup hierarchy.

a. A pod whose requests equal its limits, i.e. a Guaranteed pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo1
  template:
    metadata:
      labels:
        app: nginx-demo1
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "128Mi"
            cpu: "500m"

b. A pod whose requests are lower than its limits, i.e. a Burstable pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo1
  template:
    metadata:
      labels:
        app: nginx-demo1
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
          limits:
            memory: "256Mi"
            cpu: "1000m"

c. A pod with no requests and no limits, i.e. a BestEffort pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo1
  template:
    metadata:
      labels:
        app: nginx-demo1
    spec:
      nodeName: master
      containers:
      - image: nginx:1.14
        imagePullPolicy: IfNotPresent
        name: nginx

Below is the memory cgroup hierarchy. The kubepods.slice directory contains every pod on this node; its memory.limit_in_bytes of 2809M caps the memory all pods can use together. The subdirectories under kubepods.slice correspond to the QoS classes and the pods; there is currently only one Guaranteed pod, but with more of them there would be more directories at that level. Note that a pod's containers are not under the pod's own directory but under system.slice/containerd.service.

root@master:/root# tree /sys/fs/cgroup/memory
/sys/fs/cgroup/memory/
├── memory.limit_in_bytes  //9223372036854771712
├── kubepods.slice
│   ├── memory.limit_in_bytes  //2946347008 bytes / 2877292Ki / 2809M
│   ├── kubepods-besteffort.slice
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice //BestEffort pod
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-burstable.slice
│   │   ├── memory.limit_in_bytes //9223372036854771712, the max value, i.e. no memory cap at the QoS level
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice //Burstable pod
│   │   │   ├── memory.limit_in_bytes //268435456/256M
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice //Guaranteed pod
│   │   ├── memory.limit_in_bytes //134217728/128M
│   │   └── tasks
│   └── tasks
├── kube.slice
│   ├── memory.limit_in_bytes //104857600/100M, 100M reserved for k8s components
│   ├── kubelet
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   └── tasks
│   └── tasks
├── sys.slice
│   ├── memory.limit_in_bytes //104857600/100M, 100M reserved for system processes
│   └── tasks
├── system.slice
│   ├── memory.limit_in_bytes //9223372036854771712
│   ├── containerd.service
│   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│   │   │   ├── memory.limit_in_bytes //268435456/256M
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│   │   │   ├── memory.limit_in_bytes //9223372036854771712
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│   │   │   ├── memory.limit_in_bytes //134217728/128M
│   │   │   └── tasks
│   └── tasks
├── tasks

Below is the CPU cgroup hierarchy. The directory structure mirrors the memory one, and cpu.cfs_period_us is 100000 at every level.

root@master:/root# tree /sys/fs/cgroup/cpu
/sys/fs/cgroup/cpu/
├── cpu.cfs_period_us //100000
├── cpu.cfs_quota_us //-1
├── cpu.shares //1024
├── kubepods.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //7168
│   ├── kubepods-besteffort.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //2
│   │   ├── kubepods-besteffort-podde4983ac-ff0c-40be-8472-8b6674593aa3.slice
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-burstable.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1546
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //100000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   │   └── tasks
│   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //50000
│   │   ├── cpu.shares //512
│   │   └── tasks
│   └── tasks
├── kube.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //512
│   ├── kubelet
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1024
│   │   └── tasks
│   └── tasks
├── sys.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //512
│   └── tasks
├── system.slice
│   ├── cpu.cfs_period_us
│   ├── cpu.cfs_quota_us //-1
│   ├── cpu.shares //1024
│   ├── containerd.service
│   │   ├── cpu.cfs_period_us
│   │   ├── cpu.cfs_quota_us //-1
│   │   ├── cpu.shares //1024
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:5a323896aa0db2f15c9f82145cd38851783d08d8bf132f3ed4a7613a3830f71a
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-besteffort-podde4983ac_ff0c_40be_8472_8b6674593aa3.slice:cri-containerd:e6803695024464a3365721812dcff0347c40e162b8142244a527da7b785f215c
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:3c6a7115e688913d0a6d382607f0c1a9b5ecf58d4ee33c9c24e640dc33b80acc
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //100000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   │   ├── kubepods-burstable-pod18ec1047_8414_4905_8747_ccb1dd50e0bc.slice:cri-containerd:67e2b0336ed2af44875ad7b1fb9c35bae335673cf20a2a1d8331b85d4bea4d95
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:836a0a6aa460663b9a4dc8961dd55da11ae090c9e76705f81e9c7d43060423c3
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //-1
│   │   │   ├── cpu.shares //2
│   │   │   └── tasks
│   │   ├── kubepods-pod5799fccc_d1f5_4958_b13f_6a82378a8934.slice:cri-containerd:9bbc1d7134d322e988ace0cbb4fc75f44184f4e0f24f1c0228be7eed6ec6f659
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us //50000
│   │   │   ├── cpu.shares //512
│   │   │   └── tasks
│   └── tasks
├── tasks

6. How the resource values are computed

Based on the hierarchy above, the cgroups k8s uses fall into four levels: node level, qos level, pod level and container level. Let's look at how the values at each level are computed.
a. node level
The node level caps the total resources pods can use, so that unbounded resource requests cannot crowd out the other processes on the node and destabilize it. This relies on the node allocatable mechanism, which a later article will cover in detail; here we only look at how the node-level values are computed.
The node's total resources are its capacity; kube-reserved and system-reserved are the reservations given by --kube-reserved and --system-reserved. If --enforce-node-allocatable includes pods, the node-level budget is capacity - kube-reserved - system-reserved; if it does not include pods, the node-level budget is simply the capacity.

kubepods.slice/cpu.shares = capacity(cpu) - kube-reserved(cpu) - system-reserved(cpu)  //only converted into cpu.shares; no quota is set at this level
kubepods.slice/memory.limit_in_bytes = capacity(memory) - kube-reserved(memory) - system-reserved(memory)
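A sketch of the node-level computation; the capacity and reservation numbers are assumptions chosen so that the CPU result lines up with the kubepods.slice value seen in the tree above:

package main

import "fmt"

// nodeLevelConfig sketches what gets written into kubepods.slice:
// capacity minus the two reserved buckets, with CPU converted to cpu.shares only.
func nodeLevelConfig(capacityMilliCPU, capacityMemory, reservedMilliCPU, reservedMemory int64) (cpuShares, memLimit int64) {
    allocatableCPU := capacityMilliCPU - reservedMilliCPU
    cpuShares = allocatableCPU * 1024 / 1000 // no cpu.cfs_quota_us at the node level
    memLimit = capacityMemory - reservedMemory
    return
}

func main() {
    // Assume an 8-CPU / 3Gi node with cpu=500m,memory=100Mi in both --kube-reserved
    // and --system-reserved (1000m / 200Mi reserved in total).
    shares, mem := nodeLevelConfig(8000, 3*(1<<30), 1000, 200*(1<<20))
    fmt.Println(shares) // 7168, matching kubepods.slice/cpu.shares in the tree above
    fmt.Println(mem)    // memory.limit_in_bytes written into kubepods.slice
}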

b. qos level
The three QoS classes are handled differently at this level. Guaranteed pods have requests equal to limits and sit directly under kubepods, so nothing needs to be computed for them. The memory values depend on --qos-reserved: if it is not specified, no memory limit is applied at this level. The formulas below assume --qos-reserved is specified.

Burstable:
kubepods.slice/kubepods-burstable.slice/cpu.shares = sum of requests[cpu] of all Burstable pods
kubepods.slice/kubepods-burstable.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - {(sum of requests[memory] of all Guaranteed pods) * (reservePercent / 100)}

BestEffort:
kubepods.slice/kubepods-besteffort.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/memory.limit_in_bytes = kubepods.slice/memory.limit_in_bytes - {(sum of requests[memory] of all Guaranteed and Burstable pods) * (reservePercent / 100)}
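A sketch of the Burstable QoS-level cpu.shares computation (it mirrors setCPUCgroupConfig, shown in section 7); the pod requests used here are hypothetical:

package main

import "fmt"

// burstableQOSShares sums the CPU requests of every Burstable pod on the node and
// converts the total with the same millicores-to-shares rule as before.
func burstableQOSShares(burstableCPURequestsMilli []int64) int64 {
    var sum int64
    for _, r := range burstableCPURequestsMilli {
        sum += r
    }
    shares := sum * 1024 / 1000
    if shares < 2 {
        return 2 // same floor as the BestEffort class
    }
    return shares
}

func main() {
    // Hypothetical node with three Burstable pods requesting 500m, 100m and 250m.
    fmt.Println(burstableQOSShares([]int64{500, 100, 250})) // 870
}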

c. pod level
The computation for each QoS class is as follows:

Guaranteed:
kubepods.slice/kubepods-pod<uid>.slice/cpu.shares = sum of requests[cpu] of all containers
kubepods.slice/kubepods-pod<uid>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-pod<uid>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers
kubepods.slice/kubepods-pod<uid>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers

Burstable:
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.shares = sum of requests[cpu] of all containers
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.cfs_quota_us = sum of limits[cpu] of all containers (only set when every container declares a cpu limit, otherwise -1)
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/memory.limit_in_bytes = sum of limits[memory] of all containers (only set when every container declares a memory limit)

BestEffort:
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/cpu.shares = 2
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/cpu.cfs_period_us = 100000
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/cpu.cfs_quota_us = -1 (no limit)
kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod<uid>.slice/memory.limit_in_bytes = left at the maximum (no limit)
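A worked example for the two example pods from section 5 (Guaranteed 500m/128Mi, Burstable requesting 500m/128Mi with limits 1000m/256Mi); the helper is a simplification of ResourceConfigForPod for a single-container pod, and the printed values match the pod-level cgroups in the trees above:

package main

import "fmt"

// podLevelConfig sketches the pod-level values: requests drive cpu.shares,
// limits drive cpu.cfs_quota_us and memory.limit_in_bytes.
func podLevelConfig(reqMilliCPU, limMilliCPU, limMemBytes int64) (shares, quota, period, mem int64) {
    period = 100000
    shares = reqMilliCPU * 1024 / 1000
    quota = limMilliCPU * period / 1000
    mem = limMemBytes
    return
}

func main() {
    // Guaranteed example pod: requests == limits == 500m / 128Mi.
    fmt.Println(podLevelConfig(500, 500, 128<<20)) // 512 50000 100000 134217728
    // Burstable example pod: requests 500m/128Mi, limits 1000m/256Mi.
    fmt.Println(podLevelConfig(500, 1000, 256<<20)) // 512 100000 100000 268435456
}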

d. container level
The computation for each QoS class is as follows:

Guaranteed:
system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>/cpu.shares = requests[cpu] of the container
system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limits[cpu] of the container
system.slice/containerd.service/kubepods-pod<uid>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limits[memory] of the container

Burstable:
system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>/cpu.shares = requests[cpu] of the container
system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = limits[cpu] of the container (-1 if no cpu limit is declared)
system.slice/containerd.service/kubepods-burstable-pod<uid>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = limits[memory] of the container (maximum if no memory limit is declared)

BestEffort:
system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>/cpu.shares = 2
system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_period_us = 100000
system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>/cpu.cfs_quota_us = -1
system.slice/containerd.service/kubepods-besteffort-pod<uid>.slice:cri-containerd:<container-id>/memory.limit_in_bytes = 9223372036854771712

7. Source code analysis

Here we look at when the cgroups at each level are created and updated.
a. node level
Call path: containerManagerImpl.Start -> setupNode -> createNodeAllocatableCgroups

// Code path: pkg/kubelet/cm/node_container_manager.go
// createNodeAllocatableCgroups creates Node Allocatable Cgroup when CgroupsPerQOS flag is specified as true
func (cm *containerManagerImpl) createNodeAllocatableCgroups() error {
    // get the node's capacity
    nodeAllocatable := cm.internalCapacity
    // Use Node Allocatable limits instead of capacity if the user requested enforcing node allocatable.
    nc := cm.NodeConfig.NodeAllocatableConfig
    // if --cgroups-per-qos is true and --enforce-node-allocatable includes pods,
    // subtract the --system-reserved and --kube-reserved reservations
    if cm.CgroupsPerQOS && nc.EnforceNodeAllocatable.Has(kubetypes.NodeAllocatableEnforcementKey) {
        nodeAllocatable = cm.getNodeAllocatableInternalAbsolute()
    }

    cgroupConfig := &CgroupConfig{
        Name: cm.cgroupRoot,
        // The default limits for cpu shares can be very low which can lead to CPU starvation for pods.
        ResourceParameters: getCgroupConfig(nodeAllocatable),
    }
    // ask the cgroupManager whether the cgroup already exists
    if cm.cgroupManager.Exists(cgroupConfig.Name) {
        return nil
    }
    // if not, create the node-level cgroup directory, e.g. /sys/fs/cgroup/kubepods.slice
    if err := cm.cgroupManager.Create(cgroupConfig); err != nil {
        klog.ErrorS(err, "Failed to create cgroup", "cgroupName", cm.cgroupRoot)
        return err
    }
    return nil
}

// getNodeAllocatableInternalAbsolute is similar to getNodeAllocatableAbsolute except that
// it also includes internal resources (currently process IDs).  It is intended for setting
// up top level cgroups only.
func (cm *containerManagerImpl) getNodeAllocatableInternalAbsolute() v1.ResourceList {
    return cm.getNodeAllocatableAbsoluteImpl(cm.internalCapacity)
}

func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
    result := make(v1.ResourceList)
    for k, v := range capacity {
        value := v.DeepCopy()
        if cm.NodeConfig.SystemReserved != nil {
            value.Sub(cm.NodeConfig.SystemReserved[k])
        }
        if cm.NodeConfig.KubeReserved != nil {
            value.Sub(cm.NodeConfig.KubeReserved[k])
        }
        if value.Sign() < 0 {
            // Negative Allocatable resources don't make sense.
            value.Set(0)
        }
        result[k] = value
    }
    return result
}

// getCgroupConfig returns a ResourceConfig object that can be used to create or update cgroups via CgroupManager interface.
func getCgroupConfig(rl v1.ResourceList) *ResourceConfig {
    // TODO(vishh): Set CPU Quota if necessary.
    if rl == nil {
        return nil
    }
    var rc ResourceConfig
    if q, exists := rl[v1.ResourceMemory]; exists {
        // Memory is defined in bytes.
        val := q.Value()
        rc.Memory = &val
    }
    if q, exists := rl[v1.ResourceCPU]; exists {
        // CPU is defined in milli-cores.
        val := MilliCPUToShares(q.MilliValue())
        rc.CpuShares = &val
    }
    if q, exists := rl[pidlimit.PIDs]; exists {
        val := q.Value()
        rc.PidsLimit = &val
    }
    rc.HugePageLimit = HugePageLimits(rl)

    return &rc
}

b. qos level
Call path: containerManagerImpl.Start -> setupNode -> cm.qosContainerManager.Start

Start creates the cgroup directories for the BestEffort and Burstable classes and starts a goroutine that runs UpdateCgroups once a minute to refresh the cgroups' resource values.

Code path: pkg/kubelet/cm/qos_container_manager.go
func (m *qosContainerManagerImpl) Start(getNodeAllocatable func() v1.ResourceList, activePods ActivePodsFunc) error {
    cm := m.cgroupManager
    rootContainer := m.cgroupRoot
    if !cm.Exists(rootContainer) {
        return fmt.Errorf("root container %v doesn't exist", rootContainer)
    }

    // Top level for Qos containers are created only for Burstable
    // and Best Effort classes
    qosClasses := map[v1.PodQOSClass]CgroupName{
        v1.PodQOSBurstable:  NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBurstable))),
        v1.PodQOSBestEffort: NewCgroupName(rootContainer, strings.ToLower(string(v1.PodQOSBestEffort))),
    }

    // Create containers for both qos classes
    for qosClass, containerName := range qosClasses {
        resourceParameters := &ResourceConfig{}
        // For the BestEffort class, cpu.shares is always MinShares;
        // the Burstable class starts at 0 and is refreshed later by UpdateCgroups
        // the BestEffort QoS class has a statically configured minShares value
        if qosClass == v1.PodQOSBestEffort {
            minShares := uint64(MinShares)
            resourceParameters.CpuShares = &minShares
        }

        // containerConfig object stores the cgroup specifications
        containerConfig := &CgroupConfig{
            Name:               containerName,
            ResourceParameters: resourceParameters,
        }

        // for each enumerated huge page size, the qos tiers are unbounded
        m.setHugePagesUnbounded(containerConfig)

        // ask the cgroupManager whether it already exists; create it if not
        // check if it exists
        if !cm.Exists(containerName) {
            if err := cm.Create(containerConfig); err != nil {
                return fmt.Errorf("failed to create top level %v QOS cgroup : %v", qosClass, err)
            }
        } else {
            // to ensure we actually have the right state, we update the config on startup
            if err := cm.Update(containerConfig); err != nil {
                return fmt.Errorf("failed to update top level %v QOS cgroup : %v", qosClass, err)
            }
        }
    }
    // Store the top level qos container names
    m.qosContainersInfo = QOSContainersInfo{
        Guaranteed: rootContainer,
        Burstable:  qosClasses[v1.PodQOSBurstable],
        BestEffort: qosClasses[v1.PodQOSBestEffort],
    }
    m.getNodeAllocatable = getNodeAllocatable
    m.activePods = activePods

    // start a goroutine that runs UpdateCgroups once a minute to refresh the values
    // update qos cgroup tiers on startup and in periodic intervals
    // to ensure desired state is in sync with actual state.
    go wait.Until(func() {
        err := m.UpdateCgroups()
        if err != nil {
            klog.InfoS("Failed to reserve QoS requests", "err", err)
        }
    }, periodicQOSCgroupUpdateInterval, wait.NeverStop)

    return nil
}

UpdateCgroups is called from two places: the periodic goroutine started above, and syncPod when a pod is created. In both cases the goal is to fold the resources
requested by the pods into the corresponding QoS cgroups.

func (m *qosContainerManagerImpl) UpdateCgroups() error {
    m.Lock()
    defer m.Unlock()

    qosConfigs := map[v1.PodQOSClass]*CgroupConfig{
        v1.PodQOSGuaranteed: {
            Name:               m.qosContainersInfo.Guaranteed,
            ResourceParameters: &ResourceConfig{},
        },
        v1.PodQOSBurstable: {
            Name:               m.qosContainersInfo.Burstable,
            ResourceParameters: &ResourceConfig{},
        },
        v1.PodQOSBestEffort: {
            Name:               m.qosContainersInfo.BestEffort,
            ResourceParameters: &ResourceConfig{},
        },
    }

    // gather the cpu requests of the active Burstable pods (BestEffort is fixed)
    // update the qos level cgroup settings for cpu shares
    if err := m.setCPUCgroupConfig(qosConfigs); err != nil {
        return err
    }

    // update the qos level cgroup settings for huge pages (ensure they remain unbounded)
    if err := m.setHugePagesConfig(qosConfigs); err != nil {
        return err
    }

    // cgroup v2 feature, skipped here
    // update the qos level cgrougs v2 settings of memory qos if feature enabled
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
        libcontainercgroups.IsCgroup2UnifiedMode() {
        m.setMemoryQoS(qosConfigs)
    }

    // if the QOSReserved feature gate is enabled, compute memory limits for the
    // Burstable and BestEffort QoS cgroups
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.QOSReserved) {
        for resource, percentReserve := range m.qosReserved {
            switch resource {
            case v1.ResourceMemory:
                m.setMemoryReserve(qosConfigs, percentReserve)
            }
        }

        updateSuccess := true
        for _, config := range qosConfigs {
            err := m.cgroupManager.Update(config)
            if err != nil {
                updateSuccess = false
            }
        }
        if updateSuccess {
            klog.V(4).InfoS("Updated QoS cgroup configuration")
            return nil
        }

        // If the resource can adjust the ResourceConfig to increase likelihood of
        // success, call the adjustment function here.  Otherwise, the Update() will
        // be called again with the same values.
        for resource, percentReserve := range m.qosReserved {
            switch resource {
            case v1.ResourceMemory:
                m.retrySetMemoryReserve(qosConfigs, percentReserve)
            }
        }
    }

    // finally write the configs into the corresponding cgroups
    for _, config := range qosConfigs {
        err := m.cgroupManager.Update(config)
        if err != nil {
            klog.ErrorS(err, "Failed to update QoS cgroup configuration")
            return err
        }
    }

    klog.V(4).InfoS("Updated QoS cgroup configuration")
    return nil
}

func (m *qosContainerManagerImpl) setCPUCgroupConfig(configs map[v1.PodQOSClass]*CgroupConfig) error {
    pods := m.activePods()
    burstablePodCPURequest := int64(0)
    for i := range pods {
        pod := pods[i]
        // get the pod's QoS class
        qosClass := v1qos.GetPodQOS(pod)
        // only Burstable pods matter here
        if qosClass != v1.PodQOSBurstable {
            // we only care about the burstable qos tier
            continue
        }
        // accumulate the cpu requests
        req, _ := resource.PodRequestsAndLimits(pod)
        if request, found := req[v1.ResourceCPU]; found {
            burstablePodCPURequest += request.MilliValue()
        }
    }

    // BestEffort's cpu.shares is always 2
    // make sure best effort is always 2 shares
    bestEffortCPUShares := uint64(MinShares)
    configs[v1.PodQOSBestEffort].ResourceParameters.CpuShares = &bestEffortCPUShares

    // set burstable shares based on current observe state
    burstableCPUShares := MilliCPUToShares(burstablePodCPURequest)
    configs[v1.PodQOSBurstable].ResourceParameters.CpuShares = &burstableCPUShares
    return nil
}

// setMemoryReserve sums the memory limits of all pods in a QOS class,
// calculates QOS class memory limits, and set those limits in the
// CgroupConfig for each QOS class.
func (m *qosContainerManagerImpl) setMemoryReserve(configs map[v1.PodQOSClass]*CgroupConfig, percentReserve int64) {
    qosMemoryRequests := m.getQoSMemoryRequests()
    // getNodeAllocatable is the function GetNodeAllocatableAbsolute
    resources := m.getNodeAllocatable()
    allocatableResource, ok := resources[v1.ResourceMemory]
    if !ok {
        klog.V(2).InfoS("Allocatable memory value could not be determined, not setting QoS memory limits")
        return
    }
    allocatable := allocatableResource.Value()
    if allocatable == 0 {
        klog.V(2).InfoS("Allocatable memory reported as 0, might be in standalone mode, not setting QoS memory limits")
        return
    }

    for qos, limits := range qosMemoryRequests {
        klog.V(2).InfoS("QoS pod memory limit", "qos", qos, "limits", limits, "percentReserve", percentReserve)
    }

    // Calculate QOS memory limits
    burstableLimit := allocatable - (qosMemoryRequests[v1.PodQOSGuaranteed] * percentReserve / 100)
    bestEffortLimit := burstableLimit - (qosMemoryRequests[v1.PodQOSBurstable] * percentReserve / 100)
    configs[v1.PodQOSBurstable].ResourceParameters.Memory = &burstableLimit
    configs[v1.PodQOSBestEffort].ResourceParameters.Memory = &bestEffortLimit
}

// getQoSMemoryRequests sums and returns the memory request of all pods for
// guaranteed and burstable qos classes.
func (m *qosContainerManagerImpl) getQoSMemoryRequests() map[v1.PodQOSClass]int64 {
    qosMemoryRequests := map[v1.PodQOSClass]int64{
        v1.PodQOSGuaranteed: 0,
        v1.PodQOSBurstable:  0,
    }

    // Sum the pod limits for pods in each QOS class
    pods := m.activePods()
    for _, pod := range pods {
        podMemoryRequest := int64(0)
        qosClass := v1qos.GetPodQOS(pod)
        if qosClass == v1.PodQOSBestEffort {
            // limits are not set for Best Effort pods
            continue
        }
        req, _ := resource.PodRequestsAndLimits(pod)
        if request, found := req[v1.ResourceMemory]; found {
            podMemoryRequest += request.Value()
        }
        qosMemoryRequests[qosClass] += podMemoryRequest
    }

    return qosMemoryRequests
}

// GetNodeAllocatableAbsolute returns the absolute value of Node Allocatable which is primarily useful for enforcement.
// Note that not all resources that are available on the node are included in the returned list of resources.
// Returns a ResourceList.
func (cm *containerManagerImpl) GetNodeAllocatableAbsolute() v1.ResourceList {
    return cm.getNodeAllocatableAbsoluteImpl(cm.capacity)
}

func (cm *containerManagerImpl) getNodeAllocatableAbsoluteImpl(capacity v1.ResourceList) v1.ResourceList {
    result := make(v1.ResourceList)
    for k, v := range capacity {
        value := v.DeepCopy()
        if cm.NodeConfig.SystemReserved != nil {
            value.Sub(cm.NodeConfig.SystemReserved[k])
        }
        if cm.NodeConfig.KubeReserved != nil {
            value.Sub(cm.NodeConfig.KubeReserved[k])
        }
        if value.Sign() < 0 {
            // Negative Allocatable resources don't make sense.
            value.Set(0)
        }
        result[k] = value
    }
    return result
}

c. pod level
When a pod is created, UpdateQOSCgroups is called first to refresh the QoS-level cgroups, then EnsureExists creates the pod-level cgroup.

func (kl *Kubelet) syncPod(...) {
    ...
    pcm := kl.containerManager.NewPodContainerManager()
    if !pcm.Exists(pod) {
        // refresh the burstable cgroup
        kl.containerManager.UpdateQOSCgroups()
        // create the pod's cgroup under kubepods
        pcm.EnsureExists(pod)
    }
    ...
}

Code path: pkg/kubelet/cm/pod_container_manager.go
// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
    podContainerName, _ := m.GetPodContainerName(pod)

    // check if container already exist
    alreadyExists := m.Exists(pod)
    if !alreadyExists {
        enforceMemoryQoS := false
        if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
            libcontainercgroups.IsCgroup2UnifiedMode() {
            enforceMemoryQoS = true
        }
        // Create the pod container
        containerConfig := &CgroupConfig{
            Name:               podContainerName,
            ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
        }
        if m.podPidsLimit > 0 {
            containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
        }
        if enforceMemoryQoS {
            klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
        }
        if err := m.cgroupManager.Create(containerConfig); err != nil {
            return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
        }
    }
    return nil
}

Code path: pkg/kubelet/cm/helpers_linux.go
// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod, enforceCPULimits bool, cpuPeriod uint64, enforceMemoryQoS bool) *ResourceConfig {
    // sum requests and limits.
    reqs, limits := resource.PodRequestsAndLimits(pod)

    cpuRequests := int64(0)
    cpuLimits := int64(0)
    memoryLimits := int64(0)
    if request, found := reqs[v1.ResourceCPU]; found {
        cpuRequests = request.MilliValue()
    }
    if limit, found := limits[v1.ResourceCPU]; found {
        cpuLimits = limit.MilliValue()
    }
    if limit, found := limits[v1.ResourceMemory]; found {
        memoryLimits = limit.Value()
    }

    // convert to CFS values
    cpuShares := MilliCPUToShares(cpuRequests)
    cpuQuota := MilliCPUToQuota(cpuLimits, int64(cpuPeriod))

    // track if limits were applied for each resource.
    memoryLimitsDeclared := true
    cpuLimitsDeclared := true

    // map hugepage pagesize (bytes) to limits (bytes)
    hugePageLimits := map[int64]int64{}
    for _, container := range pod.Spec.Containers {
        if container.Resources.Limits.Cpu().IsZero() {
            cpuLimitsDeclared = false
        }

        if container.Resources.Limits.Memory().IsZero() {
            memoryLimitsDeclared = false
        }

        containerHugePageLimits := HugePageLimits(container.Resources.Requests)
        for k, v := range containerHugePageLimits {
            if value, exists := hugePageLimits[k]; exists {
                hugePageLimits[k] = value + v
            } else {
                hugePageLimits[k] = v
            }
        }
    }

    for _, container := range pod.Spec.InitContainers {
        if container.Resources.Limits.Cpu().IsZero() {
            cpuLimitsDeclared = false
        }

        if container.Resources.Limits.Memory().IsZero() {
            memoryLimitsDeclared = false
        }

        containerHugePageLimits := HugePageLimits(container.Resources.Requests)
        for k, v := range containerHugePageLimits {
            if value, exists := hugePageLimits[k]; !exists || v > value {
                hugePageLimits[k] = v
            }
        }
    }

    // quota is not capped when cfs quota is disabled
    if !enforceCPULimits {
        cpuQuota = int64(-1)
    }

    // determine the qos class
    qosClass := v1qos.GetPodQOS(pod)

    // build the result
    result := &ResourceConfig{}
    if qosClass == v1.PodQOSGuaranteed {
        result.CpuShares = &cpuShares
        result.CpuQuota = &cpuQuota
        result.CpuPeriod = &cpuPeriod
        result.Memory = &memoryLimits
    } else if qosClass == v1.PodQOSBurstable {
        result.CpuShares = &cpuShares
        if cpuLimitsDeclared {
            result.CpuQuota = &cpuQuota
            result.CpuPeriod = &cpuPeriod
        }
        if memoryLimitsDeclared {
            result.Memory = &memoryLimits
        }
    } else {
        shares := uint64(MinShares)
        result.CpuShares = &shares
    }
    result.HugePageLimit = hugePageLimits

    if enforceMemoryQoS {
        memoryMin := int64(0)
        if request, found := reqs[v1.ResourceMemory]; found {
            memoryMin = request.Value()
        }
        if memoryMin > 0 {
            result.Unified = map[string]string{
                MemoryMin: strconv.FormatInt(memoryMin, 10),
            }
        }
    }

    return result
}

d. container level
Call path: startContainer -> generateContainerConfig -> applyPlatformSpecificContainerConfig -> generateLinuxContainerConfig

generateLinuxContainerConfig translates the requests and limits configured by the user into the container's config, which is eventually passed to containerd to create the container and its container-level cgroup.

Code path: pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
// generateLinuxContainerConfig generates linux container config for kubelet runtime v1.
func (m *kubeGenericRuntimeManager) generateLinuxContainerConfig(container *v1.Container, pod *v1.Pod, uid *int64, username string, nsTarget *kubecontainer.ContainerID, enforceMemoryQoS bool) *runtimeapi.LinuxContainerConfig {
    ...
    // set linux container resources
    var cpuShares int64
    cpuRequest := container.Resources.Requests.Cpu()
    cpuLimit := container.Resources.Limits.Cpu()
    memoryLimit := container.Resources.Limits.Memory().Value()
    memoryRequest := container.Resources.Requests.Memory().Value()
    oomScoreAdj := int64(qos.GetContainerOOMScoreAdjust(pod, container,
        int64(m.machineInfo.MemoryCapacity)))
    // If request is not specified, but limit is, we want request to default to limit.
    // API server does this for new containers, but we repeat this logic in Kubelet
    // for containers running on existing Kubernetes clusters.
    if cpuRequest.IsZero() && !cpuLimit.IsZero() {
        cpuShares = milliCPUToShares(cpuLimit.MilliValue())
    } else {
        // if cpuRequest.Amount is nil, then milliCPUToShares will return the minimal number
        // of CPU shares.
        cpuShares = milliCPUToShares(cpuRequest.MilliValue())
    }
    lc.Resources.CpuShares = cpuShares
    if memoryLimit != 0 {
        lc.Resources.MemoryLimitInBytes = memoryLimit
    }
    // Set OOM score of the container based on qos policy. Processes in lower-priority pods should
    // be killed first if the system runs out of memory.
    lc.Resources.OomScoreAdj = oomScoreAdj

    if m.cpuCFSQuota {
        // if cpuLimit.Amount is nil, then the appropriate default value is returned
        // to allow full usage of cpu resource.
        cpuPeriod := int64(quotaPeriod)
        if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
            cpuPeriod = int64(m.cpuCFSQuotaPeriod.Duration / time.Microsecond)
        }
        cpuQuota := milliCPUToQuota(cpuLimit.MilliValue(), cpuPeriod)
        lc.Resources.CpuQuota = cpuQuota
        lc.Resources.CpuPeriod = cpuPeriod
    }
    ...
    return lc
}

References
https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md
https://github.com/kubernetes/design-proposals-archive/blob/main/node/resource-qos.md
https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/resource_management_guide/sec-memory
