NAME
mpstat - Report processors related statistics.
vmstat observes the performance of the system as a whole, pidstat observes the performance of an individual process, and mpstat observes the performance of individual CPUs.
The mpstat command writes to standard output the activities of each available processor, processor 0 being the first one. Global average activities among all processors are also reported. mpstat can be used on both SMP and UP machines, but in the latter case only the global average activities are printed. If no option is specified, the default report is the CPU utilization report.
Note:
UP (Uni-Processor): the system has a single processor unit, i.e. a single-core CPU system.
SMP (Symmetric Multi-Processors): the system has multiple processor units that share the bus, memory, and so on.
[root@localhost ~]# mpstat
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

02:48:09 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:48:09 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
mpstat ... [ interval [ count ] ]
The interval parameter specifies the amount of time in seconds between each report. A value of 0 (or no parameters at all) indicates that processor statistics are to be reported for the time since system startup (boot). The count parameter can be specified in conjunction with the interval parameter if the latter is not set to zero. The value of count determines the number of reports generated, interval seconds apart. If the interval parameter is specified without the count parameter, the mpstat command generates reports continuously.
[root@localhost ~]# mpstat 2 5
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

02:55:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:55:12 PM all 0.50 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.00
02:55:14 PM all 0.25 0.00 0.63 0.00 0.00 0.00 0.00 0.00 0.00 99.12
02:55:16 PM all 0.38 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.12
02:55:18 PM all 0.50 0.00 0.62 0.00 0.00 0.00 0.00 0.00 0.00 98.88
02:55:20 PM all 0.38 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.12
Average: all 0.40 0.00 0.55 0.00 0.00 0.00 0.00 0.00 0.00 99.05
This displays five reports of global statistics among all processors at two second intervals.
Under the hood, mpstat simply reads the data in /proc/stat:
open("/proc/stat", O_RDONLY) = 3
-P { cpu [,...] | ON | ALL }
Indicate the processor number for which statistics are to be reported. cpu is the processor number, and processor 0 is the first processor. The ON keyword indicates that statistics are to be reported for every online processor, while the ALL keyword indicates that statistics are to be reported for all processors.
Report statistics for processor 1 (the second processor):
[root@localhost ~]# mpstat -P 1
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

03:08:34 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:08:34 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
Report statistics for every online processor:
[root@localhost ~]# mpstat -P ON
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

03:09:59 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:09:59 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
03:09:59 PM 0 0.31 0.00 0.27 0.00 0.00 0.00 0.00 0.00 0.00 99.41
03:09:59 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
03:09:59 PM 2 0.33 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.38
03:09:59 PM 3 0.34 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.37
Report statistics for all processors:
[root@localhost ~]# mpstat -P ALL
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

03:11:24 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:11:24 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
03:11:24 PM 0 0.31 0.00 0.27 0.00 0.00 0.00 0.00 0.00 0.00 99.41
03:11:24 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
03:11:24 PM 2 0.33 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.38
03:11:24 PM 3 0.34 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.37
Meaning of each field:
| Field | Description |
|---|---|
| CPU | Processor number. The keyword all indicates that statistics are calculated as averages among all processors. |
| %usr | Percentage of CPU utilization that occurred while executing at the user level (application). |
| %nice | Percentage of CPU utilization that occurred while executing at the user level with nice priority. |
| %sys | Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts. |
| %iowait | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. |
| %irq | Percentage of time spent by the CPU or CPUs to service hardware interrupts. |
| %soft | Percentage of time spent by the CPU or CPUs to service software interrupts. |
| %steal | Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor. |
| %guest | Percentage of time spent by the CPU or CPUs to run a virtual processor. |
| %gnice | Percentage of time spent by the CPU or CPUs to run a niced guest. |
| %idle | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request. |
For a fuller explanation of these fields, see the article "Linux top命令的cpu使用率和内存使用率".
-I { SUM | CPU | SCPU | ALL }
Report interrupts statistics.
With the SUM keyword, the mpstat command reports the total number of interrupts per processor. The following values are displayed:
CPU: processor number. The keyword all indicates that statistics are calculated as averages among all processors.
intr/s: the total number of interrupts received per second by the CPU or CPUs.
[root@localhost ~]# mpstat -I SUM
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

03:21:23 PM  CPU    intr/s
03:21:23 PM all 93.37
With the CPU keyword, the number of each individual (hardware) interrupt received per second by the CPU or CPUs is displayed.
[root@localhost ~]# mpstat -I CPU
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

03:23:41 PM  CPU    0/s    1/s    8/s    9/s   12/s   16/s   20/s  120/s  121/s  122/s  123/s  124/s  125/s  126/s  127/s  NMI/s  LOC/s  SPU/s  PMI/s  IWI/s  RTR/s  RES/s  CAL/s  TLB/s  TRM/s  THR/s  DFR/s  MCE/s  MCP/s  ERR/s  MIS/s  PIN/s  NPI/s  PIW/s
03:23:41 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 4.03 0.00 0.00 0.00 20.01 0.00 0.00 0.19 0.00 0.21 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 21.62 0.00 0.00 0.21 0.00 0.15 0.00 0.14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 22.75 0.00 0.00 0.17 0.00 0.15 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.00 22.86 0.00 0.00 0.25 0.00 0.15 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The data source is the /proc/interrupts file, which shows how hardware interrupts are being handled:
Note: an interrupt is essentially a special electrical signal sent by a hardware device to the processor. On receiving an interrupt, the processor immediately informs the operating system of its arrival, and the operating system is then responsible for handling the newly arrived data. Hardware devices generate interrupts with no regard for synchronization with the processor clock, i.e. an interrupt can arrive at any moment.
Interrupts are in effect an asynchronous event-handling mechanism, and they improve the system's capacity for concurrent processing.
Because an interrupt handler preempts other running processes, it needs to run as quickly as possible to minimize the impact on normal process scheduling (the sketch below shows the usual way drivers achieve this).
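The schematic kernel-module sketch below (a hypothetical device; the IRQ number, names and functions are illustrative only, not from any real driver) shows the common top-half/bottom-half split: the hard interrupt handler does the bare minimum and defers the heavy processing to a tasklet, which runs later in softirq context and is counted under the TASKLET row of /proc/softirqs.

#include <linux/interrupt.h>
#include <linux/module.h>

/* bottom half: runs later in softirq context, so it may take its time */
static void mydev_do_work(unsigned long data)
{
	/* ... process the data fetched by the top half ... */
}
static DECLARE_TASKLET(mydev_tasklet, mydev_do_work, 0);

/* top half: runs with the interrupt line active, so keep it short */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
	/* acknowledge the device, grab the data, then defer the rest */
	tasklet_schedule(&mydev_tasklet);
	return IRQ_HANDLED;
}

/* registration, e.g. in the driver's probe path:
 *     request_irq(irq, mydev_isr, IRQF_SHARED, "mydev", dev);
 */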
open("/proc/interrupts", O_RDONLY) = 3
[root@localhost ~]# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:         55          0          0          0   IR-IO-APIC-edge      timer
   1:          4          0          0          0   IR-IO-APIC-edge      i8042
   8:          1          0          0          0   IR-IO-APIC-edge      rtc0
   9:          4          0          0          0   IR-IO-APIC-fasteoi   acpi
  12:          3          3          0          0   IR-IO-APIC-edge      i8042
  16:          0          0          0          0   IR-IO-APIC-fasteoi   i801_smbus
  20:          0          0          0          0   IR-IO-APIC-fasteoi   idma64.0
 120:          0          0          0          0   DMAR_MSI-edge        dmar0
 121:          0          0          0          0   DMAR_MSI-edge        dmar1
 122:          0          0          0          0   IR-PCI-MSI-edge      aerdrv, PCIe PME
 123:        148         16         10          2   IR-PCI-MSI-edge      xhci_hcd
 124:       4738        519        421      54943   IR-PCI-MSI-edge      0000:00:17.0
 125:    1111752          0          0          0   IR-PCI-MSI-edge      enp1s0
 126:         38          1        109          9   IR-PCI-MSI-edge      i915
 127:        541        136        191         85   IR-PCI-MSI-edge      snd_hda_intel:card0
 NMI:         56         52         55         56   Non-maskable interrupts
 LOC:    5504316    5950291    6263079    6292876   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:         56         52         55         56   Performance monitoring interrupts
 IWI:      52427      58892      47990      68017   IRQ work interrupts
 RTR:          0          0          0          0   APIC ICR read retries
 RES:      56937      40801      42634      41527   Rescheduling interrupts
 CAL:       1150       1147       1194       1149   Function call interrupts
 TLB:      28982      39043      19609      20986   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:        918        918        918        918   Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0          0          0   Posted-interrupt notification event
 NPI:          0          0          0          0   Nested posted-interrupt event
 PIW:          0          0          0          0   Posted-interrupt wakeup event
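To make the mapping to mpstat -I concrete, here is a minimal user-space sketch (hypothetical, not sysstat's actual source) that parses /proc/interrupts and sums the cumulative interrupt counts per CPU; mpstat takes two such samples and divides the deltas by the interval to obtain the per-second rates shown above.

/* irqsum.c - sum the per-CPU columns of /proc/interrupts */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCPU 256

int main(void)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];
	unsigned long long sum[MAXCPU] = {0};
	int ncpu = 0, i;

	if (!f || !fgets(line, sizeof(line), f))
		return 1;
	/* the header row holds one "CPUn" label per online CPU */
	for (char *p = strstr(line, "CPU"); p; p = strstr(p + 3, "CPU"))
		ncpu++;

	while (fgets(line, sizeof(line), f)) {
		char *colon = strchr(line, ':');
		if (!colon)
			continue;
		*colon = '\0';
		/* ERR and MIS are global counters, not per-CPU columns */
		if (strstr(line, "ERR") || strstr(line, "MIS"))
			continue;
		char *p = colon + 1;
		for (i = 0; i < ncpu && i < MAXCPU; i++) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;
			sum[i] += v;
			p = end;
		}
	}
	fclose(f);

	for (i = 0; i < ncpu && i < MAXCPU; i++)
		printf("CPU%d: %llu interrupts since boot\n", i, sum[i]);
	return 0;
}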
Some of these fields:
NMI (Non-maskable interrupts): NMI is incremented here because every timer interrupt generates an NMI (non-maskable interrupt), which the NMI watchdog uses to detect lockups.
LOC: the local interrupt counter of each CPU's internal APIC.
SPU: a spurious interrupt is an interrupt that was raised and then lowered by some IO device before it could be fully processed by the APIC. Hence the APIC sees the interrupt but does not know what device it came from; in this case the APIC generates the interrupt with an IRQ vector of 0xff. Spurious interrupts may also be caused by chipset bugs.
RES (Rescheduling interrupts), CAL (Function call interrupts), TLB (TLB shootdowns): rescheduling, call, and TLB flush interrupts sent from one CPU to another as the OS requires. Typically, their statistics are used by kernel developers and interested users to determine how often interrupts of a given type occur.
TRM (Thermal event interrupts): a thermal event interrupt occurs when a temperature threshold of the CPU has been exceeded. This interrupt may also be generated when the temperature drops back to normal.
THR (Threshold APIC interrupts): an interrupt raised when a configurable threshold is exceeded by the machine check threshold counters (which typically count ECC-corrected errors of memory or cache). Only available on some systems.
// linux-3.10/fs/proc/interrupts.c
/*
 * /proc/interrupts
 */
static void *int_seq_start(struct seq_file *f, loff_t *pos)
{
	return (*pos <= nr_irqs) ? pos : NULL;
}

static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos)
{
	(*pos)++;
	if (*pos > nr_irqs)
		return NULL;
	return pos;
}

static void int_seq_stop(struct seq_file *f, void *v)
{
	/* Nothing to do */
}

static const struct seq_operations int_seq_ops = {
	.start = int_seq_start,
	.next  = int_seq_next,
	.stop  = int_seq_stop,
	.show  = show_interrupts
};

static int interrupts_open(struct inode *inode, struct file *filp)
{
	return seq_open(filp, &int_seq_ops);
}

static const struct file_operations proc_interrupts_operations = {
	.open    = interrupts_open,
	.read    = seq_read,
	.llseek  = seq_lseek,
	.release = seq_release,
};

static int __init proc_interrupts_init(void)
{
	proc_create("interrupts", 0, NULL, &proc_interrupts_operations);
	return 0;
}
module_init(proc_interrupts_init);
The show_interrupts function:
// linux-3.10/kernel/irq/proc.c
int show_interrupts(struct seq_file *p, void *v)
{
	......
	arch_show_interrupts(p, prec);
	......
}
arch_show_interrupts is an architecture-specific function. For x86:
// linux-3.10/arch/x86/kernel/irq.c
#define irq_stats(x)	(&per_cpu(irq_stat, x))
/*
 * /proc/interrupts printing for arch specific interrupts
 */
int arch_show_interrupts(struct seq_file *p, int prec)
{
	int j;

	seq_printf(p, "%*s: ", prec, "NMI");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
	seq_printf(p, "  Non-maskable interrupts\n");
#ifdef CONFIG_X86_LOCAL_APIC
	seq_printf(p, "%*s: ", prec, "LOC");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
	seq_printf(p, "  Local timer interrupts\n");
	seq_printf(p, "%*s: ", prec, "SPU");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_spurious_count);
	seq_printf(p, "  Spurious interrupts\n");
	seq_printf(p, "%*s: ", prec, "PMI");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
	seq_printf(p, "  Performance monitoring interrupts\n");
	seq_printf(p, "%*s: ", prec, "IWI");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->apic_irq_work_irqs);
	seq_printf(p, "  IRQ work interrupts\n");
	seq_printf(p, "%*s: ", prec, "RTR");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->icr_read_retry_count);
	seq_printf(p, "  APIC ICR read retries\n");
#endif
	if (x86_platform_ipi_callback) {
		seq_printf(p, "%*s: ", prec, "PLT");
		for_each_online_cpu(j)
			seq_printf(p, "%10u ", irq_stats(j)->x86_platform_ipis);
		seq_printf(p, "  Platform interrupts\n");
	}
#ifdef CONFIG_SMP
	seq_printf(p, "%*s: ", prec, "RES");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_resched_count);
	seq_printf(p, "  Rescheduling interrupts\n");
	seq_printf(p, "%*s: ", prec, "CAL");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_call_count -
					irq_stats(j)->irq_tlb_count);
	seq_printf(p, "  Function call interrupts\n");
	seq_printf(p, "%*s: ", prec, "TLB");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_tlb_count);
	seq_printf(p, "  TLB shootdowns\n");
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
	seq_printf(p, "%*s: ", prec, "TRM");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
	seq_printf(p, "  Thermal event interrupts\n");
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
	seq_printf(p, "%*s: ", prec, "THR");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
	seq_printf(p, "  Threshold APIC interrupts\n");
#endif
#ifdef CONFIG_X86_MCE
	seq_printf(p, "%*s: ", prec, "MCE");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
	seq_printf(p, "  Machine check exceptions\n");
	seq_printf(p, "%*s: ", prec, "MCP");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
	seq_printf(p, "  Machine check polls\n");
#endif
	seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
#if defined(CONFIG_X86_IO_APIC)
	seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
#endif
	return 0;
}
As you can see, the values are mostly read from per-cpu memory areas. For background on x86_64 per-cpu variables, see: Linux per-cpu. The sketch below shows the basic pattern in miniature.
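A hypothetical sketch (illustrative names, not kernel code) of the idiom behind irq_stat: interrupt-path code increments its own CPU's copy without any locking, and a /proc show function later visits every online CPU's copy with per_cpu(), just as arch_show_interrupts() does.

#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned int, demo_irq_count);

/* hot path, interrupt context: touch only this CPU's copy, lock-free */
static void demo_on_interrupt(void)
{
	__this_cpu_inc(demo_irq_count);
}

/* reader, e.g. a /proc show function: visit every CPU's copy */
static unsigned int demo_total(void)
{
	unsigned int sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += per_cpu(demo_irq_count, cpu);
	return sum;
}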
// linux-3.10/arch/x86/include/asm/hardirq.h
typedef struct {
	unsigned int __softirq_pending;
	unsigned int __nmi_count;	/* arch dependent */
#ifdef CONFIG_X86_LOCAL_APIC
	unsigned int apic_timer_irqs;	/* arch dependent */
	unsigned int irq_spurious_count;
	unsigned int icr_read_retry_count;
#endif
#ifdef CONFIG_HAVE_KVM
	unsigned int kvm_posted_intr_ipis;
#endif
	unsigned int x86_platform_ipis;	/* arch dependent */
	unsigned int apic_perf_irqs;
	unsigned int apic_irq_work_irqs;
#ifdef CONFIG_SMP
	unsigned int irq_resched_count;
	unsigned int irq_call_count;
	/*
	 * irq_tlb_count is double-counted in irq_call_count, so it must be
	 * subtracted from irq_call_count when displaying irq_call_count
	 */
	unsigned int irq_tlb_count;
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
	unsigned int irq_thermal_count;
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
	unsigned int irq_threshold_count;
#endif
} ____cacheline_aligned irq_cpustat_t;

DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);

// linux-3.10/include/linux/irq_cpustat.h
/*
 * Simple wrappers reducing source bloat.  Define all irq_stat fields
 * here, even ones that are arch dependent.  That way we get common
 * definitions instead of differing sets for each arch.
 */
#ifndef __ARCH_IRQ_STAT
extern irq_cpustat_t irq_stat[];	/* defined in asm/hardirq.h */
#define __IRQ_STAT(cpu, member)	(irq_stat[cpu].member)
#endif
With the SCPU keyword, the number of each individual software interrupt received per second by the CPU or CPUs is displayed.
[root@localhost ~]# mpstat -I SCPU
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)   11/28/2022   _x86_64_   (4 CPU)

04:48:54 PM  CPU      HI/s  TIMER/s  NET_TX/s  NET_RX/s  BLOCK/s  BLOCK_IOPOLL/s  TASKLET/s  SCHED/s  HRTIMER/s  RCU/s
04:48:54 PM 0 0.00 10.62 0.17 4.26 0.02 0.00 0.02 6.63 0.00 3.92
04:48:54 PM 1 0.00 12.78 0.00 0.03 0.00 0.00 0.00 7.19 0.00 4.89
04:48:54 PM 2 0.00 12.41 0.00 0.03 0.00 0.00 0.00 7.28 0.00 4.48
04:48:54 PM 3 0.00 12.97 0.00 0.02 0.19 0.00 0.00 7.13 0.00 4.90
The data source is the /proc/softirqs file, which shows how softirqs are being handled:
open("/proc/softirqs", O_RDONLY) = 3
[root@localhost ~]# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:         29         12         81          4
       TIMER:    2978642    3586869    3484711    3642228
      NET_TX:      46707          2          2          1
      NET_RX:    1195259       8070       7563       6755
       BLOCK:       5432        776        578      53783
BLOCK_IOPOLL:          0          0          0          0
     TASKLET:       5769          0          0          0
       SCHED:    1860352    2017852    2042455    1999179
     HRTIMER:          0          0          0          0
         RCU:    1100586    1372825    1258876    1377782
Softirqs come in 10 categories, each corresponding to a different kind of work. For example, NET_RX is the network receive softirq and NET_TX is the network transmit softirq.
The types in detail:
| Softirq | Priority | Description |
|---|---|---|
| HI | 0 | High-priority tasklets |
| TIMER | 1 | Timer bottom half |
| NET_TX | 2 | Network transmit |
| NET_RX | 3 | Network receive |
| BLOCK | 4 | Block devices |
| BLOCK_IOPOLL | 5 | Block devices (I/O polling) |
| TASKLET | 6 | Normal-priority tasklets |
| SCHED | 7 | Scheduler |
| HRTIMER | 8 | High-resolution timers |
| RCU | 9 | RCU |
Softirqs with a smaller priority number (e.g. 0) are executed before those with a larger one (e.g. 9); the simplified excerpt below shows why.
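The ordering falls out of how the pending bitmap is scanned. Simplified from __do_softirq() in linux-3.10/kernel/softirq.c (accounting and interrupt enable/disable omitted), the loop walks softirq_vec from bit 0 upward, so HI (bit 0) is always handled before RCU (bit 9):

	pending = local_softirq_pending();
	h = softirq_vec;	/* softirq_vec[0] is HI_SOFTIRQ */
	do {
		if (pending & 1)
			h->action(h);	/* run this softirq's handler */
		h++;
		pending >>= 1;		/* move on to the next-higher number */
	} while (pending);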
Softirqs are in fact also serviced by kernel threads: each CPU has one softirq kernel thread, named ksoftirqd/<CPU number>.
[root@localhost ~]# top -n 1 | grep ksoftirqd
    3 root      20   0       0      0      0 S  0.0  0.0   0:00.89 ksoftirqd/0
   14 root      20   0       0      0      0 S  0.0  0.0   0:00.10 ksoftirqd/1
   19 root      20   0       0      0      0 S  0.0  0.0   0:00.13 ksoftirqd/2
   24 root      20   0       0      0      0 S  0.0  0.0   0:00.18 ksoftirqd/3
Or:
[root@localhost ~]# ps aux | grep ksoftirq
root 3 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/0]
root 14 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/1]
root 19 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/2]
root 24 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/3]
The names of these threads are wrapped in square brackets, which means ps could not retrieve their command-line (cmdline) arguments. As a general rule, a name in square brackets in ps output indicates a kernel thread.
Note: softirqs have a higher priority than ordinary processes. While a softirq is running it can re-raise itself so that it gets executed again (the network subsystem does this, for instance). If softirqs fire at a high rate, and have this ability to re-arm themselves, user-space processes can be starved of run time.
This is exactly what the ksoftirqd kernel threads are for: when a flood of softirqs appears, the kernel offloads softirq processing to these threads. They run at a relatively low priority (a nice value of 0, as seen above, the same as ordinary processes), which prevents ordinary processes from being starved of run time.
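The handoff happens in __do_softirq(): after processing the pending softirqs it re-checks the pending mask, restarts a bounded number of times, and only then wakes ksoftirqd. Simplified excerpt from linux-3.10/kernel/softirq.c:

restart:
	/* ... handle all currently pending softirqs ... */

	pending = local_softirq_pending();
	if (pending) {
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;	/* more softirqs arrived meanwhile */

		wakeup_softirqd();	/* hand the remainder to ksoftirqd/N */
	}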
Kernel thread priorities are not uniformly low, though. The high-priority kworker pools (names ending in H), for example, run with a nice value of -20:
[root@localhost ~]# top -n 1 | grep kwork
    5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H
   16 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/1:0H
   21 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/2:0H
   26 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/3:0H
  502 root      20   0       0      0      0 S  0.0  0.0   0:00.88 kworker/0:1
The kworker/*H threads have a nice value of -20, while the ordinary kworker/0:1 runs at nice 0.
Overall priority ordering: hardware interrupts > softirqs > ordinary processes ≥ kernel threads.
Kernel thread performance issues:
In Linux, each CPU has a softirq kernel thread named ksoftirqd/<CPU number>. When softirq events fire at too high a rate, the kernel thread itself can suffer from excessive CPU usage and fail to process softirqs in time, leading to performance problems such as network send/receive latency and sluggish scheduling.
Elevated softirq CPU usage (softirq) is a very common performance problem. Although there are many softirq types, the bottlenecks encountered in production are mostly the network send/receive softirqs, especially network receive (NET_RX).
(1)softirqs
// linux-3.10/fs/proc/softirqs.c
/*
 * /proc/softirqs  ... display the number of softirqs
 */
static int show_softirqs(struct seq_file *p, void *v)
{
	int i, j;

	seq_puts(p, "                    ");
	for_each_possible_cpu(i)
		seq_printf(p, "CPU%-8d", i);
	seq_putc(p, '\n');

	for (i = 0; i < NR_SOFTIRQS; i++) {
		seq_printf(p, "%12s:", softirq_to_name[i]);
		for_each_possible_cpu(j)
			seq_printf(p, " %10u", kstat_softirqs_cpu(i, j));
		seq_putc(p, '\n');
	}
	return 0;
}

static int softirqs_open(struct inode *inode, struct file *file)
{
	return single_open(file, show_softirqs, NULL);
}

static const struct file_operations proc_softirqs_operations = {
	.open		= softirqs_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

static int __init proc_softirqs_init(void)
{
	proc_create("softirqs", 0, NULL, &proc_softirqs_operations);
	return 0;
}
module_init(proc_softirqs_init);
// linux-3.10/include/linux/interrupt.h
/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
   frequency threaded job scheduling. For almost all the purposes
   tasklets are more than enough. F.e. all serial device BHs et
   al. should be converted to tasklets, not to softirqs.
 */
enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	BLOCK_IOPOLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,	/* Preferable RCU should always be the last softirq */

	NR_SOFTIRQS
};

/* map softirq index to softirq name. update 'softirq_to_name' in
 * kernel/softirq.c when adding a new softirq.
 */
extern char *softirq_to_name[NR_SOFTIRQS];

// linux-3.10/kernel/softirq.c
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

char *softirq_to_name[NR_SOFTIRQS] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
	"TASKLET", "SCHED", "HRTIMER", "RCU"
};
The per-cpu variable struct kernel_stat kstat is defined here, and the kstat symbol is exported:
// linux-3.10/kernel/sched/core.c
DEFINE_PER_CPU(struct kernel_stat, kstat);
EXPORT_PER_CPU_SYMBOL(kstat);
[root@localhost ~]# cat /proc/kallsyms | grep '\<kstat\>'
0000000000015b60 A kstat
[root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_start\>'
0000000000000000 A __per_cpu_start
[root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_end\>'
000000000001d000 A __per_cpu_end
kstat lies within the range between __per_cpu_start and __per_cpu_end, so it is a per-cpu variable in the kernel.
Reading the softirqs data:
// linux-3.10/include/linux/kernel_stat.h
struct kernel_stat {
#ifndef CONFIG_GENERIC_HARDIRQS
	unsigned int irqs[NR_IRQS];
#endif
	unsigned long irqs_sum;
	unsigned int softirqs[NR_SOFTIRQS];
};

DECLARE_PER_CPU(struct kernel_stat, kstat);

#define kstat_cpu(cpu) per_cpu(kstat, cpu)

static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
{
	return kstat_cpu(cpu).softirqs[irq];
}
(2)ksoftirqd
ksoftirqd is defined as a struct task_struct pointer stored in per-cpu memory:
// linux-3.10/kernel/softirq.c
DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

/*
 * we cannot loop indefinitely here to avoid userspace starvation,
 * but we also don't want to introduce a worst case 1/HZ latency
 * to the pending events, so lets the scheduler to balance
 * the softirq load for us.
 */
static void wakeup_softirqd(void)
{
	/* Interrupts are disabled: no need to stop preemption */
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}
Declaration of ksoftirqd:
// linux-3.10/include/linux/interrupt.h
DECLARE_PER_CPU(struct task_struct *, ksoftirqd);

static inline struct task_struct *this_cpu_ksoftirqd(void)
{
	return this_cpu_read(ksoftirqd);
}
References:
Linux kernel 3.10.0 source code
Linux Kernel Development (《Linux内核设计与实现》)
极客时间: Linux性能优化