linux 3.10 的又一次hung

阅读原文时间：2021年04月21日阅读：1

最近又遇到一次hung，dmesg中堆栈如下：

[176223.913270] ------------[ cut here ]------------ [176223.913280] WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:300 dev_watchdog+0x248/0x260-----注意cpu号是10 [176223.913282] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 5 timed out----------------warn的内容 [176223.913283] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176223.913340] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176223.913347] CPU: 10 PID: 0 Comm: swapper/10 Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176223.913348] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176223.913350] Call Trace:
[176223.913351] [] dump_stack+0x19/0x1b [176223.913361] [] __warn+0xd8/0x100----------------------warn了，内容见第二行 [176223.913364] [] warn_slowpath_fmt+0x5f/0x80 [176223.913369] [] dev_watchdog+0x248/0x260----------------在处理到dev_watchdog时候出现问题 [176223.913372] [] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176223.913375] [] call_timer_fn+0x38/0x110----------------处理回调 [176223.913378] [] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176223.913382] [] run_timer_softirq+0x22d/0x310-----------这个软中断是timer [176223.913386] [] __do_softirq+0xfd/0x290 [176223.913390] [] call_softirq+0x1c/0x30 [176223.913394] [] do_softirq+0x65/0xa0---------------------硬中断处理后，顺便处理软中断 [176223.913397] [] irq_exit+0x175/0x180---------------------硬中断结束 [176223.913400] [] smp_apic_timer_interrupt+0x48/0x60 [176223.913402] [] apic_timer_interrupt+0x162/0x170---------硬中断开始 [176223.913403] [] ? cpuidle_enter_state+0x57/0xd0 [176223.913410] [] cpuidle_idle_call+0xde/0x230 [176223.913414] [] arch_cpu_idle+0xe/0x40 [176223.913420] [] cpu_startup_entry+0x14a/0x1c0 [176223.913423] [] start_secondary+0x1f2/0x270 [176223.913425] ---[ end trace 30e7271cf4a53655 ]--- [176223.913430] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
[176224.918915] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds
[176224.918917] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 5 seconds
[176224.918948] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 5 seconds
[176224.934872] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 5 seconds
[176226.403293] hrtimer_gaq_time = 912735, soft_irq_time = 42216, cpu = 7 [176233.921764] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
[176234.911438] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 10 seconds
[176234.911441] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds
[176234.911474] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 10 seconds
[176234.943364] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 10 seconds
[176237.664127] hrtimer_gaq_time = 854268, soft_irq_time = 72171, cpu = 43

如果只盯着这个堆栈看，可以看出，网卡的 dev_watchdog 函数检测到了eth4的queue 5 出现了 trans_timeout。

超时的检测周期，不同的设备是不一样的，intel的ixgbe对应的超时时间是 ixgbe_main.c 中设置的

netdev->watchdog_timeo = 5 * HZ;

检测的原理就是，每当发包成功的时候，设置 txq->trans_start ，如果这个值在timer到期检测的时候，和当前时间相差一个超时周期(默认是5s，各种驱动可能不同)，则认为出现发送

超时。依赖于定时器软中断，检测的对象是处于running状态的网卡已经处于stop状态的发送队列，话说网卡如果没有running，检测也没有意义。

static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, bool more)
{ const struct net_device_ops *ops = dev->netdev_ops; int rc; /*__netdev_start_xmit 里面就完全是使用driver 的ops去发包了，其实到此为止，一个skb已经从netdevice层送到driver层了，接下来会等待driver的返回*/ rc = __netdev_start_xmit(ops, skb, dev, more); if (rc == NETDEV_TX_OK)
txq_trans_update(txq); return rc;
}

static inline void txq_trans_update(struct netdev_queue *txq)
{ if (txq->xmit_lock_owner != -1)
txq->trans_start = jiffies;---------------------更新当前发送队列txq的最后发送时间
}

ixgbe驱动会检测是否pending，也就是有的tx中的ring没变动了，那么打印warning。

接下来，又有一个堆栈：

[176242.505583] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kthread_send/9:178838]-----------注意这个时候是cpu9
[176242.506532] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176242.506588] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176242.506597] CPU: 9 PID: 178838 Comm: kthread_send/9 Tainted: G W OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176242.506599] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176242.506601] task: ffff88290f7ddee0 ti: ffff882b680d0000 task.ti: ffff882b680d0000
[176242.506603] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x1ce/0x200 [176242.506610] RSP: 0018:ffff882fbe843dc0 EFLAGS: 00000202 [176242.506611] RAX: 0000000000000001 RBX: ffff882fbe843d50 RCX: 0000000000000001 [176242.506613] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff882fb80a0bc0-------大家都在抢的锁
[176242.506614] RBP: ffff882fbe843dc0 R08: 0000000000000101 R09: ffff8801363f4000
[176242.506616] R10: 0000000000000000 R11: ffff882fbe843da8 R12: ffff882fbe843d38
[176242.506617] R13: ffffffff816c6732 R14: ffff882fbe843dc0 R15: ffff882fb80a0b40
[176242.506619] FS: 0000000000000000(0000) GS:ffff882fbe840000(0000) knlGS:0000000000000000 [176242.506621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [176242.506622] CR2: 00007fd86fd46316 CR3: 0000000001a0a000 CR4: 00000000003607e0
[176242.506624] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [176242.506626] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [176242.506627] Call Trace:
[176242.506629] [176242.506630]
[176242.506635] [] queued_spin_lock_slowpath+0xb/0xf [176242.506640] [] _raw_spin_lock+0x20/0x30 [176242.506645] [] dev_watchdog+0x72/0x260-----------------------同上分析，不过这次是拿不到锁 [176242.506649] [] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176242.506653] [] call_timer_fn+0x38/0x110 [176242.506655] [] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176242.506658] [] run_timer_softirq+0x22d/0x310 [176242.506663] [] __do_softirq+0xfd/0x290 [176242.506667] [] call_softirq+0x1c/0x30 [176242.506672] [] do_softirq+0x65/0xa0 [176242.506675] [] irq_exit+0x175/0x180 [176242.506679] [] smp_apic_timer_interrupt+0x48/0x60 [176242.506682] [] apic_timer_interrupt+0x162/0x170----------------被硬中断打断 [176242.506684]

[176242.506702] [] ? ixgbe_xmit_frame_ring+0x53/0xf30 [ixgbe]
[176242.506710] [] ixgbe_xmit_frame+0x58/0xd0 [ixgbe]
[176242.506715] [] wit_send_tasklet+0x73c/0xae0 [witdriver]
[176242.506719] [] wit_kthread_xmit_fn+0xb5/0x150 [witdriver]----------我们的内核线程kthread_send调用的发包函数
[176242.506723] [] ? wit_send_tasklet+0xae0/0xae0 [witdriver]
[176242.506726] [] kthread+0xd1/0xe0
[176242.506729] [] ? insert_kthread_work+0x40/0x40
[176242.506733] [] ret_from_fork+0x77/0xb0
[176242.506736] [] ? insert_kthread_work+0x40/0x40

分析这个堆栈，dev_watchdog 这次拿不到锁了，已经持续了22s，走查代码，它需要哪一把锁呢？

etif_tx_lock(dev);

static inline void netif_tx_lock(struct net_device *dev)
{
unsigned int i; int cpu;

spin\_lock(&dev->tx\_global\_lock);
cpu \= smp\_processor\_id(); for (i = 0; i < dev->num\_tx\_queues; i++) { struct netdev\_queue \*txq = netdev\_get\_tx\_queue(dev, i); /\* We are the only thread of execution doing a
     \* freeze, but we have to grab the \_xmit\_lock in
     \* order to synchronize with threads which are in
     \* the ->hard\_start\_xmit() handler and already
     \* checked the frozen bit. \*/ \_\_netif\_tx\_lock(txq, cpu);
    set\_bit(\_\_QUEUE\_STATE\_FROZEN, &txq->state);
    \_\_netif\_tx\_unlock(txq);
}

}

那这把锁被谁拿了呢？被 wit_kthread_xmit_fn 函数拿了，因为我们正在发包啊。

也就是说，我们的内核线程发包的时候，拿了锁，结果被中断，然后硬中断处理dev_watchdog的时候，也要拿这把锁，死锁了。

接下来，又引发其他cpu死锁：（每次进入_raw_spin_lock，rdi中必然存放的就是锁，x86反汇编看的话确实是rdi，其他架构不是如此）

[176246.611936] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [a.out:189281]-------------------------------cpu25
[176246.612901] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176246.612961] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176246.612969] CPU: 25 PID: 189281 Comm: a.out Tainted: G W OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176246.612970] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176246.612973] task: ffff882ba14a0000 ti: ffff882ba92c8000 task.ti: ffff882ba92c8000
[176246.612974] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x156/0x200 [176246.612982] RSP: 0018:ffff882ba92cb978 EFLAGS: 00000202 [176246.612983] RAX: 0000000000000101 RBX: 0000090190134ba9 RCX: 0000000000c90000
[176246.612985] RDX: 0000000000c90101 RSI: 0000000000000101 RDI: ffff882fb80a0bc0------------大家都在抢的香饽饽存放在此，就是这把锁
[176246.612986] RBP: ffff882ba92cb978 R08: ffff882fbe959700 R09: 0000000000000000 [176246.612988] R10: 0000000000000001 R11: ffffea017c3e5580 R12: ffff882c18a85c00
[176246.612989] R13: 00000000000005dc R14: ffff882ba2091100 R15: ffffffff811e63d1
[176246.612991] FS: 00007fbf25f9b700(0000) GS:ffff882fbe940000(0000) knlGS:0000000000000000 [176246.612993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [176246.612995] CR2: 0000000000c79000 CR3: 000000593a2ec000 CR4: 00000000003607e0
[176246.612996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [176246.612998] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [176246.612999] Call Trace:
[176246.613006] [] queued_spin_lock_slowpath+0xb/0xf [176246.613011] [] _raw_spin_lock+0x20/0x30 [176246.613015] [] sch_direct_xmit+0x13f/0x240------------qdisc中sys调用发包 [176246.613017] [] __qdisc_run+0x95/0x1c0 [176246.613022] [] __dev_queue_xmit+0x2c9/0x550-----------这个是实际物理网卡的 [176246.613025] [] dev_queue_xmit+0x10/0x20 [176246.613032] [] bond_dev_queue_xmit+0x32/0x80 [bonding]
[176246.613036] [] bond_start_xmit+0x1be/0x420 [bonding]
[176246.613039] [] dev_hard_start_xmit+0x90/0x1a0 [176246.613041] [] __dev_queue_xmit+0x448/0x550-----------这个是bond的 [176246.613044] [] dev_queue_xmit+0x10/0x20 [176246.613048] [] ip_finish_output+0x546/0x7a0 [176246.613050] [] ip_output+0x73/0xe0 [176246.613053] [] ? __ip_local_out_sk+0xf6/0x100 [176246.613055] [] ip_local_out_sk+0x37/0x40 [176246.613058] [] ip_send_skb+0x16/0x50 [176246.613062] [] udp_send_skb+0xac/0x2b0 [176246.613065] [] udp_push_pending_frames+0x3e/0x60 [176246.613068] [] udp_sendpage+0x119/0x1c0 [176246.613071] [] inet_sendpage+0x70/0xe0 [176246.613075] [] kernel_sendpage+0x1e/0x30 [176246.613079] [] sendfile_slowpath_2+0x1a8/0x1f0 [witdriver]
[176246.613084] [] sys_mycall+0x776/0x8f0 [witdriver]
[176246.613090] [] system_call_fastpath+0x1c/0x21 [176246.613091] Code: 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 85 c0 74 21 83 f8 03 75 10 eb 1a 66 2e 0f 1f 84 00 00 00 00 00 85 c0 74 0c f3 90 <8b> 17 0f b7 c2 83 f8 03 75 f0 be 01 00 00 00 eb 15 66 0f 1f 84

由于我们拿的是dev的锁，在没开启抢占的情况下不调度出去，但也并没有关中断，这样其他cpu在系统调用中要发包，也会去尝试拿这把锁，好吧，你也死锁了，等着吧。

可以想象，后续其他cpu要拿这把锁的话，大家都会挂，一个cpu的softlock导致其他cpu全挂，只要你这个cpu来拿那锁就会如此。

由于我们拿的是spin_lock,并没有关中断，所以它还可以响应中断，看如下堆栈就明白，最早打印网卡异常的9号cpu，堆栈在死的时候是如下这样的：

crash> bt
PID: 178838 TASK: ffff88290f7ddee0 CPU: 9 COMMAND: "kthread_send/9" #0 [ffff882fbe843a70] machine_kexec at ffffffff8105d77b
#1 [ffff882fbe843ad0] __crash_kexec at ffffffff8110aca2
#2 [ffff882fbe843ba0] panic at ffffffff816ad52f
#3 [ffff882fbe843c20] watchdog_timer_fn at ffffffff81135a51----------------------------hardlock检测中hrtimer的回调
#4 [ffff882fbe843c58] __hrtimer_run_queues at ffffffff810b93a6
#5 [ffff882fbe843cb0] hrtimer_interrupt at ffffffff810b993f
#6 [ffff882fbe843cf8] local_apic_timer_interrupt at ffffffff8105467b
#7 [ffff882fbe843d10] smp_apic_timer_interrupt at ffffffff816c9e83
#8 [ffff882fbe843d28] apic_timer_interrupt at ffffffff816c6732--------------------------死之前又增加的堆栈，这个是硬中断
#9 [ffff882fbe843dc8] queued_spin_lock_slowpath at ffffffff816adeee---------------------一开始打印hung的堆栈，在自旋
#10 [ffff882fbe843dd8] _raw_spin_lock at ffffffff816bb080
#11 [ffff882fbe843de8] dev_watchdog at ffffffff815bca52
#12 [ffff882fbe843e28] call_timer_fn at ffffffff8109a9c8
#13 [ffff882fbe843e60] run_timer_softirq at ffffffff8109ceed
#14 [ffff882fbe843ed8] __do_softirq at ffffffff8109404d
#15 [ffff882fbe843f48] call_softirq at ffffffff816c8afc
#16 [ffff882fbe843f60] do_softirq at ffffffff8102d435
#17 [ffff882fbe843f80] irq_exit at ffffffff81094495
#18 [ffff882fbe843f98] smp_apic_timer_interrupt at ffffffff816c9e88
#19 [ffff882fbe843fb0] apic_timer_interrupt at ffffffff816c6732 --- --- #20 [ffff882b680d3c28] apic_timer_interrupt at ffffffff816c6732
[exception RIP: ixgbe_xmit_frame_ring+83]
RIP: ffffffffc01299e3 RSP: ffff882b680d3cd0 RFLAGS: 00000212 RAX: 0000000000000562 RBX: 0000000000000001 RCX: 000000000000403d
RDX: ffff882fb9331c00 RSI: ffff8828d7b8fac0 RDI: 0000000000000001 RBP: ffff882b680d3d48 R8: 0000000000000008 R9: 0000a0a5447b9d78
R10: ffff8828c6e84f00 R11: 000000002b3000b8 R12: ffff8828c0291b00
R13: 0000000022300000 R14: 0000000000000001 R15: ffff882b680d3cc0
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #21 [ffff882b680d3d50] ixgbe_xmit_frame at ffffffffc012a918 [ixgbe]
#22 [ffff882b680d3d80] wit_send_tasklet at ffffffffc043b63c [witdriver]
#23 [ffff882b680d3e78] wit_kthread_xmit_fn at ffffffffc043ba95 [witdriver]
#24 [ffff882b680d3ec8] kthread at ffffffff810b5241
#25 [ffff882b680d3f50] ret_from_fork at ffffffff816c5577

总结：

1、解决办法很简单，就是在 wit_kthread_xmit_fn 函数中，加了一个关闭软中断就行，因为就是软中断和sys拿锁冲突。

2、分析软锁，最好从dmesg中开始分析，那个时候时间还早，会按时间顺序来打印堆栈，比起直接crash来分析要快一些，特别是在对自己模块代码比较熟悉的情况下。

转载于:https://www.cnblogs.com/10087622blog/p/9770945.html

手机扫一扫

移动阅读更方便

你可能感兴趣的文章