一次mmc0: Timeout waiting for hardware interrupt问题分析过程
阅读原文时间:2021年04月20日阅读:1

1、单板环境


xilinx zynq7z15 soc,Arm cortex A9双核。外设:mmc、emmc、spi、qspi、i2c、gpio、can、uart、cdns-gem(网口)、dsp、usb等。

其中emmc来自arasan,文件系统选用btrfs。内核版本3.10。

2、问题现象

开源测试套件在btrfs文件系统中运行(由于单板内存较低),出现如下异常打印:

[sched_delayed] sched: RT throttling activated
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.
mmcblk0: error -110 sending status command, retrying
mmc0: Timeout waiting for hardware interrupt.
mmcblk0: error -110 sending status command, retrying
mmc0: Timeout waiting for hardware interrupt.
mmcblk0: error -110 sending status command, aborting
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.
btrfs: bdev /dev/mmc1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
mmc0: card 0001 removed
btrfs: bdev /dev/mmc1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/mmc1 errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/mmc1 errs: wr 3, rd 1, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/mmc1 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
btrfs: bdev /dev/mmc1 errs: wr 5, rd 1, flush 0, corrupt 0, gen 0
BTRFS error (device mmcblk0p1) in btrfs_commit_transaction:1815: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device mmcblk0p1): forced readonly
BTRFS warning (device mmcblk0p1): Skipping commit of aborted transaction.
------------[ cut here ]------------
WARNING: at .....linux/fs/btrfs/super.c:254 __btrfs_abort_transaction+0x48/0xc8()
Modules linked in:
CPU: 0 PID: 744 Comm: btrfs-transacti Not tainted ........... #5
[<c000f788>] (unwind_backtrace+0x0/0xe0) from [<c000c9ac>] (show_stack+0x10/0x14)
[<c000c9ac>] (show_stack+0x10/0x14) from [<c001fc20>] (warn_slowpath_common+0x4c/0x6c)
[<c001fc20>] (warn_slowpath_common+0x4c/0x6c) from [<c001fc6c>] (warn_slowpath_fmt+0x2c/0x3c)
[<c001fc6c>] (warn_slowpath_fmt+0x2c/0x3c) from [<c01e4f5c>] (__btrfs_abort_transaction+0x48/0xc8)
[<c01e4f5c>] (__btrfs_abort_transaction+0x48/0xc8) from [<c020d924>] (btrfs_commit_transaction+0xa94/0xdbc)
[<c020d924>] (btrfs_commit_transaction+0xa94/0xdbc) from [<c0205ff0>] (transaction_kthread+0x150/0x1dc)
[<c0205ff0>] (transaction_kthread+0x150/0x1dc) from [<c003f8c0>] (kthread+0x98/0xa0)
[<c003f8c0>] (kthread+0x98/0xa0) from [<c0009288>] (ret_from_fork+0x14/0x2c)
---[ end trace ee1c2575b6a806c3 ]---
BTRFS error (device mmcblk0p1) in cleanup_transaction:1455: errno=-5 IO failure
delayed_refs has NO entry
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.
mmc0: Timeout waiting for hardware interrupt.

异常打印后,大多数时候,btrfs还能继续使用,没有明确具体影响。偶尔出现btrfs无法再挂载,甚至单板挂死的现象。问题必现。

3、问题分析

3-1  从驱动着手

由于问题出现看起来都是中断等待超时,首先怀疑驱动存在问题。

1)驱动bug修复: 搜索开源补丁, 主要关注sdhci-of-arasan.c和sdhci.c驱动中的bug修复,合入怀疑有相关性的补丁,sdhci-of-arasan.c甚至整体替换,单没有效果。
2)驱动整体升级:mmc整体升级为3.14, emmc识别出现错误,分析原因进展较慢,最后选择放弃。

3)调大mmc中断超时等待时间,效果不明显。

3-2 寄存器配置分析

配置中断超时寄存器信息打印:

mmc0: Timeout waiting for hardware interrupt.
sdhci: =========== REGISTER DUMP (mmc0)===========
sdhci: Sys addr: 0x02742000 | Version:  0x00008901
sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000000
sdhci: Argument: 0x00000038 | Trn mode: 0x00000033
sdhci: Present:  0x01f20000 | Host ctl: 0x00000017
sdhci: Power:    0x0000000f | Blk gap:  0x00000000
sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
sdhci: Timeout:  0x0000000e | Int stat: 0x00000001
sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
sdhci: AC12 err: 0x00000000 | Slot int: 0x00000001
sdhci: Caps:     0x69ec0080 | Caps_1:   0x00000000
sdhci: Cmd:      0x0000171a | Max curr: 0x00000001

sdhci: Host ctl2: 0x00000000

可以看到,中断状态寄存器为0x00000001,说明mmc应该产生了中断,但是cpu未处理导致超时。可能原因 1)mmc中断被屏蔽  2)cpu处于关中断状态

复现过程中,当  mmc0: Timeout waiting for hardware interrupt出现时,串口和网口无反应,排除原因1),同时屏蔽多种外设的可能性很小。
mmc0: Timeout waiting for hardware interrupt.周期性打印,排除2),如果cpu关中断,时钟中断也应该不响应,但超时打印连续出来,说明时钟中断正常。

3-3 排除其他外设影响

为了排除其他外设影响导致的mmc中断问题,关闭其他外设,仅保留网口和串口,问题仍然复现。

3-4 中断线程化的影响

根据前面的分析,开始怀疑是否为3.10内核一直存在的问题。手中正好有其它架构单板,使用sd卡 + vfat进行相同测试,问题没有复现。但中途看到[sched_delayed] sched: RT throttling activated 打印,同时注意到上述异常打印中也有[sched_delayed] sched: RT throttling activated信息。了解到此单板实时界别配置较低,没有中断线程化。

进过反复验证,mmc中断超时异常与是否配置中断线程化相关。开始分析开源测试套件代码,发现套件中有99级别的任务,最终导致目前的问题。