Zabbix was frequently flagging I/O errors on this system, and when I looked into it I found the kernel error "BUG: soft lockup - CPU#x" in the logs, so here is a memo on how I dealt with it.

Environment

  • ESXi 6.0
  • CentOS 7.2

The error

When I checked /var/log/messages, the following error had been logged.

Jun 29 03:30:38 HOSTNAME kernel: BUG: soft lockup - CPU#1 stuck for 23s! [xfsaild/dm-0:401]
Jun 29 03:30:38 HOSTNAME kernel: Modules linked in: ip6t_rpfilter ipt_REJECT ip6t_REJECT xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock coretemp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ppdev sg vmw_balloon parport_pc parport shpchp i2c_piix4 vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw vmwgfx e1000 drm_kms_helper
Jun 29 03:30:38 HOSTNAME kernel: ttm mptspi scsi_transport_spi mptscsih drm mptbase ata_piix i2c_core libata floppy dm_mirror dm_region_hash dm_log dm_mod
Jun 29 03:30:38 HOSTNAME kernel: CPU: 1 PID: 401 Comm: xfsaild/dm-0 Tainted: G             L ------------   3.10.0-327.el7.x86_64 #1
Jun 29 03:30:38 HOSTNAME kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
Jun 29 03:30:38 HOSTNAME kernel: task: ffff880135d96780 ti: ffff880036380000 task.ti: ffff880036380000
Jun 29 03:30:38 HOSTNAME kernel: RIP: 0010:[]  [] mpt_put_msg_frame+0x5e/0x80 [mptbase]
Jun 29 03:30:38 HOSTNAME kernel: RSP: 0018:ffff880036383b00  EFLAGS: 00000246
Jun 29 03:30:38 HOSTNAME kernel: RAX: ffffc90008780000 RBX: ffff880036383ad0 RCX: 0000000000000018
Jun 29 03:30:38 HOSTNAME kernel: RDX: ffff880135c4f600 RSI: ffff880139157000 RDI: 000000000000000e
Jun 29 03:30:38 HOSTNAME kernel: RBP: ffff880036383b10 R08: 0000000000000002 R09: ffff88003640b0d8
Jun 29 03:30:38 HOSTNAME kernel: R10: ffff8800ba2c16c0 R11: ffff8800ba2c16c0 R12: 0000000100000001
Jun 29 03:30:38 HOSTNAME kernel: R13: ffff880097c20300 R14: ffff880036383a74 R15: ffffffff812d06bb
Jun 29 03:30:38 HOSTNAME kernel: FS:  0000000000000000(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
Jun 29 03:30:38 HOSTNAME kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 29 03:30:38 HOSTNAME kernel: CR2: 00007fc2c650f000 CR3: 00000000ba1e0000 CR4: 00000000000407e0
Jun 29 03:30:38 HOSTNAME kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 29 03:30:38 HOSTNAME kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 29 03:30:38 HOSTNAME kernel: Stack:
Jun 29 03:30:38 HOSTNAME kernel: ffff880139157d30 000000000000003c ffff880036383bb8 ffffffffa008b729
Jun 29 03:30:38 HOSTNAME kernel: ffff880139157008 0400000036383b78 0000000000000018 ffff8800ba2c16c0
Jun 29 03:30:38 HOSTNAME kernel: 0000000000000060 ffff880139157188 0000006000000018 ffff880036603130
Jun 29 03:30:38 HOSTNAME kernel: Call Trace:
Jun 29 03:30:38 HOSTNAME kernel: [] mptscsih_qcmd+0x249/0x820 [mptscsih]
Jun 29 03:30:38 HOSTNAME kernel: [] mptspi_qcmd+0x50/0xe0 [mptspi]
Jun 29 03:30:38 HOSTNAME kernel: [] scsi_dispatch_cmd+0xaa/0x230
Jun 29 03:30:38 HOSTNAME kernel: [] scsi_request_fn+0x501/0x770
Jun 29 03:30:38 HOSTNAME kernel: [] __blk_run_queue+0x33/0x40
Jun 29 03:30:38 HOSTNAME kernel: [] queue_unplugged+0x2a/0xa0
Jun 29 03:30:38 HOSTNAME kernel: [] blk_flush_plug_list+0x1d8/0x230
Jun 29 03:30:38 HOSTNAME kernel: [] blk_finish_plug+0x14/0x40
Jun 29 03:30:38 HOSTNAME kernel: [] __xfs_buf_delwri_submit+0x1e9/0x250 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] ? xfs_buf_delwri_submit_nowait+0x2f/0x50 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] xfs_buf_delwri_submit_nowait+0x2f/0x50 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] xfsaild+0x240/0x5e0 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jun 29 03:30:38 HOSTNAME kernel: [] kthread+0xcf/0xe0
Jun 29 03:30:38 HOSTNAME kernel: [] ? kthread_create_on_node+0x140/0x140
Jun 29 03:30:38 HOSTNAME kernel: [] ret_from_fork+0x58/0x90
Jun 29 03:30:38 HOSTNAME kernel: [] ? kthread_create_on_node+0x140/0x140
Jun 29 03:30:38 HOSTNAME kernel: Code: 8b 96 68 01 00 00 0f b7 c8 44 03 a6 b0 01 00 00 44 8b 04 8a 45 09 c4 f686 e0 00 00 00 04 75 10 48 8b 83 e8 00 00 00 44 89 60 40 <5b> 41 5c 5d c3 48 8d 76 08 0f b7 c8 44 89 e2 48 c7 c7 78 c3 0c


Since it happened in the middle of the night there was no real impact, but when I checked the monitoring status in Zabbix, I/O-related errors were showing up frequently. They settled down within a few minutes, though, and everything returned to normal.
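To see how often this was happening and how long each stall lasted, the reported duration can be pulled out of the log with sed. A minimal sketch, run here against the sample line from the log above (in practice you would grep /var/log/messages instead):

# Sample line as logged above; in practice: grep 'soft lockup' /var/log/messages
line='Jun 29 03:30:38 HOSTNAME kernel: BUG: soft lockup - CPU#1 stuck for 23s! [xfsaild/dm-0:401]'
# Extract the stall duration in seconds ("23" for this line)
echo "$line" | sed -n 's/.*stuck for \([0-9]*\)s.*/\1/p'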

When I looked into it, the VMware knowledge base article "A soft lockup message is output from a Linux kernel running on an SMP-enabled virtual machine (2094326)" explains:

 The Linux kernel provides a watchdog thread to detect soft lockups; if this watchdog thread is not scheduled for more than 10 seconds, a soft lockup message is printed.

As a countermeasure, it gives the statement below, but seeing the message still bothered me, so I wanted to deal with it anyway.

A soft lockup message is not a kernel failure and can be safely ignored.

The VMware page above also shows how to change the soft lockup threshold, but that is hardly a fundamental fix, is it? Searching further, I found an explanation of the "BUG: soft lockup - CPU#x stuck" message from someone who appeared to be an expert.
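The threshold change the KB article refers to can be made at runtime through sysctl. A sketch, assuming a RHEL/CentOS 7 kernel, where a soft lockup is reported after roughly 2 × kernel.watchdog_thresh seconds (the default threshold of 10 is consistent with the "stuck for 23s" in the log above):

# Show the current threshold (default is 10 on RHEL/CentOS 7)
cat /proc/sys/kernel/watchdog_thresh

# Raise it at runtime (root required)
sysctl -w kernel.watchdog_thresh=30

# Persist the change across reboots
echo 'kernel.watchdog_thresh = 30' >> /etc/sysctl.conf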

In an overcommitted VM environment, a VM does not always get a physical CPU: when other VMs are busy, no resources are allocated to an idle VM unless this is explicitly controlled, for example with resource pools.

Sure enough, the environment showing the message was fairly overcommitted. In my experience, memory is the tighter resource under overcommit, because even a small amount of overhead produces a visible performance difference; CPU overcommit rarely causes trouble, since not all VMs are busy at the same time.
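One way to check the overcommit theory from inside the guest is to look at steal time, the time the hypervisor spent running other VMs instead of this one (the %st column in top/vmstat). A sketch reading the cumulative steal counter from /proc/stat:

# Steal time is the 9th field of the aggregate "cpu" line in /proc/stat
# (user nice system idle iowait irq softirq steal ...). A counter that keeps
# growing while the guest is otherwise idle points at CPU overcommit.
awk '/^cpu /{print "steal jiffies:", $9}' /proc/stat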

Workaround

For reference, what I did was the following.

# vi /boot/config-3.10.0-327.el7.x86_64
CONFIG_LOCKUP_DETECTOR=y
↓
CONFIG_LOCKUP_DETECTOR=n

Then reboot:

# shutdown -r now

(One caveat: /boot/config-* is only a record of the options the kernel was built with, so editing this file does not by itself change the behavior of the running kernel; a runtime setting is needed as well.)

That's it.
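Since /boot/config-* only records build-time options, the watchdog is better disabled through its runtime knobs. A sketch, assuming CentOS 7 (all commands need root):

# Disable the lockup watchdog at runtime
sysctl -w kernel.watchdog=0

# Persist across reboots
echo 'kernel.watchdog = 0' >> /etc/sysctl.conf

# Alternatively, disable only the soft-lockup check via a kernel boot
# parameter (applied to all installed kernels, takes effect after reboot)
grubby --update-kernel=ALL --args=nosoftlockup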
