We have two identical racks of Dell servers that each include an R415 and an R515. These four servers were rebooting periodically, typically within 6 weeks, but sometimes within a couple of weeks or a couple of days. No kernel panic, no operating system errors, just system reboots, as if the plug was pulled and reinserted.
Please note: the problem described below has not been resolved on these machines. Varying the BMC/iDRAC firmware levels and the specific configuration of the IPMI enables seems to change the interval between reboots, but the reboots do not appear to have stopped.
For more background information, read on, but please bear in mind that the problem still exists. This page will be updated when more information is available…
None of the other servers in either rack was showing this problem, despite having a very similar configuration (operating system, network configuration, etc.). The racks are in different data-centres, so environmental differences seemed unlikely.
The BMC/iDRAC6 system event logs variously showed faulty DIMMs (ECC failures), CPU machine checks, power supply sensor communication errors, and OEM diagnostic events, sometimes with dates in the 1970s. The hardware was quite obviously fine: there is only a slim chance of genuine failures occurring across four servers bought in two batches a couple of months apart.
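For anyone wanting to inspect their own logs, the SEL entries and the BMC's clock can be dumped with ipmitool. This is a generic sketch rather than anything specific to these servers:

```shell
# Show the BMC's idea of the current time. Bogus 1970s dates in SEL
# entries suggest the timestamp source itself is being corrupted.
ipmitool sel time get

# Dump the System Event Log with decoded sensor names and timestamps.
ipmitool sel elist
```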
Downgrading the BMC/iDRAC6 firmware from 1.70 to 1.54 made no difference, nor did upgrading the iDRAC6 to 1.80 (no upgrade was available for the BMC). All other firmware was updated to the latest available as of February 2012, but the reboots continued.
Dell’s support were little more than a waste of my time. Despite the obvious conclusion that the hardware components were not actually faulty, and that the logged errors were red herrings, they insisted on sending an engineer to replace parts, with no rationale other than to appear to be doing something. I didn’t have time to waste pointlessly messing about with production servers.
I was suspicious that the servers giving problems had chipsets reported by lspci as ‘Dell Device 0488’ and ‘Dell Device 0489’ – these servers were all very similar in terms of their motherboards.
It seemed very much like the BMC was receiving corrupt data, either on the SMBus or else somehow reaching the Event Message Buffer.
I tried setting PCI registers to prevent any initialisation of the PCI->SMBus bridge, in case another PCI device was corrupting the bus in a way that was pushing random data onto the SMBus, but this made no difference.
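For reference, the sort of thing I tried looked roughly like the following. The bus address is a hypothetical example and will differ per system, so treat this as a sketch, not a recipe:

```shell
# Find the PCI->SMBus bridge. The address 00:14.0 used below is a
# hypothetical example; check your own lspci output.
lspci | grep -i smbus

# Clear the I/O-space and memory-space enable bits in the device's
# PCI command register, so the host stops decoding accesses to it.
# (Assumes nothing re-enables the device afterwards.)
setpci -s 00:14.0 COMMAND=0000
```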
Periodically cold restarting the BMC or iDRAC6 effectively made no difference.
My last attempt to avoid these reboots was to disable the BMC’s Event Message Buffer and System Event Log. I also took the opportunity to set the BMC time correctly (it was out by an hour on my servers), and I decided to reboot the BMC before disabling the event buffer and system event log. So, in order:
ipmitool sel time set now
ipmitool mc reset cold
(wait a minute or two)
ipmitool bmc setenables event_msg=off
ipmitool bmc setenables system_event_log=off
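Wrapped up as a small script, the sequence looks like this. The two-minute sleep is an arbitrary safety margin for the BMC to come back after the cold reset, and the final verification step is an assumption that your firmware reports the enables the same way mine does:

```shell
#!/bin/sh
# Sync the BMC clock, cold-reset the controller, then disable the
# Event Message Buffer and System Event Log once it is back up.
set -e

ipmitool sel time set now
ipmitool mc reset cold

# Give the BMC time to come back after the cold reset.
sleep 120

ipmitool bmc setenables event_msg=off
ipmitool bmc setenables system_event_log=off

# Confirm the result: getenables lists each option with its state.
ipmitool bmc getenables
```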
I had deliberately left the event_msg option enabled on one of the four servers to serve as a control for the experiment, and sure enough, this was the one that rebooted (after six weeks). I then disabled the option on that machine too.
So far at least, disabling the Event Message Buffer seems to help prevent the reboots, although the two R415 servers did reboot after three months. By then the event_msg option had been re-enabled in the BMC, I suspect by the OpenManage software on startup, so I can’t tell whether it had been re-enabled prior to the reboot.
I’ve decided to cold-reset the BMC once every 24 hours to see whether this prevents the corruption and the three-monthly reboots on the R415s. Time will tell…
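The daily cold reset is just a cron job. A minimal sketch (the file path and the 04:00 run time are arbitrary choices, not anything mandated):

```shell
# /etc/cron.d/bmc-reset (hypothetical path): cold-reset the BMC at
# 04:00 every day, in the hope of clearing any accumulating corruption.
0 4 * * * root /usr/bin/ipmitool mc reset cold
```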
I may try re-enabling the System Event Log if and when I feel convinced that the reboot problem is solved, but for now I’d rather have a stable motherboard than system management data, especially if the management controller itself is causing the problems!
Based on my own experience, and similar experiences I’ve seen reported by others on the Internet, I suspect that there is a bug in Dell’s BMC that manifests itself in various ways according to the system configuration. This may, of course, be in response to other errors, such as the PSU firmware corrupting the SMBus, but the BMC should be resilient against external anomalies. I guess that for many people it’s benign, but under certain configurations it will have more obvious effects.
If I’m wrong, then the bus corruption must be directly affecting another component – e.g. the reset is not being mediated by the BMC, but rather the corruption is inadvertently telling the CPU to reset. The CPU communications look to be on an entirely separate bus (according to one system management diagram that I’ve seen), but I can’t find enough definitive information.
If you’ve had similar problems with Dell 11G servers that sound like they might be BMC related, please leave a comment to share your experience.