I am currently working through an issue with my 3-node RAC clusters (RHEL 5.6 x86_64 and Oracle RAC 11g running on HP DL580-G7 servers). They seem to enjoy rebooting themselves at will. There is no glaring root cause, other than some messages in the syslog about tasks being blocked for more than 120 seconds. Anyhow, after quite a bit of research I have discovered something I really like about Linux: the disk I/O scheduler is modular (in a sense), so each block device can use any one of four schedulers (noop, anticipatory, deadline or cfq). CCISS is the HP Smart Array driver; it is loaded on these servers and should be consistent across most Linux releases.
If you look at the "scheduler" file for a device, you can see your four options. The one in use is surrounded by brackets. I am hoping that by changing the scheduler to noop for ONLY the cciss device, my reboots go away, and I leave a positive legacy behind at my customer site ;-)
root@dbslp0067:/root
# cd /sys/block/cciss\!c0d0/queue/
root@dbslp0067:/sys/block/cciss!c0d0/queue
# cat scheduler
noop anticipatory deadline [cfq]
root@dbslp0067:/sys/block/cciss!c0d0/queue
# echo noop > scheduler
root@dbslp0067:/sys/block/cciss!c0d0/queue
# cat scheduler
[noop] anticipatory deadline cfq
root@dbslp0067:/sys/block/cciss!c0d0/queue
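To confirm that ONLY the cciss device changed, it can help to print the active scheduler for every block device at once. The following is a minimal sketch, not something from my original session: the function names `active_scheduler` and `list_schedulers` are made up for illustration, and the optional directory argument exists only so the loop can be pointed somewhere other than /sys/block.

```shell
# Print the bracketed (active) entry from a scheduler line such as
# "noop anticipatory deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "device<TAB>active scheduler" for every block device that
# exposes a queue/scheduler file. Defaults to /sys/block.
list_schedulers() {
    sysroot="${1:-/sys/block}"
    for f in "$sysroot"/*/queue/scheduler; do
        [ -r "$f" ] || continue            # skip if the glob matched nothing
        dev=${f#"$sysroot"/}               # strip the leading directory
        dev=${dev%/queue/scheduler}        # strip the trailing path
        printf '%s\t%s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Running `list_schedulers` with no arguments after the change above should show noop next to cciss!c0d0 and leave the other devices on whatever they were using before.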
After trying this work-around on my system, I'm disappointed to report it did not help my cause. I will leave this out there, as I may need to tune a system for this at a later point.
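For that later tuning, keep in mind that a value echoed into sysfs does not survive a reboot. One option is to repeat the change at boot time, for example from /etc/rc.local. Below is a rough sketch of a helper for that; `set_scheduler` is a hypothetical name, and the third argument is only there so the function can be tried against a scratch directory instead of /sys/block. It refuses to write a scheduler the kernel does not list for the device.

```shell
# Hypothetical helper: set the I/O scheduler for a single device,
# but only if that scheduler is listed in the device's scheduler file.
set_scheduler() {
    dev=$1
    sched=$2
    file="${3:-/sys/block}/$dev/queue/scheduler"

    [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }

    # Remove the brackets around the active entry, then look for an exact match
    for s in $(sed 's/[][]//g' "$file"); do
        if [ "$s" = "$sched" ]; then
            echo "$sched" > "$file"
            return 0
        fi
    done

    echo "$sched is not offered for $dev" >&2
    return 1
}
```

A line such as `set_scheduler 'cciss!c0d0' noop` in /etc/rc.local (with the function defined above it) would then reapply the setting on each boot. The `elevator=noop` kernel boot parameter is the other common route, but it changes the default for every device, not just the cciss one.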
The root cause turned out to be a bad "CPU on a SAN switch blade". I am not sure why multipath didn't handle the event better; instead, the box locked up and subsequently rebooted itself. I might have to investigate the multipath tunables.