Exploring DRBD: Notes for a Newbie
For your reference, and mine!
Over the past month I've been converting an existing pair of NFS/ext4 file servers (live & hot-spare with regular rsync-based synchronization) to a two-node DRBD high-availability cluster running protocol C. It's been a learning experience, to say the least. And while the DRBD documentation is excellent, there were some concepts and tasks that did not immediately sink in. The notes below are for my future reference, and for anyone else who is new to DRBD.
- Making a DRBD resource mount on boot is not trivial. I've left my fstab entry set to noauto. Why? The thought of a node rebooting, becoming stale, and then automatically promoting itself to primary scares me!
- A DRBD resource that is a protocol C secondary is not mountable. You'll get errors like:
    root@umhlanga:/home/stu# mount /mnt/data
    mount: block device /dev/drbd1 is write-protected, mounting read-only
    mount: Wrong medium type
- If you want a DRBD resource to come up as primary from a cold boot, then setting up a heartbeat is a requirement. But you don't have to do this for a basic, first-step HA cluster.
- Add the startup timeout options to the configuration file. If your kit is at a data center, without them you may end up making a site visit because of a hung boot. It'll sit there and wait until it is done. My /etc/drbd.conf snippet:
    common {
      protocol C;
      startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
        wfc-timeout 600;
        degr-wfc-timeout 600;
        outdated-wfc-timeout 600;
      }
    }
- A fundamental task to perform is querying the state of the DRBD resource(s). There are two ways to do this: reading /proc/drbd and running drbd-overview. Example:
    stu@umhlanga:~$ cat /proc/drbd
    version: 8.3.7 (api:88/proto:86-91)
    GIT-hash: ... build by root@umhlanga, 2010-07-28 11:28:28
     1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
        ns:0 nr:0 dw:0 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9765272500
    stu@umhlanga:~$
    stu@umhlanga:~$ sudo drbd-overview
    [sudo] password for stu:
      1:r0 WFConnection Primary/Unknown UpToDate/DUnknown C r---- /mnt/data ext4 9.0T 1.5T 7.5T 17%
    stu@umhlanga:~$
  This output contains lots of important information. The details of each method and its output can be found on DRBD's documentation page 'Checking DRBD Status'.
- The high level steps required to bring up a DRBD resource:
- kernel starts
- network starts
- DRBD starts
- bring up DRBD resource
- make primary (if it is the primary)
- mount resource
- After creating a resource and rebooting, you may get this message from drbdadm up resource-name. Don't follow the instructions in the error message.
    drbdadm up r0
    1: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup 1 disk /dev/sdb /dev/sdb internal --set-defaults --create-device' terminated with exit code 10
- If you, like me, are exporting the mounted file system with NFS, then be careful with the auto-mount configuration in /etc/fstab.
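The manual bring-up steps listed above (bring up the resource, make it primary, mount it) can be sketched as a small shell function. This is only a sketch under a few assumptions: the function names are made up for illustration, the fstab entry for the mount point is marked noauto, and a failing drbdadm up is taken to mean the resource was already brought up at boot, which is my reading of the "Device is attached" message rather than something the documentation confirmed for me. It checks the local disk state with drbdadm dstate before promoting, so a stale node refuses to become primary.

```shell
#!/bin/sh
# Sketch only: bring up a DRBD resource by hand, refusing to promote stale data.
# Assumes the fstab entry for the mount point is marked noauto.

# Print the local half of a "local/peer" pair such as "UpToDate/DUnknown".
local_half() {
    printf '%s\n' "${1%%/*}"
}

# bring_up RESOURCE MOUNTPOINT   (hypothetical helper, not part of DRBD)
bring_up() {
    res="$1"
    mountpoint="$2"

    # Assumption: "up" may fail with "Device is attached to a disk" when the
    # resource is already up, so a failure here is reported but not fatal.
    drbdadm up "$res" || echo "note: 'drbdadm up $res' failed; it may already be up" >&2

    # Only promote when the local disk state is UpToDate.
    dstate=$(drbdadm dstate "$res") || return 1
    local_ds=$(local_half "$dstate")
    if [ "$local_ds" != "UpToDate" ]; then
        echo "refusing to promote $res: local disk state is $local_ds" >&2
        return 1
    fi

    drbdadm primary "$res" || return 1
    mount "$mountpoint"
}
```

With the function loaded into a root shell, bring_up r0 /mnt/data would run the three steps in order on the node that is meant to be primary.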
Spooky mcelog messages: MCA: Generic CACHE Level-2 Generic Error
New error message haunting my fresh Debian Lenny installation
In mid-June I installed Debian Lenny onto an old HP DL380 of ours. Since then there have been six of these Generic CACHE Level-2 Generic Error messages:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 6 BANK 3 TSC 1e600ed894af4
ADDR 1719edeb0
MCG status:
MCi status:
Error enabled
MCi_ADDR register valid
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000420001010a MCGSTATUS 0
Shortly after installing Debian, I also dropped in a second quad core CPU and doubled the RAM to 8GB. Coincidence? Probably not. Then again, I don't have ready access to the old RHEL4 system logs. (They are with our old hosting vendor.)
What exactly does this mean? Is it a recoverable condition? Is it a warning sign of future problems to come?
Spooky.
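One way to start answering those questions is to decode the flag bits in the STATUS value from the mcelog output. The sketch below assumes the standard x86 machine-check layout for MCi_STATUS (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, with the MCA error code in the low 16 bits). The function name is made up for illustration, and the vendor's manual remains the authority on what each bit means for a particular CPU.

```shell
#!/bin/sh
# Sketch only: decode the architectural flag bits of an MCi_STATUS value.
# Assumed bit positions: 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC.

# decode_mci_status HEX16   (16 hex digits, no 0x prefix; hypothetical helper)
decode_mci_status() {
    # Split into two 32-bit halves so shell arithmetic never overflows.
    hi_hex=$(printf '%s' "$1" | cut -c1-8)
    lo_hex=$(printf '%s' "$1" | cut -c9-16)
    hi=$(( 0x$hi_hex ))
    lo=$(( 0x$lo_hex ))

    for flag in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57; do
        name=${flag%%:*}
        bit=${flag#*:}
        # Bits 32..63 live in the high half, so shift by (bit - 32).
        printf '%s=%d\n' "$name" $(( (hi >> (bit - 32)) & 1 ))
    done

    # The low 16 bits hold the MCA error code.
    printf 'MCACOD=0x%04x\n' $(( lo & 0xffff ))
}

decode_mci_status 942000420001010a
```

For the STATUS above this reports EN=1 and ADDRV=1, matching the "Error enabled" and "MCi_ADDR register valid" lines, along with UC=0 and PCC=0. Under the assumed layout, UC=0 means the hardware flagged the error as corrected rather than uncorrected, which is at least a little less spooky.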
Simple linux bandwidth monitoring with bwm-ng
Where has this tool been all my life?
These past few days I've been running some load tests on new servers, configurations and our Internet link. While we have a Cacti installation graphing everything via SNMP, the 5 minute polling interval means waiting for what seems like forever.
I just want some quick and clean bandwidth statistics!
After some years of needing a simple tool, finally bwm-ng has entered my life. Where have you been all these years, bwm-ng? I assume that bwm-ng is short for "bandwidth monitor, next generation."
My only complaint is that when monitoring network interfaces, ethernet bonds are treated as equals to the real interfaces rather than the virtual cumulative devices they are.
Other than that, this tool rocks. Some features that are most pleasing:
- Displays different units: bps to Mbps, Bps to MBps
- Network interfaces and hard drives
- Simple, clean and easy to comprehend user interface
- 100ms polling interval granularity
- apt-get installation on Debian Lenny
#geekloveatfirstsight
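For the curious, the arithmetic behind a short polling interval and the bps units is simple enough to sketch by hand. This is not bwm-ng's code, just an illustration of how two counter samples become a rate. The function names are made up, and it assumes a Linux system that exposes per-interface counters under /sys/class/net.

```shell
#!/bin/sh
# Sketch only: the arithmetic behind a polling bandwidth monitor.
# This is not how bwm-ng is implemented; it only shows the idea.

# rate_bps BYTES_BEFORE BYTES_AFTER INTERVAL_MS -> bits per second
rate_bps() {
    echo $(( ($2 - $1) * 8 * 1000 / $3 ))
}

# rx_bytes IFACE: read the receive byte counter for one interface.
rx_bytes() {
    cat "/sys/class/net/$1/statistics/rx_bytes"
}

# watch_rx IFACE: print the receive rate once per second until interrupted.
watch_rx() {
    prev=$(rx_bytes "$1")
    while sleep 1; do
        cur=$(rx_bytes "$1")
        echo "$1 rx: $(rate_bps "$prev" "$cur" 1000) bps"
        prev=$cur
    done
}
```

For example, 12500 bytes received during a 100ms interval works out to 1000000 bps, i.e. 1 Mbps.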
Multiple bonds on Debian Lenny, and related No such device error
SIOCSIFADDR: No such device on Debian Lenny with NIC bonding

While setting up a few Debian Lenny machines this summer, I came across the error message SIOCSIFADDR: No such device a few times. All of these servers have NIC bonding configured, and a few of them have multiple Ethernet bonds. Here are a couple of potential causes for this error message:
- The /etc/modprobe.d/arch/X86_64 file does not contain the bonded device name. For multiple bonded devices, the file must contain an alias entry for each. Here is an example for a system with two bonded devices named bond0 and bond1:
    alias bond0 bonding
    alias bond1 bonding
- The bonding entry in /etc/modules (bonding mode=4 miimon=100 max_bonds=2) is configured for fewer bonds than the server has. Here, max_bonds=2 is the maximum number of bonding devices your system will have. The default is 1. If your machine has more, then SIOCSIFADDR: No such device will appear for the devices that did not come up.
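The second cause above can be checked mechanically. The sketch below compares the max_bonds value on the bonding line with the number of bondN interfaces declared. It assumes the bonds are declared as "iface bondN ..." stanzas in a Debian-style interfaces file, which the notes above do not spell out, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare max_bonds against the number of bondN interfaces declared.
# Both paths are arguments so the check can be tried on copies of the files.

# check_max_bonds MODULES_FILE INTERFACES_FILE   (hypothetical helper)
check_max_bonds() {
    # Pull N out of "max_bonds=N" on the bonding line; the default is 1.
    max_bonds=$(sed -n 's/^bonding[[:space:]].*max_bonds=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    max_bonds=${max_bonds:-1}

    # Count distinct "iface bondN ..." stanzas (assumed declaration style).
    declared=$(awk '$1 == "iface" && $2 ~ /^bond[0-9]+$/ { seen[$2] = 1 }
                    END { n = 0; for (k in seen) n++; print n }' "$2")

    if [ "$declared" -gt "$max_bonds" ]; then
        echo "max_bonds=$max_bonds but $declared bonds declared"
        return 1
    fi
    echo "max_bonds=$max_bonds covers $declared declared bonds"
}
```

On a Debian box this would typically be run as check_max_bonds /etc/modules /etc/network/interfaces.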
Adding a hot spare to a P400 controller in an HP DL380 at the command line in linux
Just so I don't forget. Again.
berea:~# hpacucli
HP Array Configuration Utility CLI 8.28-13.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.
=> ctrl all show
Smart Array P400 in Slot 1 (sn: P61620D9SUK034)
=> ctrl slot=1 show
Smart Array P400 in Slot 1
Bus Interface: PCI
Slot: 1
Serial Number: P61620D9SUK034
Cache Serial Number: PA82C0H9SUJL94
RAID 6 (ADG) Status: Disabled
Controller Status: OK
Chassis Slot:
Hardware Revision: Rev D
Firmware Version: 7.08
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 100% Read / 0% Write
Drive Write Cache: Disabled
Total Cache Size: 256 MB
No-Battery Write Cache: Disabled
Battery/Capacitor Count: 0
SATA NCQ Supported: True
=> ctrl slot=1 array all show
Smart Array P400 in Slot 1
array A (SAS, Unused Space: 0 MB)
array B (SAS, Unused Space: 0 MB)
=> ctrl slot=1 ld all show
Smart Array P400 in Slot 1
array A
logicaldrive 1 (68.3 GB, RAID 1, OK)
array B
logicaldrive 2 (136.7 GB, RAID 5, OK)
=> ctrl slot=1 pd all show
Smart Array P400 in Slot 1
array A
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 72 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 72 GB, OK)
array B
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 72 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 72 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 72 GB, OK)
unassigned
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 73.5 GB, OK)
=> ctrl slot=1 array a add spares=2I:1:3
=> ctrl slot=1 pd all show
Smart Array P400 in Slot 1
array A
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 72 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 72 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 73.5 GB, OK, spare)
array B
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 72 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 72 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 72 GB, OK)
=>
=> exit
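To find spare candidates without reading the listing by eye, the "pd all show" output can be filtered for drives under the unassigned heading. This is a sketch with a made-up function name. It reads the listing on stdin, and feeding it from a one-shot hpacucli call assumes the tool accepts the same command as arguments outside the interactive console.

```shell
#!/bin/sh
# Sketch only: list drives that appear under "unassigned" in "pd all show" output.

# unassigned_drives: reads the listing on stdin, prints one drive ID per line.
unassigned_drives() {
    awk '
        $1 == "array"      { in_unassigned = 0 }          # a new array section starts
        $1 == "unassigned" { in_unassigned = 1; next }    # the unassigned section starts
        in_unassigned && $1 == "physicaldrive" { print $2 }
    '
}

# Assumed usage (one-shot form of the console command):
#   hpacucli ctrl slot=1 pd all show | unassigned_drives
```

Fed the first "pd all show" listing from the session above, it prints 2I:1:3, the drive that was then added as the spare.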