
Exploring DRBD: Notes for a Newbie

For your reference, and mine!

Over the past month I've been converting an existing pair of NFS/ext4 file servers (live and hot spare, with regular rsync-based synchronization) to a two-node DRBD high-availability cluster using protocol C. It's been a learning experience, to say the least. And while the DRBD documentation is excellent, there were some concepts and tasks that did not immediately sink in. The notes below are for my future reference, and for anyone else who is new to DRBD.

  • Making a DRBD resource mount on boot is not trivial. I've left my fstab entry with noauto in the options. Why? The thought of a node rebooting, becoming stale, and then automatically setting itself to primary scares me!

  • A protocol C DRBD resource in the secondary role is not mountable. You'll get errors like:
    root@umhlanga:/home/stu# mount /mnt/data
    mount: block device /dev/drbd1 is write-protected, mounting read-only
    mount: Wrong medium type
  • If you want a DRBD resource to come up from a cold boot as primary, then setting up a cluster manager such as Heartbeat is a requirement. But you don't have to do this for a basic, first-step HA cluster.
  • Add the startup timeout options to the configuration file. If your kit is at a data center, without them you may end up making a site visit because of a hung boot: the boot will sit there waiting for the peer until the wait is over. My /etc/drbd.conf snippet:
    common {
      	protocol C;
    	startup {
    		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
    		wfc-timeout 600;
    		degr-wfc-timeout 600;
    		outdated-wfc-timeout 600;
    	}
    }
  • A fundamental task is querying the state of the DRBD resource(s). There are two ways to do this: reading /proc/drbd and running drbd-overview. Example:
    stu@umhlanga:~$ cat /proc/drbd 
    version: 8.3.7 (api:88/proto:86-91)
    GIT-hash: ... build by root@umhlanga, 2010-07-28 11:28:28
     1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
        ns:0 nr:0 dw:0 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9765272500
    stu@umhlanga:~$ 
    stu@umhlanga:~$ sudo drbd-overview
    [sudo] password for stu: 
      1:r0  WFConnection Primary/Unknown UpToDate/DUnknown C r---- /mnt/data ext4 9.0T 1.5T 7.5T 17% 
    stu@umhlanga:~$ 
    
    

    This output contains lots of important information. The details of each method and its output can be found on the DRBD documentation page 'Checking DRBD Status'.

  • The high level steps required to bring up a DRBD resource:
    1. kernel starts
    2. network starts
    3. DRBD starts
    4. bring up DRBD resource
    5. make primary (if it is the primary)
    6. mount resource
    And presto! The DRBD resource is ready for use.

  • If, after creating a resource and rebooting, you get the message below from drbdadm up resource-name, don't follow the instructions in the error message:
    drbdadm up r0 
    1: Failure: (124) Device is attached to a disk (use detach first)
    Command 'drbdsetup 1 disk /dev/sdb /dev/sdb internal --set-defaults 
        --create-device' terminated with exit code 10
    
  • If you, like me, are exporting the mounted file system with NFS, then be careful with the auto-mount configuration in /etc/fstab.
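
The status check and the last two bring-up steps from the notes above (make primary, mount) can be tied together so that a stale node never promotes itself. What follows is only a sketch under assumptions that are mine, not the DRBD documentation's: the function names are hypothetical, the input is one device line in the DRBD 8.3 /proc/drbd format shown earlier, and the "safe" policy (connected, local disk UpToDate, peer not already primary) is just one conservative choice. The plan function prints the commands rather than running them.

```shell
# Sketch: gate "make primary" and "mount" on the state reported in /proc/drbd.
# The policy (Connected + local UpToDate + peer not Primary) is an assumption.

# Print the value of a "key:value" field (cs, ro or ds) from one device line.
drbd_field() {
    key=$1
    set -f                      # no globbing while the line is split into words
    for word in $2; do
        case $word in
            "$key":*) set +f; printf '%s\n' "${word#*:}"; return 0 ;;
        esac
    done
    set +f
    return 1
}

# Succeed only when promoting this node looks safe under the assumed policy.
drbd_safe_to_promote() {
    cs=$(drbd_field cs "$1") || return 1
    ro=$(drbd_field ro "$1") || return 1
    ds=$(drbd_field ds "$1") || return 1
    [ "$cs" = "Connected" ] || return 1        # peer must be reachable
    [ "${ds%%/*}" = "UpToDate" ] || return 1   # local disk must be current
    [ "${ro#*/}" != "Primary" ]                # peer must not already be primary
}

# Print (not run) the remaining bring-up steps for a resource and mount point.
drbd_bringup_plan() {
    printf 'drbdadm primary %s\n' "$1"
    printf 'mount %s\n' "$2"
}

# Possible use on a node, with r0 and /mnt/data as in the notes above:
#   line=$(grep ' cs:' /proc/drbd | head -n 1)
#   drbd_safe_to_promote "$line" && drbd_bringup_plan r0 /mnt/data
```

With the WFConnection line from the /proc/drbd example above, the check fails and nothing is printed, which is the behaviour I would want from a node that has just rebooted and cannot see its peer.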

Spooky mcelog messages: MCA: Generic CACHE Level-2 Generic Error

New error message haunting my fresh Debian Lenny installation

In mid-June I installed Debian Lenny onto an old HP DL380 of ours. Since then there have been six of these Generic CACHE Level-2 Generic Error messages.

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 6 BANK 3 TSC 1e600ed894af4
ADDR 1719edeb0 
MCG status:
MCi status:
Error enabled
MCi_ADDR register valid
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000420001010a MCGSTATUS 0

Shortly after installing Debian, I also dropped in a second quad core CPU and doubled the RAM to 8GB. Coincidence? Probably not. Then again, I don't have ready access to the old RHEL4 system logs. (They are with our old hosting vendor.)

What exactly does this mean? Is it a recoverable condition? Is it a warning sign of future problems to come?

Spooky.
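
One way to take some of the mystery out of the STATUS value in the log above is to pull its bits apart. The sketch below assumes the architectural MCi_STATUS layout from Intel's Software Developer's Manual (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, and the MCA error code in bits 15..0) applies to this CPU. The function name is hypothetical, and only the cache-hierarchy error code form is decoded.

```shell
# Sketch: decode the architectural bits of an MCi_STATUS value from mcelog.
# Takes the 16-digit hex value without a 0x prefix. Bit positions follow the
# layout assumed in the lead-in above.
decode_mci_status() {
    hi_hex=$(printf '%s' "$1" | cut -c1-8)     # bits 63..32
    lo_hex=$(printf '%s' "$1" | cut -c9-16)    # bits 31..0
    hi=$((0x$hi_hex))
    lo=$((0x$lo_hex))

    # Status bit n of the full register is bit n-32 of the high half.
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( ($hi >> 31) & 1 )) $(( ($hi >> 30) & 1 )) $(( ($hi >> 29) & 1 )) \
        $(( ($hi >> 28) & 1 )) $(( ($hi >> 27) & 1 )) $(( ($hi >> 26) & 1 )) \
        $(( ($hi >> 25) & 1 ))

    code=$(( $lo & 0xffff ))                   # MCA error code, bits 15..0
    # Cache hierarchy form: 000F 0001 RRRR TTLL (F is ignored; 256 = 0x0100).
    if [ $(( $code & 0xef00 )) -eq 256 ]; then
        case $(( $code & 3 )) in
            0) level=L0 ;; 1) level=L1 ;; 2) level=L2 ;; *) level=generic ;;
        esac
        case $(( ($code >> 2) & 3 )) in
            0) kind=instruction ;; 1) kind=data ;; 2) kind=generic ;; *) kind=reserved ;;
        esac
        printf 'cache error: level=%s type=%s request=%d\n' \
            "$level" "$kind" $(( ($code >> 4) & 15 ))
    else
        printf 'MCA error code 0x%04x (not a cache hierarchy code)\n' "$code"
    fi
}

decode_mci_status 942000420001010a
```

Under that layout, UC=0 would mean the hardware reported the error as corrected, and EN=1 and ADDRV=1 line up with the "Error enabled" and "MCi_ADDR register valid" lines that mcelog printed. Whether six corrected L2 errors are a warning sign is a separate question that the bits alone don't answer.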

Simple linux bandwidth monitoring with bwm-ng

Where has this tool been all my life?

These past few days I've been running some load tests on new servers, configurations and our Internet link. While we have a Cacti installation graphing everything via SNMP, the 5 minute polling interval means waiting for what seems like forever.

I just want some quick and clean bandwidth statistics!

After some years of needing a simple tool, finally bwm-ng has entered my life. Where have you been all these years, bwm-ng? I assume that bwm-ng is short for "bandwidth monitor, next generation."

bwm-ng in action

My only complaint is that when monitoring network interfaces, Ethernet bonds are treated as equal to the real interfaces rather than as the virtual, cumulative devices they are.

Other than that, this tool rocks. Some features that are most pleasing:

  • Displays different units: bps to Mbps, Bps to MBps
  • Network interfaces and hard drives
  • Simple, clean and easy to comprehend user interface
  • 100ms polling interval granularity
  • apt-get installation on Debian Lenny

#geekloveatfirstsight

Multiple bonds on Debian Lenny, and related No such device error

SIOCSIFADDR: No such device on Debian Lenny with NIC bonding

While setting up a few Debian Lenny machines this summer, I came across the error message SIOCSIFADDR: No such device a few times. All of these servers have NIC bonding configured, and a few of them have multiple Ethernet bonds. Here are a couple of potential causes for this error message:

  • The /etc/modprobe.d/arch/X86_64 file does not contain the bonded device name. For multiple bonded devices, the file must contain an alias entry for each. Here is an example for a system with two bonded devices named bond0 and bond1:
    
    alias bond0 bonding
    alias bond1 bonding
    
    
  • The bonding entry in /etc/modules (bonding mode=4 miimon=100 max_bonds=2) is configured for fewer bonds than the server has. Here max_bonds=2 is the maximum number of bonding devices the system will have. The default is 1. If your machine has more, then SIOCSIFADDR: No such device will appear for the devices that did not come up.
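
Both causes above come down to two files disagreeing about how many bonds exist, so a quick cross-check can save a reboot. This is a rough sketch with hypothetical function names. It assumes the alias lines and the bonding line look exactly like the examples above, and it takes the two file paths as arguments rather than hard-coding them.

```shell
# Sketch: compare the number of bond aliases with max_bonds.
# Assumes the line formats shown in the examples above.

# Print the max_bonds value from an /etc/modules-style file (default 1).
max_bonds_from() {
    n=$(sed -n 's/^bonding .*max_bonds=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    printf '%s\n' "${n:-1}"
}

# Count "alias bondN bonding" lines in a modprobe alias file.
bond_aliases_in() {
    grep -c '^alias bond[0-9][0-9]* bonding' "$1" || true
}

# Report whether the alias count fits within max_bonds.
# Arguments: alias file, modules file. Returns 1 on a mismatch.
check_bonds() {
    aliases=$(bond_aliases_in "$1")
    max=$(max_bonds_from "$2")
    if [ "$aliases" -gt "$max" ]; then
        printf 'mismatch: %s aliases but max_bonds=%s\n' "$aliases" "$max"
        return 1
    fi
    printf 'ok: %s aliases, max_bonds=%s\n' "$aliases" "$max"
}

# Possible use with the files named above:
#   check_bonds /etc/modprobe.d/arch/X86_64 /etc/modules
```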

Adding a hot spare to a P400 controller in an HP DL380 at the command line in linux


Just so I don't forget. Again.

berea:~# hpacucli 
HP Array Configuration Utility CLI 8.28-13.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.

=> ctrl all show

Smart Array P400 in Slot 1                (sn: P61620D9SUK034)

=> ctrl slot=1 show

Smart Array P400 in Slot 1
   Bus Interface: PCI
   Slot: 1
   Serial Number: P61620D9SUK034
   Cache Serial Number: PA82C0H9SUJL94
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Chassis Slot: 
   Hardware Revision: Rev D
   Firmware Version: 7.08
   Rebuild Priority: Medium
   Expand Priority: Medium
   Surface Scan Delay: 15 secs
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 100% Read / 0% Write
   Drive Write Cache: Disabled
   Total Cache Size: 256 MB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
   SATA NCQ Supported: True

=> ctrl slot=1 array all show

Smart Array P400 in Slot 1

   array A (SAS, Unused Space: 0 MB)
   array B (SAS, Unused Space: 0 MB)

=> ctrl slot=1 ld all show

Smart Array P400 in Slot 1

   array A

      logicaldrive 1 (68.3 GB, RAID 1, OK)

   array B

      logicaldrive 2 (136.7 GB, RAID 5, OK)

=> ctrl slot=1 pd all show

Smart Array P400 in Slot 1

   array A

      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 72 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 72 GB, OK)

   array B

      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 72 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 72 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 72 GB, OK)

   unassigned

      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 73.5 GB, OK)

=> ctrl slot=1 array a add spares=2I:1:3
=> ctrl slot=1 pd all show

Smart Array P400 in Slot 1

   array A

      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 72 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 72 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 73.5 GB, OK, spare)

   array B

      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 72 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 72 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 72 GB, OK)

=> 
=> exit
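
For next time, the "which drive is free?" step in the session above could be scripted. The helper below is a sketch with a hypothetical name. It only parses text in the layout that pd all show printed above, and the pipeline in the closing comment assumes hpacucli accepts a command as arguments instead of at the interactive prompt.

```shell
# Sketch: print the IDs of drives listed under "unassigned" in the output of
# "ctrl slot=N pd all show", read from standard input.
unassigned_drives() {
    awk '
        /^ *unassigned *$/   { in_unassigned = 1; next }   # section header
        /^ *array [A-Z]+ *$/ { in_unassigned = 0; next }   # an array section begins
        in_unassigned && $1 == "physicaldrive" { print $2 }
    '
}

# Possible use (assumes non-interactive invocation works on your version):
#   hpacucli ctrl slot=1 pd all show | unassigned_drives
```

Run against the first pd all show listing above, this prints 2I:1:3, the drive that was then added as the spare.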