DRBD, iSCSI, and Linux clustering == cheap SAN solution posted Sat, 13 Oct 2012 02:04:21 UTC

As promised, here are my notes for building a homemade, pennies-on-the-dollar SAN solution, on the off chance you’ve been recently eyeballing one of those ludicrously expensive commercial offerings and you’ve come to the conclusion that yes, they are in fact ludicrously expensive. While I’m normally a Debian user personally, these notes will be geared towards Red Hat based distributions since that’s what I have the (mis)fortune of using at work. But whatever. It should be easy enough to adapt to whichever distribution you choose. It’s also worth mentioning that I originally did almost this exact same configuration, but using a single DRBD resource and then running LVM on top of that DRBD device. Both approaches have their merits, but I prefer this method instead.

There are a couple of things to note with the following information. First, in all cases where we are creating resources inside of Pacemaker, we’re going to be specifying the operational parameters based on the advisory minimums which you can view by typing something like:

crm ra meta ocf:heartbeat:iSCSITarget

or whichever resource agent provider you wish to view. Also, for this particular instance, we will be running tgtd directly at boot time instead of managing the resource via the cluster stack. Since the example documentation from places like the DRBD manual is implementation agnostic and tgtd can run all the time on both nodes without causing any problems, we’ll just start the service at boot and assume that it’s always running. If we have problems with tgtd segfaulting for whatever reason, we will need to add a resource based on the lsb:tgtd resource agent which directly manages the starting and stopping of tgtd.
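
Should it come to that, a minimal sketch might look like this (a clone keeps tgtd running on both nodes to match the always-running assumption; the resource names here are my own):

crm configure primitive tgtd lsb:tgtd \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20" \
    op monitor interval="10" timeout="20"
crm configure clone cl-tgtd tgtd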

As a final preliminary, you will probably want to name the LVM volume group on each machine identically, as it will simplify the DRBD resource configuration below. Actually, I need to correct myself here: if you try to specify the disk as an inherited option, the device path becomes /dev/drbd/by-res/r1/0 instead of just /dev/drbd/by-res/r1. Since we’re not using volumes here, I prefer the latter syntax. But go ahead and name the volume group the same anyway just to make life easier.
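
For reference, the inherited form I’m talking about hoists the disk option out of the per-host sections, roughly like this sketch (device names match the resource files defined later in this post):

resource r1 {
	# inherited by both hosts; udev then creates /dev/drbd/by-res/r1/0
	device minor 1;
	disk /dev/vg0/vm-test;
	meta-disk internal;

	on pepper {
		address 172.16.165.10:7789;
	}

	on salt {
		address 172.16.165.11:7789;
	}
}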

Currently we’re using the Scientific Linux repositories for installing the necessary, up-to-date versions of all the various cluster related packages (we could also have used the RHEL DVD initially, but then we wouldn’t be getting any updates for these packages past the initial version available on the DVD). In order to use the SL repositories, we will install the yum-plugin-priorities package so that the official RHEL repositories take precedence over the SL repositories.

yum install yum-plugin-priorities

Once that is installed, you really only need to configure the RHN repositories to change from the default priority of 99 to a much higher priority of 20 (with yum-plugin-priorities, lower numbers win; 20 is an arbitrary choice that leaves room for even higher priorities if necessary). So /etc/yum/pluginconf.d/rhnplugin.conf should now look something like:

[main]
enabled = 1
gpgcheck = 1

# You can specify options per channel, e.g.:
#
#[rhel-i386-server-5]
#enabled = 1
#
#[some-unsigned-custom-channel]
#gpgcheck = 0

priority=20

[rpmforge-el6-x86_64]
exclude=nagios*
priority=99

Once that is configured, we can add the actual SL repository by doing cat > /etc/yum.repos.d/sl.repo:

[scientific-linux]
name=Scientific Linux - $releasever
#baseurl=http://mirror3.cs.wisc.edu/pub/mirrors/linux/scientificlinux.org/$releasever/$ARCH/SL/
baseurl=http://ftp.scientificlinux.org/linux/scientific/6/$basearch/os/
# baseurl=http://centos.alt.ru/repository/centos/5/$basearch/
enabled=1
gpgcheck=0
#includepkgs=*xfs* cluster-cim cluster-glue cluster-glue-libs clusterlib cluster-snmp cman cmirror corosync corosynclib ctdb dlm-pcmk fence-agents fence-virt gfs-pcmk httpd httpd-tools ipvsadm luci lvm2-cluster modcluster openais openaislib pacemaker pacemaker-libs pexpect piranha python-repoze-who-friendlyform resource-agents rgmanager ricci tdb-tools
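
With both files in place, a quick sanity check should show the RHN channels and the scientific-linux repository side by side:

yum repolist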

For DRBD, we will want to use the ELRepo repository. You can find the instructions for installing this repository on the ELRepo website. We will be using v8.4 (as of this writing) of DRBD.

Now that everything is configured correctly, we can start installing the necessary packages:

yum install corosync pacemaker fence-agents kmod-drbd84 scsi-target-utils

For now (RHEL 6.x; 2.6.32), we’ll be using the older STGT iSCSI target as LIO wasn’t included in the Linux kernel until 2.6.38. Newer versions of Red Hat or Linux in general will probably require updated instructions here and below.

The instructions for configuring the cluster itself can generally be found in the upstream Pacemaker documentation. I will include the necessary pieces here just in case that is unavailable for whatever reason.

You need to run:

corosync-keygen
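
corosync-keygen writes the generated key to /etc/corosync/authkey, and both nodes need an identical copy, so push it to the peer while preserving its restrictive permissions:

scp -p /etc/corosync/authkey pepper:/etc/corosync/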

Next you need to do cat > /etc/corosync/service.d/pcmk:

service {
   ver: 1
   name: pacemaker
}

And then you need cat > /etc/corosync/corosync.conf (appropriately configured):

compatibility: whitetank

amf {
  mode: disabled
}

logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  logfile: /var/log/cluster/corosync.log
  debug: on
  tags: enter|leave|trace1|trace2|trace3
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: on
  }
}

totem {
  version: 2
  token: 5000
  token_retransmits_before_loss_const: 20
  join: 1000
  consensus: 7500
  vsftype: none
  max_messages: 20
  secauth: off
  threads: 0
  rrp_mode: passive

  interface {
    ringnumber: 0
    bindnetaddr: 172.16.165.0
    broadcast: yes
    mcastport: 5405
    ttl: 1
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.0.0.0
    broadcast: yes
    mcastport: 5405
    ttl: 1
  }
}

aisexec {
  user: root
  group: root
}

corosync {
  user: root
  group: root
}

Note that the above configuration assumes that you have a second interface directly connected between the two machines. That should already be configured, but it should look something like this in /etc/sysconfig/network-scripts/ifcfg-bond1:

DEVICE=bond1
NM_CONTROLLED=yes
ONBOOT=yes
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
IPADDR=10.0.0.134
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 updelay=200 downdelay=200 mode=4"

and ifcfg-eth4 (plus at least one more slave device when bonding) like the following. Note that NetworkManager on RHEL 6 does not support bonded interfaces, so you may need NM_CONTROLLED=no on the bond and its slaves:

DEVICE=eth4
HWADDR=00:10:18:9e:0f:00
NM_CONTROLLED=yes
ONBOOT=yes
MASTER=bond1
SLAVE=yes

To make sure the two cluster machines can see each other completely, make sure to modify /etc/sysconfig/iptables to include something like:

:iscsi-initiators - [0:0]

-A INPUT -i bond1 -j ACCEPT

-A INPUT -m comment --comment "accept anything from cluster nodes"
-A INPUT -s 172.16.165.10,172.16.165.11 -m state --state NEW -j ACCEPT

-A INPUT -m comment --comment "accept iSCSI"
-A INPUT -p tcp --dport 3260 -m state --state NEW -j iscsi-initiators

-A iscsi-initiators -m comment --comment "only accept iSCSI from these hosts"
-A iscsi-initiators -s 172.16.165.18,172.16.165.19,172.16.165.20,172.16.165.21 -j ACCEPT
-A iscsi-initiators -j RETURN
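
After saving the file, restart the firewall on both nodes so the new rules take effect:

service iptables restart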

And to make the cluster configuration simpler, we want to use the shortened host name for each machine. Modify /etc/sysconfig/network to look something like this:

NETWORKING=yes
HOSTNAME=salt

and modify /etc/hosts to make sure both cluster nodes always know where to find the other by name:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.16.165.11 salt.bitgnome.net salt
172.16.165.10 pepper.bitgnome.net pepper

If you don’t want to reboot, use hostname to force the changes now:

hostname salt

At this point, configure everything to start automatically and start the services:

chkconfig corosync on
chkconfig pacemaker on
chkconfig tgtd on
service corosync start
service pacemaker start
service tgtd start

You should now have a running cluster which you can check the status of (from either node) using:

corosync-cfgtool -s
crm_mon -1

For the rest of this configuration, any commands which somehow modify the cluster configuration can most likely be run from either cluster node.

A very useful command to wipe the entire cluster configuration and start over (except for the nodes themselves) is:

crm configure erase

If you end up with ORPHANED resources after doing the above, you might also need to do something like:

crm resource cleanup resource-name

where resource-name is of course the name of the resource showing as ORPHANED. It is worth mentioning though that this will most likely not stop or remove the actual resource being referenced here. It will just remove it from the cluster’s awareness. If you had a virtual IP address resource here for example, that IP would most likely still be configured and up on the node which was last assigned that resource. It might be worth rebooting any cluster nodes after clearing the configuration to guarantee everything has been cleared out as thoroughly as possible short of deleting directories on the file system entirely.

You might also need to look at something like:

crm_resource -L
crm_resource -C -r r0

to get the last lingering pieces left on the LRM side of things.

You can verify the clean configuration with both:

crm configure show
cibadmin -Q

making sure that there are no LRM resources left in the cibadmin output also.

Anyway, moving right along… A two-node cluster cannot maintain quorum once one node fails, so we must tell Pacemaker to keep running without it:

crm configure property no-quorum-policy=ignore

While initially configuring the cluster, resources will not be started unless you disable STONITH. You can either issue the following:

crm configure property stonith-enabled=false

or you can go ahead and set up STONITH correctly. To do so, you need to create fencing primitives for every node in the cluster. The parameters for each primitive will come from the IPMI LAN configuration for the DRAC, BMC, iLO, or whatever other type of dedicated management card is installed in each node. To see the different possible fencing agents and their parameters, do:

stonith_admin --list-installed
stonith_admin --metadata --agent fence_ipmilan

We’re going to use the generic IPMI LAN agent for our Dell DRACs even though there are dedicated DRAC agents, because IPMI is simply easier and you don’t have to do anything special like you do with the DRAC agents (and it can vary from one DRAC version to the next). We also need a second location command to keep each fencing primitive off of the very node it fences:

crm configure primitive fence-salt stonith:fence_ipmilan \
    params ipaddr="172.16.74.153" \
    passwd="abcd1234" \
    login="laitsadmin" \
    verbose="true" \
    pcmk_host_list="salt" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20"
crm configure location salt-fencing fence-salt -inf: salt
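
And the same again for the other node (the management card IP here is hypothetical):

crm configure primitive fence-pepper stonith:fence_ipmilan \
    params ipaddr="172.16.74.152" \
    passwd="abcd1234" \
    login="laitsadmin" \
    verbose="true" \
    pcmk_host_list="pepper" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20"
crm configure location pepper-fencing fence-pepper -inf: pepper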

Once you’ve done this for each cluster node, you can test it by first shutting down the cluster on one of the nodes (and whatever else you might want to do: file system sync, read-only mounts, whatever you feel safest doing since you’re about to yank the power plug essentially) and then shooting it in the head:

service pacemaker stop
service corosync stop
(sync && mount -o remount,ro / && etc.)
stonith_admin --fence salt

You will probably want to run this test against each node just to confirm that the IPMI configuration is correct everywhere.

Next we want to alter the behavior of Pacemaker a bit by configuring a basic property known as resource stickiness. Out of the box, when a previously active node rejoins the cluster, Pacemaker will automatically migrate all the resources back to it, demoting the node that took over in the meantime. This is not really something we need for our set of resources, so we want to inform Pacemaker to leave resources where they are unless we manually move them ourselves or the active node fails:

crm configure property default-resource-stickiness=1

To set up a resource for a shared IP address, do the following:

crm configure primitive ip ocf:heartbeat:IPaddr2 \
    params ip="172.16.165.12" \
    cidr_netmask="25" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20" \
    op monitor interval="10" timeout="20"

Next we need to set up our iSCSI target (note the escaped quotes to prevent bad shell/CRM interaction):

crm configure primitive tgt ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2012-10.net.bitgnome:vh-storage" \
    tid="1" \
    allowed_initiators=\"172.16.165.18 172.16.165.19 172.16.165.20 172.16.165.21\" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10"
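
With the target resource running, you can sanity check it on the active node using tgtadm from scsi-target-utils:

tgtadm --lld iscsi --mode target --op show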

Now before defining our iSCSI logical units, let’s check our DRBD configuration. The standard DRBD configuration in /etc/drbd.conf should look like:

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

Configuring the basic options in /etc/drbd.d/global_common.conf should look like:

global {
	usage-count no;
	# minor-count should be larger than the number of active resources
	# depending on your distro, larger values might not work as expected
	minor-count 100;
}

common {
	handlers {
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
	}

	startup {
	}

	disk {
		resync-rate 100M;
		on-io-error detach;
		fencing resource-only;
	}

	net {
		protocol C;
		cram-hmac-alg sha1;
		shared-secret "something_secret";
		# the following is not recommended in production because of CPU costs
		#data-integrity-alg sha1;
		verify-alg sha1;
	}
}

And finally, you need a resource file for each resource. Before we get to that file, we need to create the logical volumes on each node which will ultimately hold this new DRBD resource. To do that, we need to issue something like the following:

lvcreate -L 20G -n vm-test vg0

Use the same name (vm-test in this example) on the other node as well (with hopefully the same volume group name to make the next part easier). Now that we have the logical volume created, we can go ahead and create an appropriate resource file for DRBD. Start at r1 and increase the resource number by one to keep things simple and to match our LUN numbering later on, so the file will be /etc/drbd.d/r1.res:

resource r1 {
	disk {
		#resync-after r1;
	}

	# inheritable parameters
	device minor 1;
	meta-disk internal;

	on pepper {
		disk /dev/vg0/vm-test;
		address 172.16.165.10:7789;
	}

	on salt {
		disk /dev/vg0/vm-test;
		address 172.16.165.11:7789;
	}
}

In each resource file after the first, you will need to uncomment the resync-after option and make the parameter refer to the previous sequential resource still in existence. This also means that if you remove a resource later, you will need to update the remaining resource files to reflect the change. If you fail to make the changes, affected resources will fail to start and consequently the entire cluster stack will be down. This is a BAD situation. So, make the necessary changes as you remove old resources, and then issue the following on both nodes:

drbdadm adjust r2

or whatever the resource name that has been affected by a dangling reference to an old, recently removed resource.

Related to the sanity of the configuration files in general is the fact that even if you haven’t created or activated a resource in any way yet using drbdadm, the very presence of r?.res files in /etc/drbd.d can cause the cluster stack to stop working. The monitors that the cluster stack employs to check the health of DRBD in general require a 100% sane configuration at all times, including any and all files which might end in .res. This means that if you are creating new resources by copying the existing resource files, you need to either copy them to a name that doesn’t end in .res initially and then move them into place with the appropriately numbered resource name, or copy them to some other location first, and then move them back into place.

Also relevant: when setting up new resources on a running, production stack, you will momentarily be forcing one of the two cluster nodes to be the primary (as seen a few steps below) in order to get the DRBD resource into a consistent state. When you do this, both nodes will start generating Nagios alerts because of the inconsistent state of the newly added resource. You’ll probably want to disable notifications until your new resources are in a consistent state again.

Under Red Hat Enterprise Linux, you will want to verify the drbd service is NOT set to run automatically, but go ahead and load the module if it hasn’t been already so that we can play around with DRBD:

chkconfig drbd off
modprobe drbd

The reason for not starting DRBD at boot is that the OCF resource agent in the cluster will handle this for us.

And then on each node, you need to issue the following commands to initialize and activate the resource:

drbdadm create-md r1
drbdadm up r1

At this point, you should be able to see something like:

cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by phil@Build64R6, 2012-04-17 11:28:08

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844

And finally, you need to tell DRBD which node is considered the primary. Since neither node’s logical volume should have had anything useful on it when we started this process, go with the node where resources are currently active (check the output of crm_mon to find the node currently hosting the storage virtual IP address) so that we can add the resource to the cluster stack immediately. If you instead make the other node the primary for this newly defined resource and add the resource to the cluster stack before it is consistent, you will bring down the entire cluster stack until both nodes are consistent. On that node only, run:

drbdadm primary --force r1
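
The initial full sync kicks off at this point; you can watch its progress with something like:

watch -n1 cat /proc/drbd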

As an example, let’s go ahead and create a second DRBD resource. The configuration in /etc/drbd.d/r2.res will look like:

resource r2 {
	disk {
		resync-after r1;
	}

	# inheritable parameters
	device minor 2;
	meta-disk internal;

	on pepper {
		disk /dev/vg0/vm-test2;
		address 172.16.165.10:7790;
	}

	on salt {
		disk /dev/vg0/vm-test2;
		address 172.16.165.11:7790;
	}
}

The most notable differences here are the resource name change itself, the device minor number bump, and the port number bump. All of those need to increment for each additional resource, along with the resync-after directive.
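
Bringing the new resource online repeats the earlier steps, roughly:

lvcreate -L 20G -n vm-test2 vg0    # on both nodes
drbdadm create-md r2               # on both nodes
drbdadm up r2                      # on both nodes
drbdadm primary --force r2         # on the current active node only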

So, now we have some DRBD resources. Let’s set up the cluster to be aware of them. For each DRBD resource we add to the cluster, we need to define two separate cluster resources, a basic primitive resource and a master-slave resource. The catch is, they both must be defined at the same time and in the correct order. To accomplish this, do the following:

cat << EOF | crm -f -
cib new tmp
cib use tmp
configure primitive r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op start interval="0" timeout="240" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval=20 timeout=20 role="Slave" \
        op monitor interval=10 timeout=20 role="Master"
configure ms ms-r1 r1 \
        meta master-max="1" \
        master-node-max="1" \
        clone-max="2" \
        clone-node-max="1" \
        notify="true"
cib commit tmp
cib use live
cib delete tmp
EOF

And now we can define iSCSI LUN targets for each of our DRBD resources. (The additional_parameters string below passes a SCSI mode page through to tgtd; page 8 is the caching mode page.) That looks like:

crm configure primitive lun1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2012-10.net.bitgnome:vh-storage" \
    lun="1" \
    path="/dev/drbd/by-res/r1" \
    additional_parameters="mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10"
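
From one of the allowed initiators, you should now be able to discover the target through the virtual IP using open-iscsi:

iscsiadm -m discovery -t sendtargets -p 172.16.165.12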

Lastly, we need to tie all of the above together into the proper order and make sure all the resources end up in the same place via colocation. Since these all go together logically, I’ll use the same construct as above when adding DRBD resources to add all of these constraints at the same time (this is coming from a configuration with 3 DRBD resources and LUNs defined):

cat << EOF | crm -f -
cib new tmp
cib use tmp
configure colocation ip-with-lun1 inf: ip lun1
configure colocation ip-with-lun2 inf: ip lun2
configure colocation ip-with-lun3 inf: ip lun3
configure colocation lun-with-r1 inf: lun1 ms-r1
configure colocation lun-with-r2 inf: lun2 ms-r2
configure colocation lun-with-r3 inf: lun3 ms-r3
configure colocation r1-with-tgt inf: ms-r1:Master tgt:Started
configure colocation r2-with-tgt inf: ms-r2:Master tgt:Started
configure colocation r3-with-tgt inf: ms-r3:Master tgt:Started
configure order lun1-before-ip inf: lun1 ip
configure order lun2-before-ip inf: lun2 ip
configure order lun3-before-ip inf: lun3 ip
configure order r1-before-lun inf: ms-r1:promote lun1:start
configure order r2-before-lun inf: ms-r2:promote lun2:start
configure order r3-before-lun inf: ms-r3:promote lun3:start
configure order tgt-before-r1 inf: tgt ms-r1
configure order tgt-before-r2 inf: tgt ms-r2
configure order tgt-before-r3 inf: tgt ms-r3
cib commit tmp
cib use live
cib delete tmp
EOF
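
With the constraints in place, a quick failover test is to put the active node in standby, watch everything migrate to the peer with crm_mon, and then bring the node back (resources will stay put thanks to the stickiness configured earlier):

crm node standby salt
crm_mon -1
crm node online salt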

That’s pretty much it! I highly recommend using crm_mon -r to view the health of your stack. Or if you prefer the graphical version, go grab a copy of LCMC.