current version of my qemu script posted Sat, 29 Aug 2015 09:37:23 CDT

Since I keep posting it in other places but have yet to post it here, I'm including a copy of my current shell script to start qemu:

#!/bin/zsh

keyboard_id="04d9:0169"
mouse_id="046d:c24a"

keyboard=$(lsusb | grep "${keyboard_id}" | cut -d ' ' -f 2,4 | grep -Eo '[[:digit:]]+' | sed -e 's/^0*//' | xargs -n 2 | sed -e 's/ /./')
mouse=$(lsusb | grep "${mouse_id}" | cut -d ' ' -f 2,4 | grep -Eo '[[:digit:]]+' | sed -e 's/^0*//' | xargs -n 2 | sed -e 's/ /./')

if [[ -z "${keyboard}" || -z "${mouse}" ]]; then
        echo "keyboard (${keyboard}) or mouse (${mouse}) cannot be found; exiting"
        exit 1
fi

for i in {4..7}; do
        echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
        #cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

taskset -ac 4-7 qemu-system-x86_64 \
        -qmp unix:/run/qmp-sock,server,nowait \
        -display none \
        -enable-kvm \
        -M q35,accel=kvm \
        -m 8192 \
        -cpu host,kvm=off \
        -smp 4,sockets=1,cores=4,threads=1 \
        -mem-path /dev/hugepages \
        -rtc base=localtime,driftfix=slew \
        -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root \
        -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on -vga none \
        -device vfio-pci,host=02:00.1,bus=root,addr=00.1 \
        -usb -usbdevice host:${keyboard} -usbdevice host:${mouse} \
        -device virtio-scsi-pci,id=scsi \
        -drive if=none,file=/dev/win/cdrive,format=raw,cache=none,id=win-c -device scsi-hd,drive=win-c \
        -drive if=none,format=raw,file=/dev/sr0,id=blu-ray -device scsi-block,drive=blu-ray \
        -device virtio-net-pci,netdev=net0 -netdev bridge,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper &

sleep 5

#cpuid=0
cpuid=4
for threadpid in $(echo 'query-cpus' | qmp-shell /run/qmp-sock | grep '^(QEMU) {"return":' | sed -e 's/^(QEMU) //' | jq -r '.return[].thread_id'); do
        taskset -p -c ${cpuid} ${threadpid}
        ((cpuid+=1))
done

wait

for i in {4..7}; do
        echo ondemand > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
        #cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

The only real change was to automatically search for the keyboard and mouse I want to pass through in case they get unplugged and end up at a different bus address.
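For illustration, here's what that pipeline does to a single lsusb line (the line below is a made-up sample, not output from my machine):

```shell
# Hypothetical lsusb line matching the keyboard's vendor:product ID.
line='Bus 003 Device 010: ID 04d9:0169 Holtek Semiconductor, Inc.'
# Same pipeline as the script: fields 2 and 4 are the bus and device
# numbers; keep only the digits, strip leading zeros, then join the
# two values with a dot.
addr=$(echo "${line}" | cut -d ' ' -f 2,4 | grep -Eo '[[:digit:]]+' | sed -e 's/^0*//' | xargs -n 2 | sed -e 's/ /./')
echo "${addr}"   # prints 3.10
```

That bus.device form is exactly what -usbdevice host: expects.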

even more QEMU-KVM news posted Fri, 28 Aug 2015 14:23:09 CDT

It seems like a recent Debian kernel change may have moved vfio_iommu_type1 from being statically compiled into the kernel to being a module. This meant I was getting the following when trying to start up qemu:


qemu-system-x86_64: -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on: vfio: No available IOMMU models
qemu-system-x86_64: -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on: vfio: failed to setup container for group 18
qemu-system-x86_64: -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on: vfio: failed to get group 18
qemu-system-x86_64: -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on: Device initialization failed
qemu-system-x86_64: -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on: Device 'vfio-pci' could not be initialized

The esteemed Alex Williamson was quick to reply that this was due to a missing kernel module, vfio_iommu_type1 to be exact. So, add that into /etc/modules and go ahead and modprobe it to avoid a reboot, and you should be good to go.
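In other words, the fix boils down to something like this (run as root; just a sketch of what's described above):

```shell
echo vfio_iommu_type1 >> /etc/modules   # load the module at every boot
modprobe vfio_iommu_type1               # load it now and skip the reboot
```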

more QEMU-KVM news posted Sat, 01 Aug 2015 05:52:04 CDT

I'm still running a virtualized Windows environment. I just upgraded to Windows 10 Pro using the 2012R2 virtio drivers since native drivers don't seem to exist yet. All of that is working well.

I ended up skipping the nonsense with irqbalance and simply let it run on every processor. It doesn't seem to make a huge amount of difference either way. I'm still running with the performance CPU frequency governor as I end up with too much jitter in video and audio playback otherwise. I wonder if running on an Intel processor would be better in this particular area?

I stopped passing through my USB ports directly as I was getting a lot of AMD-Vi error messages from the kernel. Everything was still working, but it was aggravating to have my dmesg full of garbage. So now my /etc/modprobe.d/local.conf looks like:


install vfio_pci /sbin/modprobe --first-time --ignore-install vfio_pci ; \
        /bin/echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind ; \
        /bin/echo 10de 1189 > /sys/bus/pci/drivers/vfio-pci/new_id ; \
        /bin/echo 0000:02:00.1 > /sys/bus/pci/devices/0000:02:00.1/driver/unbind ; \
        /bin/echo 10de 0e0a > /sys/bus/pci/drivers/vfio-pci/new_id
options kvm-amd npt=0

And I've updated my qemu-system command accordingly:

taskset -ac 4-7 qemu-system-x86_64 \
        -qmp unix:/run/qmp-sock,server,nowait \
        -display none \
        -enable-kvm \
        -M q35,accel=kvm \
        -m 8192 \
        -cpu host,kvm=off \
        -smp 4,sockets=1,cores=4,threads=1 \
        -mem-path /dev/hugepages \
        -rtc base=localtime,driftfix=slew \
        -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root \
        -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on -vga none \
        -device vfio-pci,host=02:00.1,bus=root,addr=00.1 \
        -usb -usbdevice host:10.4 -usbdevice host:10.5 \
        -device virtio-scsi-pci,id=scsi \
        -drive if=none,file=/dev/win/cdrive,format=raw,cache=none,id=win-c -device scsi-hd,drive=win-c \
        -drive if=none,format=raw,file=/dev/sr0,id=blu-ray -device scsi-block,drive=blu-ray \
        -device virtio-net-pci,netdev=net0 -netdev bridge,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper &

I'm using the host:bus.addr format for usbdevice since otherwise I'd be passing through every USB device that matched the vendor_id:product_id format. I also get back my USB3 ports under Linux, should I ever really need them (and I can always pass them through to Windows later using this same functionality instead of dealing with the vfio-pci stuff).
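For reference, the two selector forms the legacy -usbdevice option accepts look like this (the first values are from my own setup above):

```
-usbdevice host:10.4        # by bus.device address (what I'm using now)
-usbdevice host:04d9:0169   # by vendor:product ID (matches every identical device)
```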

I also upgraded my host GPU to a passively cooled GeForce 730 as the 8400 was causing weirdness with my receiver constantly trying to detect audio over the DVI to HDMI converter. This kept interrupting the S/PDIF audio I had coming in from the motherboard. Now everything comes over a proper HDMI connection. However, I was disappointed to discover that apparently there hasn't been a lot of progress made on passing through lossless, high quality audio formats like TrueHD or DTS-HD MA under Linux. mplayer, mpv, and vlc all seemed to be a bust in this regard, and Kodi (formerly XBMC) just crashed my machine due to an unrelated nouveau bug, so I didn't get to test it any further. I can get normal DTS/AC-3 audio working over HDMI just fine, but not the fancy formats. I guess I'll stick to Windows for playing those back even though everything is stored on my Linux machine. It would have been nice to get that working directly from Linux.

QEMU, KVM, and GPU passthrough on Debian testing posted Wed, 15 Jul 2015 09:55:03 CDT

I decided to take the plunge and try to run everything on one machine. I gutted both of my existing machines and bought a few extra parts. The final configuration ended up using an AMD FX-8350 on an ASRock 970 Extreme4 motherboard with 32GB of RAM in a Fractal Design R5 case. I've got a GeForce 8400 acting as the display under Linux and a GeForce 670 GTX being passed through to Windows.

I am using the following extra arguments on my kernel command line:


pci-stub.ids=10de:1189,10de:0e0a rd.driver.pre=pci-stub isolcpus=4-7 nohz=off
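Those vendor:device pairs come from lspci -nn output. As a self-contained illustration (the line below is a sample, not live output from this box):

```shell
# A sample lspci -nn line for the passthrough GPU.
line='02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK104 [GeForce GTX 670] [10de:1189] (rev a1)'
# Pull out the [vendor:device] pair; the class code [0300] has no
# colon, so it doesn't match the pattern.
ids=$(echo "${line}" | grep -Eo '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]')
echo "${ids}"   # prints 10de:1189
```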

The identifiers I'm specifying are for the GPU and HDMI audio on my GeForce 670 so the nouveau driver doesn't latch onto the card. To further prevent that, the rd.driver.pre statement should load the pci-stub driver as early as possible during the boot process (it's worth noting I'm using dracut). And finally, isolcpus blocks off those four cores so Linux won't schedule any processes on them. Along the same line of thinking, I tried to add the following to /etc/default/irqbalance:


IRQBALANCE_BANNED_CPUS=000000f0

but realized the current init.d script that systemd is using to start irqbalance won't ever pass along that environment variable correctly, so for now, I'm starting irqbalance by hand after boot.
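Starting it by hand looks something like this (run as root; the mask 0xf0 bans cores 4-7 from receiving balanced IRQs):

```shell
IRQBALANCE_BANNED_CPUS=000000f0 irqbalance
```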

I added these modules to /etc/modules:


vfio
vfio_pci

I added these options to /etc/modprobe.d/local.conf (you might need to remove the continuation characters and make that all one line):


install vfio_pci /sbin/modprobe --first-time --ignore-install vfio_pci ; /bin/echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind ; \
        /bin/echo 10de 1189 > /sys/bus/pci/drivers/vfio-pci/new_id ; /bin/echo 0000:02:00.1 > /sys/bus/pci/devices/0000:02:00.1/driver/unbind ; \
        /bin/echo 10de 0e0a > /sys/bus/pci/drivers/vfio-pci/new_id ; /bin/echo 0000:05:00.0 > /sys/bus/pci/devices/0000:05:00.0/driver/unbind ; \
        /bin/echo 1b21 1042 > /sys/bus/pci/drivers/vfio-pci/new_id ; /bin/echo 0000:00:13.0 > /sys/bus/pci/devices/0000:00:13.0/driver/unbind ; \
        /bin/echo 0000:00:13.2 > /sys/bus/pci/devices/0000:00:13.2/driver/unbind ; /bin/echo 1002 4397 > /sys/bus/pci/drivers/vfio-pci/new_id ; \
        /bin/echo 1002 4396 > /sys/bus/pci/drivers/vfio-pci/new_id
options kvm-amd npt=0

So that looks like a mess, but it's fairly straightforward really. Since I can't easily pass my USB device identifiers through the kernel command line, I'm unbinding them individually and rebinding them to the vfio-pci driver. I'm passing through all the USB2 and USB3 controllers that run the ports on the front of my case. I'm also binding the GPU/audio device to vfio-pci here and specifying an option to KVM which is supposed to help performance on AMD machines. I set up some hugepage reservations and enabled IPv4 forwarding in /etc/sysctl.d/local.conf:


# Set hugetables / hugepages for KVM single guest needing 8GB RAM
vm.nr_hugepages = 4126

# forward traffic
net.ipv4.ip_forward = 1
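The 4126 figure is just the guest's memory divided by the hugepage size, plus a little headroom; a quick sanity check of the arithmetic (the roughly 30-page cushion for QEMU's own overhead is my reading, not something stated anywhere official):

```shell
guest_mb=8192       # matches -m 8192 on the qemu command line
hugepage_kb=2048    # default x86-64 hugepage size
pages=$(( guest_mb * 1024 / hugepage_kb ))
echo "${pages}"     # prints 4096; nr_hugepages=4126 leaves ~60MB spare
```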

Since bridging is my network usage model of choice, I needed to change /etc/network/interfaces:


auto lo br0
iface lo inet loopback

iface eth0 inet manual

iface br0 inet dhcp
        bridge_ports eth0
        bridge_stp off
        bridge_waitport 0
        bridge_fd 0

Putting everything I've discovered together, I've created a shell script. It includes all of the different things that need to happen, like setting the cpufreq governor and pinning the individual virtual CPU threads to their respective physical CPUs. I'm using zsh as that is my go-to shell for all things, but most anything should suffice. The script also depends on the presence of the qmp-shell script available here. You will want both the qmp-shell script itself and its dependent Python library, qmp.py. Once all of that is in place, here is the final script to start everything:

#!/bin/zsh

for i in {4..7}; do
        echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
        #cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

taskset -ac 4-7 qemu-system-x86_64 -qmp unix:/run/qmp-sock,server,nowait -display none -enable-kvm -M q35,accel=kvm -m 8192 -cpu host,kvm=off \
        -smp 4,sockets=1,cores=4,threads=1 -mem-path /dev/hugepages -rtc base=localtime -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root \
        -device vfio-pci,host=02:00.0,bus=root,addr=00.0,multifunction=on,x-vga=on -vga none -device vfio-pci,host=02:00.1,bus=root,addr=00.1 \
        -device vfio-pci,host=05:00.0 -device vfio-pci,host=00:13.0 -device vfio-pci,host=00:13.2 -device virtio-scsi-pci,id=scsi \
        -drive if=none,file=/dev/win/cdrive,format=raw,cache=none,id=win-c -device scsi-hd,drive=win-c -drive if=none,file=/dev/win/ddrive,format=raw,cache=none,id=win-d \
        -device scsi-hd,drive=win-d -drive if=none,format=raw,file=/dev/sr0,id=blu-ray -device scsi-block,drive=blu-ray -device virtio-net-pci,netdev=net0 \
        -netdev bridge,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper &

sleep 5

cpuid=4
for threadpid in $(echo 'query-cpus' | qmp-shell /run/qmp-sock | grep '^(QEMU) {"return":' | sed -e 's/^(QEMU) //' | jq -r '.return[].thread_id'); do
        taskset -p -c ${cpuid} ${threadpid}
        ((cpuid+=1))
done

wait

for i in {4..7}; do
        echo ondemand > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
        #cat /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

I force the CPU cores assigned to the VM to run at their maximum frequency for the duration of the guest, after which they scale back down into their normal on-demand mode. I found this smooths things out a little bit more and provides something approaching a physical machine experience, even though I'm using more power to get there. I'm also using qmp-shell to look up the PIDs of the vCPU threads and pin each of them to an individual pCPU.
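For anyone who hasn't used qmp-shell, the pinning loop is parsing a reply shaped roughly like the one below (the thread IDs here are made up for illustration):

```shell
# Illustrative qmp-shell reply to 'query-cpus'; real thread_ids vary.
reply='(QEMU) {"return": [{"CPU": 0, "thread_id": 1234}, {"CPU": 1, "thread_id": 1235}]}'
# Same extraction as the script: keep the reply line, drop the
# "(QEMU) " prefix, and let jq pull out each vCPU's host thread ID.
tids=$(echo "${reply}" | grep '^(QEMU) {"return":' | sed -e 's/^(QEMU) //' | jq -r '.return[].thread_id')
echo ${tids}    # prints 1234 1235
```

Each of those thread IDs is a regular host PID, which is why plain taskset works on them.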

I ended up using the q35 virtual machine layout instead of the default. I'm not positive this matters, but I did end up adding the ioh3420 device later in my testing and it really did seem to improve performance a little bit more. Whether that requires q35, I'm not certain. And anyway, once the devices were detected and running under Windows after I first moved from physical to virtual, it wasn't worth it to me to switch back to the default machine type. I'm also using the legacy SeaBIOS instead of OVMF since I was migrating from physical to virtual and it was too much trouble trying to make UEFI firmware work after the fact.

Initially I wasn't using virtio based hardware, so you'll possibly need to change that to get up and running and then add in the virtual devices and load the proper virtio drivers. I did run into some weirdness here for a long time where Windows 7 kept crashing trying to install the drivers for either virtio-blk-pci or virtio-scsi-pci. I was using the current testing kernel (linux-image-3.16.0-4-amd64) and never really found a solution. I did end up installing a clean copy of Windows and was able to install the virtio stuff, but this really didn't help me. I finally ended up installing the latest unstable kernel which is linux-image-4.0.0-2-amd64 and I was finally able to install the virtio stuff without the guest OS crashing. I have no idea if that was the actual fix, but it seemed to be the relevant change.

Another thing that took a while to figure out was how to properly pass through my Blu-ray drive to Windows so that things like AnyDVD HD worked correctly. I finally stumbled across this PDF, which actually included qemu commands for doing the passthrough. It ended up being a simple change from scsi-cd to scsi-block.

I also had to forcibly set the GPU and audio drivers under Windows to use MSI by following these directions. Before doing this, audio was atrocious and video was pretty awful too.

That's most of it, I think. When I originally posted this, I still wasn't quite happy with the performance of everything. However, in the current incarnation, aside from the possibly excessive power consumption caused by keeping the CPUs running at full tilt, I'm actually really happy with the performance. Hopefully other people will find this useful too!

DRBD v2 posted Wed, 08 May 2013 10:12:17 CDT

Previously I had written a fairly lengthy post on creating a cheap SAN using DRBD, iSCSI, and corosync/pacemaker. It was actually the second time we had done this setup at work, having originally done iSCSI LUNs as logical volumes on top of a single DRBD resource instead of what I described in my last post, where the iSCSI LUNs were themselves separate DRBD resources on top of local logical volumes on each node of the cluster. Having run with that for a while, and added around forty LUNs, I will say that it is rather slow at migrating from the primary to the secondary node, and it only gets slower as we continue to add new DRBD resources.

Since we're in the process of setting up a new DRBD cluster, we've decided to go back to the original design of iSCSI LUNs as logical volumes on top of one large, single DRBD resource. I'll also mention that we had some real nightmares using the latest and greatest Pacemaker 1.1.8 in Red Hat Enterprise Linux 6.4, so we're also pegging our cluster tools at the previous versions of everything, which shipped in 6.3. Maybe the 6.4 stuff would have worked if we were running the cluster in the more traditional Red Hat way (using CMAN).

So now our sl.repo file specifies the 6.3 release:


[scientific-linux]
name=Scientific Linux - $releasever
baseurl=http://ftp.scientificlinux.org/linux/scientific/6.3/$basearch/os/
enabled=1
gpgcheck=0

And we've also added a newer version of crmsh which must be installed forcibly from the RPM itself as it overwrites some of the files in the RHEL 6.3 pacemaker packages:


rpm --replacefiles -Uvh http://download.opensuse.org/repositories/network:/ha-clustering/RedHat_RHEL-6/x86_64/crmsh-1.2.5-55.3.x86_64.rpm

We did this specifically to allow use of rsc_template in our cluster which cleans everything up and makes the configuration hilariously simple.

We've also cleaned up the corosync configuration a bit by removing /etc/corosync/service.d/pcmk and adding that to the main configuration, as well as making use of the key we generated using corosync-keygen by enabling secauth:


amf {
  mode: disabled
}
 
logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: no
  logfile: /var/log/cluster/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
    tags: enter|leave|trace1|trace2|trace3|trace4|trace6
  }
}
 
totem {
  version: 2
  token: 10000
  token_retransmits_before_loss_const: 10
  vsftype: none
  secauth: on
  threads: 0
  rrp_mode: active
 
 
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.165.0
    broadcast: yes
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.0.0.0
    broadcast: yes
    mcastport: 5405
  }
}

service {
  ver: 1
  name: pacemaker
}

aisexec {
  user: root
  group: root
}
 
corosync {
  user: root
  group: root
}

Other than that, there's only one DRBD resource now. And once it's configured, you shouldn't ever really need to touch DRBD at all; lvcreate happens only once, and only on the primary storage node. We've also learned that corosync-cfgtool -s may not always be the best way to check membership, so you can also check corosync-objctl | grep member.

We also ran across a DRBD related bug in 6.4 which seems to affect this mixed 6.3/6.4 environment as well. We're still using kmod-drbd84 from ELRepo, which is currently at version 8.4.2. Apparently the shipping version of 8.4.3 fixes the bug that causes /usr/lib/drbd/crm-fence-peer.sh to break things horribly under 6.4; the fixed script also seems to work better even with Pacemaker 1.1.7 under 6.3. I recommend grabbing the tarball for 8.4.3 and overwriting the script shipped with 8.4.2. I'm sure as soon as 8.4.3 is packaged and available on ELRepo, this won't be necessary.

You might want to set up a cronjob to run this DRBD verification script once a month or so:


#!/bin/sh

for i in $(drbdsetup show all | grep ^resource | awk '{print $2}' | sed -e 's/^r//'); do
	drbdsetup verify $i
	drbdsetup wait-sync $i
done

echo "DRBD device verification completed"

And maybe run this cluster backup script nightly just so you always have a reference point if something significant changes in your cluster:

#!/bin/bash

#define some variables
PATH=/bin:/sbin:/usr/bin:/usr/sbin
hour=$(date +"%H%M")
today=$(date +"%Y%m%d")
basedir="/srv/backups/cluster"
daily=$basedir/daily/$today
monthly=$basedir/monthly
lock="/tmp/$(basename $0)"

if test -f $lock; then
	echo "exiting; lockfile $lock exists; please check for existing backup process"
	exit 1
else
	touch $lock
fi

if ! test -d $daily ; then
	mkdir -p $daily
fi

if ! test -d $monthly ; then
	mkdir -p $monthly
fi


# dump and compress both CRM and CIB
crm_dumpfile="crm-$today-$hour.txt.xz"
if ! crm configure show | xz -c > $daily/$crm_dumpfile; then
	echo "something went wrong while dumping CRM on $(hostname -s)"
else
	echo "successfully dumped CRM on $(hostname -s)"
fi

cib_dumpfile="cib-$today-$hour.xml.xz"
if ! cibadmin -Q | xz -c > $daily/$cib_dumpfile; then
	echo "something went wrong while dumping CIB on $(hostname -s)"
else
	echo "successfully dumped CIB on $(hostname -s)"
fi

# keep a monthly copy
if test "x$(date +"%d")" == "x01" ; then
	monthly=$monthly/$today
	mkdir -p $monthly
	cp $daily/$crm_dumpfile $monthly
	cp $daily/$cib_dumpfile $monthly
fi

# remove daily backups after 2 weeks
for dir in $(find "$basedir/daily/" -type d -mtime +14| sort); do
	if test -d "$dir"; then
		echo "removing $dir"
		rm -rf "$dir"
	else
		echo "$dir not found"
	fi
done

# remove monthly backups after 6 months
for dir in $(find "$basedir/monthly/" -type d -mtime +180| sort); do
	if test -d "$dir"; then
		echo "removing $dir"
		rm -rf "$dir"
	else
		echo "$dir not found"
	fi
done

rm -f $lock

And finally, we have the actual cluster configuration itself, more or less straight out of production:


node salt
node pepper
rsc_template lun ocf:heartbeat:iSCSILogicalUnit \
	params target_iqn="iqn.2013-04.net.bitgnome:vh-storage01" additional_parameters="mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0" \
	op start interval="0" timeout="10" \
	op stop interval="0" timeout="10" \
	op monitor interval="10" timeout="10"
primitive fence-salt stonith:fence_ipmilan \
	params ipaddr="172.16.74.164" passwd="abcd1234" login="laitsadmin" verbose="true" pcmk_host_list="salt" \
	op start interval="0" timeout="20" \
	op stop interval="0" timeout="20"
primitive fence-pepper stonith:fence_ipmilan \
	params ipaddr="172.16.74.165" passwd="abcd1234" login="laitsadmin" verbose="true" pcmk_host_list="pepper" \
	op start interval="0" timeout="20" \
	op stop interval="0" timeout="20"
primitive ip ocf:heartbeat:IPaddr2 \
	params ip="172.16.165.24" cidr_netmask="25" \
	op start interval="0" timeout="20" \
	op stop interval="0" timeout="20" \
	op monitor interval="10" timeout="20"
primitive lun1 @lun \
	params lun="1" path="/dev/vg0/vm-ldap1"
primitive lun2 @lun \
	params lun="2" path="/dev/vg0/vm-test1"
primitive lun3 @lun \
	params lun="3" path="/dev/vg0/vm-mail11"
primitive lun4 @lun \
	params lun="4" path="/dev/vg0/vm-mail2"
primitive lun5 @lun \
	params lun="5" path="/dev/vg0/vm-www1"
primitive lun6 @lun \
	params lun="6" path="/dev/vg0/vm-ldap-slave1"
primitive lun7 @lun \
	params lun="7" path="/dev/vg0/vm-ldap-slave2"
primitive lun8 @lun \
	params lun="8" path="/dev/vg0/vm-ldap-slave3"
primitive lun9 @lun \
	params lun="9" path="/dev/vg0/vm-www2"
primitive lvm_vg0 ocf:heartbeat:LVM \
	params volgrpname="vg0" \
	op start interval="0" timeout="30" \
	op stop interval="0" timeout="30" \
	op monitor interval="10" timeout="30" depth="0"
primitive r0 ocf:linbit:drbd \
	params drbd_resource="r0" \
	op start interval="0" timeout="240" \
	op promote interval="0" timeout="90" \
	op demote interval="0" timeout="90" \
	op notify interval="0" timeout="90" \
	op stop interval="0" timeout="100" \
	op monitor interval="20" role="Slave" timeout="20" \
	op monitor interval="10" role="Master" timeout="20"
primitive tgt ocf:heartbeat:iSCSITarget \
	params iqn="iqn.2013-04.net.bitgnome:vh-storage01" tid="1" allowed_initiators="172.16.165.18 172.16.165.19 172.16.165.20 172.16.165.21" \
	op start interval="0" timeout="10" \
	op stop interval="0" timeout="10" \
	op monitor interval="10" timeout="10"
ms ms-r0 r0 \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location salt-fencing fence-salt -inf: salt
location pepper-fencing fence-pepper -inf: pepper
colocation drbd-with-tgt inf: ms-r0:Master tgt:Started
colocation ip-with-lun inf: ip lun
colocation lun-with-lvm inf: lun lvm_vg0
colocation lvm-with-drbd inf: lvm_vg0 ms-r0:Master
order drbd-before-lvm inf: ms-r0:promote lvm_vg0:start
order lun-before-ip inf: lun ip
order lvm-before-lun inf: lvm_vg0 lun
order tgt-before-drbd inf: tgt ms-r0
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-6.el6-abcd1234" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	no-quorum-policy="ignore" \
	stonith-enabled="true" \
	last-lrm-refresh="1368030674" \
	stonith-action="reboot"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

The great part about this configuration is that the constraints are all tied to the rsc_template, so you don't need to specify new constraints each time you add a new LUN. And because we're using a template, the actual LUN primitives are as short as possible while still uniquely identifying each unit. It's quite lovely really.

the hell of Java keystores and existing server certificates posted Thu, 08 Nov 2012 07:56:16 CST

As a semi-conscientious netizen, I feel it's my duty to post about the insanity of dealing with Java keystore files when you already have X.509 PEM encoded certificates and intermediate CA certificates. I spent multiple hours over the last few days trying to grok this mess and I never want to spend another moment of my life trying to reinvent the wheel when I have to do this again several years from now.

Like most server administrators, probably, I have an existing set of signed server certificates along with a bundle of CA-signed intermediate certificates, all in X.509 PEM format (those base64 encoded ASCII text files that everyone knows and loves in the Unix world; if you're using Windows and have PKCS #12 encoded files, you'll need to look up how to convert them using the openssl command). But now I need to deploy something Java based (often a Tomcat application) which requires a Java keystore file instead of the much saner X.509 PEM format that practically everything that isn't Java uses without any problems. This is where the insanity starts. And yes, I realize that newer versions of Tomcat can use OpenSSL directly, which allows you to use X.509 PEM encoded files as well, but that wasn't an option here. And yes, I also realize that you could do some crazy wrapper setup with Apache on the front and the Tomcat application on the back. But that's ludicrous just to work around how idiotic Java applications are about handling SSL certificates.

Every other piece of Unix software I've ever configured expects a server certificate and private key and possibly a single or even multiple intermediate certificates to enable SSL or TLS functionality. Granted, some applications are better about explicitly supporting intermediate certificates. But even ones that don't almost always allow you to concatenate all of your certificates together (in order from least to most trusted; so, server certificate signed by some intermediate signed by possibly another intermediate signed by a self-signed CA certificate, where the top level CA certificate is normally left off of the chain). The point is, the end client ends up getting said blob and can then check that the last intermediate certificate is signed by a locally trusted top level CA certificate already present on the client's device.
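Building such a blob is a one-liner; here's a sketch with placeholder file contents standing in for real PEM blocks (the file names are assumptions, chosen to match the keystore example further down):

```shell
# Placeholder contents; real files hold base64 PEM certificate blocks.
printf -- '-----SERVER CERT-----\n' > server.crt
printf -- '-----INTERMEDIATE CERT-----\n' > intermediate.crt
# Least- to most-trusted order: server certificate first, then
# the intermediate(s), top-level CA normally left off.
cat server.crt intermediate.crt > server+intermediate.crt
head -n 1 server+intermediate.crt   # prints -----SERVER CERT-----
```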

All of the documentation I could find says to import the intermediate and CA certificates into the keystore using the -trustcacerts option and different aliases. The problem I was seeing, though, was that testing the validity of my server's certificate after installing the keystore this way using OpenSSL's s_client always resulted in the server certificate not validating. Looking at s_client with -showcerts enabled, all I ever got back from the server during the initial SSL handshake was the lone server certificate without any of the intermediate certificates, unlike any of my other Apache or nginx servers, where the entire certificate blob was passed from the server to the client, allowing s_client to verify that the certificate was in fact trusted by my local CA bundle installed as part of my operating system. If you want to try validating your own server's certificate, use something like:

openssl s_client -CAfile /etc/ssl/certs/ca-certificates.crt -showcerts -connect www.bitgnome.net:443

I finally ran across a post which mentioned keystore, part of the vt-crypt project. This turned out to be the key to making everything work the way I normally expect it to work.

Now, before you run off to do the magic below, you will need to convert your PKCS #8 PEM formatted private key into a DER formatted key, with something like:

openssl pkcs8 -topk8 -nocrypt -outform DER -in server.key -out server.pkcs8

The handy thing about keystore is that it will ingest a standard X.509 PEM encoded certificate file, even one with multiple certificates present, and spit out that desperately needed Java keystore with an alias that actually contains multiple certificates as well! I include the magic here for demonstration purposes:

~/vt-crypt-2.1.4/bin/keystore -import -keystore test.jks -storepass changeit -alias tomcat -cert server+intermediate.crt -key server.pkcs8

That's it! The test.jks keystore doesn't need to exist beforehand; this command will create it. Check to make sure that the keystore now contains the correct information:

~/vt-crypt-2.1.4/bin/keystore -list -keystore test.jks -storepass changeit

and you should see your certificate chain starting with your server certificate and ending with your last intermediate certificate. Once I installed the keystore in my application, s_client was able to successfully verify the now complete chain of trust from server certificate to my locally trusted CA root certificate.

pretty VIM colors posted Mon, 22 Oct 2012 11:29:57 CDT

This one is more for myself so I don't forget about it (and I can find it again later). There is a nifty project here that is storing a repository of VIM color settings. It's a damn slick interface and well worth checking out if you're a VIM user.

Yep.

ZFS on Linux posted Fri, 12 Oct 2012 22:01:03 CDT

Since I'm on a roll here with my posts (can you tell I'm bored on a Friday night?), I figured I would also chime in here a bit with my experiences using ZFS on Linux.

Quite some time ago now, I posted about OpenSolaris and ZFS. Fast forward a few years, and I would beg you to pretty much ignore everything I said then. The problem, of course, is that OpenSolaris doesn't really exist now that the asshats at Oracle have basically ruined anything good that ever came out of Sun Microsystems, post-acquisition. No real surprises there, I guess. I can't think of anyone I've known over the years who actually likes Oracle as a company. They've managed to bungle just about everything they've ever touched and continue to do so in spades.

Now, the knowledgeable reader might say at this point, but what about all of the forks? Sorry folks, I just don't see a whole lot of traction in any of these camps. Certainly not enough to warrant dropping all of my data onto any of their platforms anyway. And sure, you could run FreeBSD to get ZFS. But again, it seems to me the BSD camp in general has been dying the death of a thousand cuts over the years and continues to fade away into irrelevance (to be fair, I'm still rooting for the OpenBSD project; but I'd probably just be content to get PF on Linux at some point and call it a day).

What I'm trying to say, of course, is that Linux has had the lion's share of real capital resources funding development and maintenance for years on end now. So while you might not agree with everything that's happened over the years (devfs anyone? hell, udev now?), it's hard to argue that Linux can't do just about anything you want to do with a computer platform nowadays, whether that be the smartphone in your pocket or the several-thousand-node supercomputer at your local university, and everything in between.

Getting back to the whole point of this post, the one thing that is glaringly missing from the Linux world still is ZFS. Sure, Btrfs is slowly making its way out of the birth canal. But it's still under heavy development. And while I thought running ReiserFS v3 back in the day was cool and fun (you know, before Hans murdered his wife) when ext2 was still the de facto file system for Linux, I simply refuse to entrust the several terabytes of storage I have at home now to Btrfs on the off chance it corrupts the entire file system.

So, where does that leave us? Thankfully the nice folks over at Lawrence Livermore National Laboratory, under a Department of Energy contract, have done all the hard work in porting ZFS to run on Linux natively. This means that you can get all the fantastic data integrity which ZFS provides on an operating system that generally doesn't suck! Everyone wins!

Now I've known about the ZFS on FUSE project for a while along with the LLNL project. I've stayed away from both because it just didn't quite seem like either was ready for prime time just yet. But I finally took the plunge a month or so ago and copied everything off a dual 3.5" external USB enclosure I have for backups which currently has two 1.5TB hard drives in it and slapped a ZFS mirror onto those puppies. I'm running all of this on the latest Debian testing kernel (3.2.0-3-amd64 at the moment) built directly from source into easily installable .deb packages, and I must say, I'm very impressed thus far.

Just knowing that every single byte sitting on those drives has some kind of checksum associated with it thrills me beyond rational understanding. I had been running a native Linux software RAID-1 array previously using mdadm. And sure, it would periodically check the integrity of the RAID-1 mirror just like my zpool scrub does now. But I just didn't have the same level of trust in the data like I do now. As great as Linux might be, I've still seen the kernel flip out enough times doing low level stuff that I'm always at least a little bit leery of what's going on behind the scenes (my most recent brush with disaster was with the same mdadm subsystem trying to do software RAID across 81 multipath connected SAS drives, and we ended up buying hardware RAID cards instead of continuing to deal with how broken that whole configuration was; and that was earlier this year).

My next project will most likely involve rebuilding my Linux file server at home with eight 2-3TB hard drives and dumping the entirety of my multimedia collection onto a really large RAID-Z2 or RAID-Z3 ZFS volume. I've actually been looking forward to it. Now just as soon as someone starts selling large capacity SATA drives at a reasonable rate, I'll probably buy some up and go to town.
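
For the curious, the rough capacity math on that plan is easy to sketch in shell. The drive count, size, and parity level here are just the hypothetical numbers I threw out above:

```shell
# back-of-the-envelope usable capacity for the planned pool
# (ignores metadata overhead and base-2 vs. base-10 marketing sizes)
drives=8      # total drives in the vdev
size_tb=3     # capacity per drive in TB
parity=2      # RAID-Z2; use parity=3 for RAID-Z3
usable=$(( (drives - parity) * size_tb ))
echo "approx usable: ${usable}TB"
```

With eight 3TB drives in RAID-Z2 that's roughly 18TB usable before overhead; drop to RAID-Z3 and it's 15TB.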

DRBD, iSCSI, and Linux clustering == cheap SAN solution posted Fri, 12 Oct 2012 21:04:21 CDT

As promised, here are my notes for building a homemade, pennies-on-the-dollar SAN solution on the off chance you've been recently eyeballing one of those ludicrously expensive commercial offerings and you've come to the conclusion that yes, they are in fact ludicrously expensive. While I'm normally a Debian user personally, these notes will be geared towards Red Hat-based distributions since that's what I have the (mis)fortune of using at work. But whatever. It should be easy enough to adapt to whichever distribution you so choose. It's also worth mentioning that I originally did almost this exact same configuration, but with a single DRBD resource and LVM managed on top of it. Both approaches have their merits, but I prefer this method instead.

There are a couple of things to note with the following information. First, in all cases where we are creating resources inside of Pacemaker, we're going to be specifying the operational parameters based on the advisory minimums which you can view by typing something like:


crm ra meta ocf:heartbeat:iSCSITarget

or whichever resource agent provider you wish to view. Also, for this particular instance, we will be running tgtd directly at boot time instead of managing the resource via the cluster stack. Since the example documentation from places like the DRBD manual is implementation agnostic and tgtd can be running all the time on both nodes without causing any problems, we'll just start the service at boot and assume that it's always running. If we have problems with tgtd segfaulting for whatever reason, we will need to add a provider based on the lsb:tgtd resource agent which directly manages the starting and stopping of tgtd.

As a final preliminary, you will probably want to name the LVM volume group on each machine identically as it will simplify the DRBD resource configuration below. I need to correct myself here actually. If you try to specify the disk as an inherited option, then the device path becomes /dev/drbd/by-res/r1/0 instead of just /dev/drbd/by-res/r1. Since we're not using volumes here, I prefer the latter syntax. But go ahead and name the volume group the same anyway just to make life easier.

Currently we're using the Scientific Linux repositories for installing the necessary, up-to-date versions of all the various cluster related packages (we could also have used the RHEL DVD initially, but then we wouldn't be getting any updates for these packages past the initial version available on the DVD). In order to use the SL repositories, we will install the yum-plugin-priorities package so that the official RHEL repositories take precedence over the SL repositories.


yum install yum-plugin-priorities

Once that is installed, you really only need to configure the RHN repositories to change from the default priority of 99 to a much higher priority of 20 (arbitrary choice to allow some higher priorities if necessary). So /etc/yum/pluginconf.d/rhnplugin.conf should now look something like:


[main]
enabled = 1
gpgcheck = 1

# You can specify options per channel, e.g.:
#
#[rhel-i386-server-5]
#enabled = 1
#
#[some-unsigned-custom-channel]
#gpgcheck = 0

priority=20

[rpmforge-el6-x86_64]
exclude=nagios*
priority=99

Once that is configured, we can add the actual SL repository by doing cat > /etc/yum.repos.d/sl.repo:


[scientific-linux]
name=Scientific Linux - $releasever
#baseurl=http://mirror3.cs.wisc.edu/pub/mirrors/linux/scientificlinux.org/$releasever/$ARCH/SL/
baseurl=http://ftp.scientificlinux.org/linux/scientific/6/$basearch/os/
# baseurl=http://centos.alt.ru/repository/centos/5/$basearch/
enabled=1
gpgcheck=0
#includepkgs=*xfs* cluster-cim cluster-glue cluster-glue-libs clusterlib cluster-snmp cman cmirror corosync corosynclib ctdb dlm-pcmk fence-agents fence-virt gfs-pcmk httpd httpd-tools ipvsadm luci lvm2-cluster modcluster openais openaislib pacemaker pacemaker-libs pexpect piranha python-repoze-who-friendlyform resource-agents rgmanager ricci tdb-tools

For DRBD, we will want to use the ELRepo repository. You can find the instructions for installing this repository here. We will be using v8.4 (as of this writing) of DRBD.

Now that everything is configured correctly, we can start installing the necessary packages:


yum install corosync pacemaker fence-agents kmod-drbd84 scsi-target-utils

For now (RHEL 6.x; 2.6.32), we'll be using the older STGT iSCSI target as LIO wasn't included in the Linux kernel until 2.6.38. Newer versions of Red Hat or Linux in general will probably require updated instructions here and below.

The instructions for configuring the cluster itself can generally be found here. I will include the necessary pieces below just in case that page is unavailable for whatever reason.

You need to run:


corosync-keygen

Next you need to do cat > /etc/corosync/service.d/pcmk:


service {
   ver: 1
   name: pacemaker
}

And then you need cat > /etc/corosync/corosync.conf (appropriately configured):


compatibility: whitetank

amf {
  mode: disabled
}

logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  logfile: /var/log/cluster/corosync.log
  debug: on
  tags: enter|leave|trace1|trace2|trace3
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: on
  }
}

totem {
  version: 2
  token: 5000
  token_retransmits_before_loss_const: 20
  join: 1000
  consensus: 7500
  vsftype: none
  max_messages: 20
  secauth: off
  threads: 0
  rrp_mode: passive


  interface {
    ringnumber: 0
    bindnetaddr: 172.16.165.0
    broadcast: yes
    mcastport: 5405
    ttl: 1
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.0.0.0
    broadcast: yes
    mcastport: 5405
    ttl: 1
  }
}

aisexec {
  user: root
  group: root
}

corosync {
  user: root
  group: root
}

Note that the above configuration assumes that you have a second interface directly connected between both machines. That should already be configured, but it should look something like this in /etc/sysconfig/network-scripts/ifcfg-bond1 or something similar:


DEVICE=bond1
NM_CONTROLLED=yes
ONBOOT=yes
BOOTPROTO=none
IPV6INIT=no
USERCTL=no
IPADDR=10.0.0.134
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 updelay=200 downdelay=200 mode=4"

and ifcfg-eth4 along with another device if bonding like:


DEVICE=eth4
HWADDR=00:10:18:9e:0f:00
NM_CONTROLLED=yes
ONBOOT=yes
MASTER=bond1
SLAVE=yes

To make sure the two cluster machines can see each other completely, make sure to modify /etc/sysconfig/iptables to include something like:


:iscsi-initiators - [0:0]

-A INPUT -i bond1 -j ACCEPT

-A INPUT -m comment --comment "accept anything from cluster nodes"
-A INPUT -s 172.16.165.10,172.16.165.11 -m state --state NEW -j ACCEPT

-A INPUT -m comment --comment "accept iSCSI"
-A INPUT -p tcp --dport 3260 -m state --state NEW -j iscsi-initiators

-A iscsi-initiators -m comment --comment "only accept iSCSI from these hosts"
-A iscsi-initiators -s 172.16.165.18,172.16.165.19,172.16.165.20,172.16.165.21 -j ACCEPT
-A iscsi-initiators -j RETURN

And to make the cluster configuration simpler, we want to use the shortened host name for each machine. Modify /etc/sysconfig/network to look something like this:


NETWORKING=yes
HOSTNAME=salt

and modify /etc/hosts to make sure both cluster nodes always know where to find the other by name:


127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.16.165.11 salt.bitgnome.net salt
172.16.165.10 pepper.bitgnome.net pepper

If you don’t want to reboot, use hostname to force the changes now:


hostname salt

At this point, configure everything to start automatically and start the services:


chkconfig corosync on
chkconfig pacemaker on
chkconfig tgtd on
service corosync start
service pacemaker start
service tgtd start

You should now have a running cluster which you can check the status of (from either node) using:


corosync-cfgtool -s
crm_mon -1

For the rest of this configuration, any commands which somehow modify the cluster configuration can most likely be run from either cluster node.

A very useful command to dump the entire cluster stack and start over (except for the nodes themselves) is:


crm configure erase

If you end up with ORPHANED resources after doing the above, you might also need to do something like:


crm resource cleanup resource-name

where resource-name is of course the name of the resource showing as ORPHANED. It is worth mentioning though that this will most likely not stop or remove the actual resource being referenced here. It will just remove it from the cluster’s awareness. If you had a virtual IP address resource here for example, that IP would most likely still be configured and up on the node which was last assigned that resource. It might be worth rebooting any cluster nodes after clearing the configuration to guarantee everything has been cleared out as thoroughly as possible short of deleting directories on the file system entirely.

You might also need to look at something like:


crm_resource -L
crm_resource -C -r r0

to get the last lingering pieces left on the LRM side of things.

You can verify the clean configuration with both:


crm configure show
cibadmin -Q

making sure that there are no LRM resources left in the cibadmin output also.

Anyway, moving right along… In order for our two-node cluster to maintain quorum, we must set the following:


crm configure property no-quorum-policy=ignore

While initially configuring the cluster, resources will not be started unless you disable STONITH. You can either issue the following:


crm configure property stonith-enabled=false

or you can go ahead and set up STONITH correctly. To do so, you need to create fencing primitives for every node in the cluster. The parameters for each primitive will come from the IPMI LAN configuration for the DRAC, BMC, iLO, or whatever other type of dedicated management card is installed in each node. To see the different possible fencing agents and their parameters, do:


stonith_admin --list-installed
stonith_admin --metadata --agent fence_ipmilan

We’re going to use the generic IPMI LAN agent for our Dell DRAC’s even though there are dedicated DRAC agents because IPMI is simply easier and you don’t have to do anything special like you do with the DRAC agents (and it can vary from one DRAC version to the next). We also need to sticky the primitive we create with a second location command:


crm configure primitive fence-salt stonith:fence_ipmilan \
    params ipaddr="172.16.74.153" \
    passwd="abcd1234" \
    login="laitsadmin" \
    verbose="true" \
    pcmk_host_list="salt" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20"
crm configure location salt-fencing fence-salt -inf: salt

Make sure to do this for each cluster node. Once you’ve done this, you can test it by first shutting down the cluster on one of the nodes (and whatever else you might want to do: file system sync, read-only mounts, whatever you feel safest doing since you’re about to yank the power plug essentially) and then shooting it in the head:


service pacemaker stop
service corosync stop
(sync && mount -o remount,ro / && etc.)
stonith_admin --fence salt

You will probably want to test this on each node just to confirm that the IPMI configuration is correct for every node.

Next we want to alter the behavior of Pacemaker a bit by configuring a basic property known as resource stickiness. Out of the box, when a node that previously hosted resources rejoins the cluster after a failure, Pacemaker will automatically migrate all of those resources back to it, flipping the node that took over in the meantime back to passive. This is not really something we need for our set of resources, so we want to inform Pacemaker to leave resources where they are unless we manually move them ourselves or the active node fails:


crm configure property default-resource-stickiness=1

To set up a resource for a shared IP address, do the following:


crm configure primitive ip ocf:heartbeat:IPaddr2 \
    params ip="172.16.165.12" \
    cidr_netmask="25" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20" \
    op monitor interval="10" timeout="20"

Next we need to set up our iSCSI target (note the escaped quotes to prevent bad shell/CRM interaction):


crm configure primitive tgt ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2012-10.net.bitgnome:vh-storage" \
    tid="1" \
    allowed_initiators=\"172.16.165.18 172.16.165.19 172.16.165.20 172.16.165.21\" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10"

Now before defining our iSCSI logical units, let’s check our DRBD configuration. The standard DRBD configuration in /etc/drbd.conf should look like:


include "drbd.d/global_common.conf";
include "drbd.d/*.res";

Configuring the basic options in /etc/drbd.d/global_common.conf should look like:


global {
	usage-count no;
	# minor-count should be larger than the number of active resources
	# depending on your distro, larger values might not work as expected
	minor-count 100;
}

common {
	handlers {
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
	}

	startup {
	}

	disk {
		resync-rate 100M;
		on-io-error detach;
		fencing resource-only;
	}

	net {
		protocol C;
		cram-hmac-alg sha1;
		shared-secret "something_secret";
		# the following is not recommend in production because of CPU costs
		#data-integrity-alg sha1;
		verify-alg sha1;
	}
}

And finally, you need a resource file for each resource. Before we get to that file, we need to create the logical volumes on each node which will ultimately hold this new DRBD resource. To do that, we need to issue something like the following:


lvcreate -L 20G -n vm-test vg0

Use the same name (vm-test in this example) on the other node as well (with hopefully the same volume group name to make the next part easier). Now that we have the logical volume created, we can go ahead and create an appropriate resource file for DRBD. Start at r1 and increase the resource number by one to keep things simple and to match our LUN numbering later on, so the file will be /etc/drbd.d/r1.res:


resource r1 {
	disk {
		#resync-after r1;
	}

	# inheritable parameters
	device minor 1;
	meta-disk internal;

	on pepper {
		disk /dev/vg0/vm-test;
		address 172.16.165.10:7789;
	}

	on salt {
		disk /dev/vg0/vm-test;
		address 172.16.165.11:7789;
	}
}

You will need to uncomment the resync-after option and make the parameter refer to the last sequential resource number still in existence. This also means that if you remove a resource later, you will need to update the resource files to reflect any changes made. If you fail to make the changes, affected resources will fail to start and consequently the entire cluster stack will be down. This is a BAD situation. So, make the necessary changes as you remove old resources, and then issue the following on both nodes:


drbdadm adjust r2

or whatever the resource name that has been affected by a dangling reference to an old, recently removed resource.

Related to the sanity of the configuration files in general is the fact that even if you haven’t created or activated a resource in any way yet using drbdadm, the very presence of r?.res files in /etc/drbd.d can cause the cluster stack to stop working. The monitors that the cluster stack employs to check the health of DRBD in general require a 100% sane configuration at all times, including any and all files which might end in .res. This means that if you are creating new resources by copying the existing resource files, you need to either copy them to a name that doesn’t end in .res initially and then move them into place with the appropriately numbered resource name, or copy them to some other location first, and then move them back into place.

Also relevant is that when setting up new resources with a running, production stack, you will momentarily be forcing one of the two cluster nodes as the primary (as seen a few steps below here) to get the DRBD resource into a consistent state. When you do this, both nodes will start giving Nagios alerts because of the inconsistent state of the newly added resource. You’ll probably want to disable notifications until your new resources are in a consistent state again.

Under Red Hat Enterprise Linux, you will want to verify the drbd service is NOT set to run automatically, but go ahead and load the module if it hasn’t been already so that we can play around with DRBD:


chkconfig drbd off
modprobe drbd

The reason for not loading DRBD at boot is because the OCF resource agent in the cluster will handle this for us.

And then on each node, you need to issue the following commands to initialize and activate the resource:


drbdadm create-md r1
drbdadm up r1

At this point, you should be able to see something like:


cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by phil@Build64R6, 2012-04-17 11:28:08

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844
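
If you want to pull those fields out in a script, here's a quick sketch. It runs against a captured copy of the status line above so it's self-contained; on a live node you would read /proc/drbd itself:

```shell
# extract the connection state (cs:) and disk state (ds:) from a
# /proc/drbd status line; 'sample' is a canned copy of the line above
sample=' 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----'
cs=$(echo "$sample" | grep -o 'cs:[^ ]*' | cut -d: -f2)
ds=$(echo "$sample" | grep -o 'ds:[^ ]*' | cut -d: -f2)
echo "connection=$cs disk=$ds"
```

Handy for a quick Nagios-style check that complains whenever the connection state isn't Connected or the disks aren't UpToDate.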

And finally, you need to tell DRBD which node is considered the primary. Since neither node’s logical volume should have had anything useful on it when we started this process, go with the node where resources are currently active (check the output from crm_mon to find the node currently hosting the storage virtual IP address) so that we can add the resource to the cluster stack immediately. Be warned: if you fail to set the current master node as the primary for this newly defined resource, or you add the resource to the cluster stack before it is consistent, you will bring down the entire cluster stack until both nodes are consistent. Then, on that node only, run:


drbdadm primary --force r1

As an example, let’s go ahead and create a second DRBD resource. The configuration in /etc/drbd.d/r2.res will look like:


resource r2 {
	disk {
		resync-after r1;
	}

	# inheritable parameters
	device minor 2;
	meta-disk internal;

	on pepper {
		disk /dev/vg0/vm-test2;
		address 172.16.165.10:7790;
	}

	on salt {
		disk /dev/vg0/vm-test2;
		address 172.16.165.11:7790;
	}
}

The most notable differences here are the resource name change itself, the device minor number bump, and the port number bump. All of those need to increment for each additional resource, along with the resync-after directive.
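
Since the numbering is entirely mechanical (device minor N, port 7788+N, resync-after the previous resource, judging from the two examples above), you could stamp out new resource files with a little generator like the following. To be clear, gen_res and its conventions are my own hypothetical helper, not anything shipped with DRBD:

```shell
# hypothetical helper: emit the text of /etc/drbd.d/rN.res for a new
# resource; usage: gen_res N lvname
gen_res() {
    n=$1 lv=$2
    prev=$(( n - 1 ))       # resync-after the previous resource
    port=$(( 7788 + n ))    # r1 used port 7789, so port = 7788 + N
    cat <<EOF
resource r${n} {
	disk {
		resync-after r${prev};
	}

	# inheritable parameters
	device minor ${n};
	meta-disk internal;

	on pepper {
		disk /dev/vg0/${lv};
		address 172.16.165.10:${port};
	}

	on salt {
		disk /dev/vg0/${lv};
		address 172.16.165.11:${port};
	}
}
EOF
}

gen_res 3 vm-test3    # text for a would-be /etc/drbd.d/r3.res
```

Per the warning below about stray .res files, redirect the output somewhere outside /etc/drbd.d first and only move it into place once you're happy with it.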

So, now we have some DRBD resources. Let’s set up the cluster to be aware of them. For each DRBD resource we add to the cluster, we need to define two separate cluster resources, a basic primitive resource and a master-slave resource. The catch is, they both must be defined at the same time and in the correct order. To accomplish this, do the following:


cat << EOF | crm -f -
cib new tmp
cib use tmp
configure primitive r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op start interval="0" timeout="240" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval=20 timeout=20 role="Slave" \
        op monitor interval=10 timeout=20 role="Master"
configure ms ms-r1 r1 \
        meta master-max="1" \
        master-node-max="1" \
        clone-max="2" \
        clone-node-max="1" \
        notify="true"
cib commit tmp
cib use live
cib delete tmp
EOF

And now we can define iSCSI LUN targets for each of our DRBD resources. That looks like:


crm configure primitive lun1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2012-10.net.bitgnome:vh-storage" \
    lun="1" \
    path="/dev/drbd/by-res/r1" \
    additional_parameters="mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0" \
    op start interval="0" timeout="10" \
    op stop interval="0" timeout="10" \
    op monitor interval="10" timeout="10"

Lastly, we need to tie all of the above together into the proper order and make sure all the resources end up in the same place via colocation. Since these all go together logically, I’ll use the same construct as above when adding DRBD resources to add all of these constraints at the same time (this is coming from a configuration with 3 DRBD resources and LUN’s defined):


cat << EOF | crm -f -
cib new tmp
cib use tmp
configure colocation ip-with-lun1 inf: ip lun1
configure colocation ip-with-lun2 inf: ip lun2
configure colocation ip-with-lun3 inf: ip lun3
configure colocation lun-with-r1 inf: lun1 ms-r1
configure colocation lun-with-r2 inf: lun2 ms-r2
configure colocation lun-with-r3 inf: lun3 ms-r3
configure colocation r1-with-tgt inf: ms-r1:Master tgt:Started
configure colocation r2-with-tgt inf: ms-r2:Master tgt:Started
configure colocation r3-with-tgt inf: ms-r3:Master tgt:Started
configure order lun1-before-ip inf: lun1 ip
configure order lun2-before-ip inf: lun2 ip
configure order lun3-before-ip inf: lun3 ip
configure order r1-before-lun inf: ms-r1:promote lun1:start
configure order r2-before-lun inf: ms-r2:promote lun2:start
configure order r3-before-lun inf: ms-r3:promote lun3:start
configure order tgt-before-r1 inf: tgt ms-r1
configure order tgt-before-r2 inf: tgt ms-r2
configure order tgt-before-r3 inf: tgt ms-r3
cib commit tmp
cib use live
cib delete tmp
EOF

That's pretty much it! I highly recommend using crm_mon -r to view the health of your stack. Or if you prefer the graphical version, go grab a copy of LCMC here.

long time, no see posted Fri, 12 Oct 2012 20:33:16 CDT

Well, it's been over two years now since my last blog post, not that anyone was paying attention. It's been so long in fact, I actually had to revisit how I was even storing these posts in the SQLite table I'm using to back this blog system. Damned if I didn't simply choose to use standard HTML tags within the body of these things! Now I remember why I did real time validation via Javascript while editing the body. It's all coming back to me now...

Since I spend most of my waking hours delving ever deeper into the realm of Linux system administration specifically and computers and technology much more generally, I'm going to try to start posting more actual knowledge here, if for no one's benefit other than my own down the road. I've spent such a large amount of my time recently building and modifying Linux clusters that I feel it would be a total waste not to put it someplace publicly. Hopefully you'll even see that post not too long after this one.

Anyway, back to your regularly scheduled boredom for the time being. Oh, and Guild Wars 2 is proving to be one of the best, all around MMORPG's I've experienced to date. Keep an eye out for any upcoming sales if you want to dump a whole bunch of your real life into a completely meaningless virtual one.

Giganews header compression posted Mon, 24 May 2010 21:22:38 CDT

After messing around a bit with TCL, I finally figured out how to read the compressed headers from Giganews. Yay.

Thanks to a post over here, I was able to start with the basic NNTP conversation and add the rest I pieced together over the past couple of nights. My version with compression and SSL looks like this:

#!/usr/bin/tclsh

# load tls package
package require tls

# configure socket
set sock [tls::socket news.giganews.com 563]
fconfigure $sock -encoding binary -translation crlf -buffering none

# authenticate to GN
puts stderr [gets $sock]
puts stderr "sending user name"
puts $sock "authinfo user xxxxxxxx"
puts stderr [gets $sock]
puts stderr "sending password"
puts $sock "authinfo pass yyyyyyyy"
puts stderr [gets $sock]

# enable compression
puts stderr "sending xfeature command"
puts $sock "xfeature compress gzip"
puts stderr [gets $sock]

# set group
puts stderr "sending group command"
puts $sock "group giganews.announce"
puts stderr [gets $sock]

# issue xover command based on group posts
puts stderr "sending xover command"
puts $sock "xover 2-48"
set resp [gets $sock]
puts stderr $resp

# if the response is 224, parse the results
if {[lindex [split $resp] 0] == "224"} {

# loop through uncompressed results
#       while {[gets $sock resp] > 0} {
#               if {$resp == "."} {
#                       puts stdout $resp
#                       break
#               }
#               puts stdout $resp
#       }

# loop through compressed results
        while {[gets $sock resp] > 0} {
                if {[string index $resp end] == "."} {
                        append buf [string range $resp 0 end-1]
                        break
                }
                append buf $resp
        }
}

# uncompress those headers!
puts -nonewline stdout [zlib decompress $buf]

# issue a quit command
puts stderr "sending quit"
puts $sock quit
puts -nonewline stderr [read $sock]

Feel free to take the results and run. I'm not sure if there is a limit to how many headers you can fetch in a single go. I imagine it's more or less limited to your local buffer size, so don't grab too many at a time (at least in TCL). Anything more aggressive would require some fine tuning no doubt. But this was all just proof of concept to see if I could make it work. Now to write my newzbin.com replacement!

fceux and multitap fun posted Sat, 16 Jan 2010 08:26:09 CST

I've been trying to get four player support in fceux working. I finally broke down sometime ago and wrote a couple of the programmers working on the project. It seems the SDL port of the game had missed a core change somewhere along the way to maintain working multitap support.

But after a quick look at the code apparently, one of the programmers got things back in working order, and now things like Gauntlet 2 and Super Off-road can be played in all their four player glory!

The mysterious option to use is fceux -fourscore 1 game.nes.

Around the same time, I discovered an undocumented nastiness. If you're foolish enough to change some of the options via the command line, specifically the --input1 gamepad option for example (which I'm fairly certain worked in some previous incarnation of fceu(x)), you will wonder why suddenly all of your controls have stopped working. Looking at the generated fceux.cfg, those options should now be --input1 GamePad.0 for example. Use GamePad.1 through GamePad.3 for the others. If you just leave things alone though, this will be the default.

avoiding potential online fraud posted Tue, 12 Jan 2010 13:46:39 CST

So I am looking at buying this spiffy new gadget, a Roku SoundBridge. I found someone wanting to get rid of a couple used ones for a reasonable price. The problem is, he replies telling me he doesn't accept PayPal, but cash or a money order will suffice. Wait, what!? Of course, he also assures me he's a reputable person and I can verify this by checking some other online forum where he apparently engages in some kind of online commerce. Well great.

In case you haven't already run into this before, this should be an immediate warning sign! I would think by 2010, everyone would understand the ins and outs of Internet commerce and both buyers and sellers would have the self awareness to educate themselves otherwise. Apparently not.

It's all a simple matter of trust. Do I know this person? Hell no. Should I trust this person to any measurable degree? Well, ideally yes. But it's an imperfect world full of people with varying values. Regardless of whether I think people should commit fraud, the fact of the matter is they do, every moment of every day. I'd love to accept the idea that people are generally honest and that everything will turn out just fine. But having been around the blocks a few times myself, I cannot.

So I do a little digging myself for online systems to safely manage online transactions. There is of course the aforementioned PayPal. It is not alone in its space, but I think it's safe to say, certainly the most recognized.

C.O.D.'s also came to mind. But apparently regardless of the carrier (USPS, UPS, FedEx), C.O.D.'s are absolutely useless and will most likely get you a whole lot of nothing as someone trying to sell an item for cash. There is a LOT of fraud happening in the C.O.D. world, so it's probably best to avoid it entirely.

And finally, there are the online escrow services. Escrow.com seems like a good place to start for such things. I did a little more digging to verify they were in fact a reputable entity, and as it turns out, such entities are fairly well regulated. In this particular case, you can check a governmental web site in California to verify they are a legitimate business and licensed by the state to conduct business as an escrow service. In my particular case, the minimum fee of $25 seems a little much since it's a significant percentage of the actual cost of the items. But it's well worth it if nothing else can be agreed upon.

So anyway, I hope someone eventually finds their way here, and any of this information proves useful. There are probably countless other businesses which provide similar services, but please make sure you try to verify the company is legitimate. Don't just accept that Better Business Bureau logo at the bottom of the very company's page of which you're trying to establish legitimacy. At the very least, don't send an unmarked wad of cash to someone you don't know. Seems like that goes without saying. But as David Hannum (not P. T. Barnum) said, "There's a sucker born every minute."

publishing real SPF resource records with tinydns posted Tue, 12 Jan 2010 13:45:36 CST

Since I just suffered a bit trying to figure this out on my own, I figured I'd blog about it so no one else has to. I was snooping around earlier, looking at my exim configuration and messing with my current SPF records. Thanks to the handy SPF tool here, I learned that there is now a dedicated SPF resource record (and there has been for a while, apparently, as defined in RFC 4408).

So being who I am, I immediately set out to discover how to publish such a record via tinydns, my chosen DNS server software.

Since the stock version of tinydns doesn't support the SPF record type directly, you're left using the generic declaration. My current TXT record for bitgnome.net is:


'bitgnome.net:v=spf1 a mx a\072arrakis.bitgnome.net -all:86400

The proper form of this as an actual SPF resource record in the generic tinydns format becomes:


:bitgnome.net:99:\047v=spf1 a mx a\072arrakis.bitgnome.net -all:86400

Now, if you're at all familiar with SPF records in general, the \072 will probably make sense as the octal code for a colon. The tricky part that had me confused was the \047, which happens to be an apostrophe. Running a command like dnsq txt bitgnome.net ns.bitgnome.net gave me a TXT record with the expected SPF string in the answer, but prepended with a single apostrophe.

Once I finally realized that it was giving me the length of the record in bytes, in octal (\047, or 39 bytes, for this particular record), everything finally clicked! I had initially prepended the records for my other domains with the exact same value and kept wondering why host -t spf bitgnome.com was returning ;; Warning: Message parser reports malformed message packet.!

So simply take the length of the SPF string in bytes (everything from v= to the end of the string, -all in my case), convert it from decimal to octal, slap it on the front of that generic record definition, and away you go!
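For anyone who hates counting bytes by hand, the length-and-octal step is a one-liner in shell. Note that the string below uses a literal colon purely to measure the on-the-wire length; in the actual tinydns data line that colon still has to be escaped as \072:

```shell
#!/bin/sh
# the SPF string, everything from v= through -all
spf='v=spf1 a mx a:arrakis.bitgnome.net -all'

# length in bytes, then the same value as a three-digit octal escape
printf 'length: %d bytes -> \\%03o\n' "${#spf}" "${#spf}"
# prints: length: 39 bytes -> \047
```

That \047 is exactly the prefix that goes in front of the generic record above.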

progress, progress, and more progress posted Tue, 12 Jan 2010 11:14:02 CST

I've been banging my head against a few walls lately, all related to the inevitable, yet sometimes annoying march of progress.

The first problem to rear its ugly head was based on a somewhat recent change in the netbase package of Debian. Specifically, the sysctl net.ipv6.bindv6only was finally enabled to bring Debian up to snuff in relation to most other modern operating systems. This is all well and good, since IPv6 is fairly solid at this point, I imagine. The problem is that a few outlying programs weren't quite prepared for the change. In my case, the casualties were several Java interpreters (Sun and OpenJDK, or sun-java6-jre and openjdk-6-jre in Debian) and murmurd from the Mumble project (mumble-server in Debian).

I reported the murmurd problem on both SourceForge and the Debian bug tracker. The problem had actually already been fixed by the developers, it just hadn't made it into a release yet. That was all fixed though with Mumble 1.2.1. Along the way, I learned a lot more about IPV6_V6ONLY and RFC 3493 than I ever wanted.

Java required a workaround, since the problem hasn't yet been fixed in any release of the Sun or OpenJDK interpreters. All that was needed was -Djava.net.preferIPv4Stack=true added to my command line and voila, everything is happy again.

The other serious problem was that thanks to a recent SSL/TLS protocol vulnerability (CVE-2009-3555), several more things broke. The first problem was with stuff at work. I had been using the SSLCipherSuite option in a lot of our virtual host directives in Apache. The problem with that seemed to be that it always forced a renegotiation of the cipher being used, which would subsequently cause the session to die with "Re-negotiation handshake failed: Not accepted by client!?". Simply removing the SSLCipherSuite directive seemed to make all the clients happy again, but it's a lingering issue I'm sure, as is the whole mess in general since the protocol itself is having to be extended to fix this fundamental design flaw.

Along these same lines, I also ran into an issue trying to connect to my useful, yet often cantankerous Darwin Calendar Server. Everything was working just fine using iceowl to talk to my server's instance of DCS. And then, it wasn't. I'm fairly certain at this point that it's all related to changes made in Debian's version of the OpenSSL libraries, again, working around the aforementioned vulnerability. But the ultimate reality was, I couldn't connect with my calendar client any longer.

Once I pieced together that it was a problem with TLS1/SSL2, I simply configured my client to only allow SSL3. This works fine now with the self-signed, expired certificate which ships in the DCS source tree. I still can't manage to get things working with my perfectly valid GoDaddy certificate, but I'm happy with a working, encrypted, remote connection for the time being. My post to the user list describing the one change necessary to get Sunbird/iceowl working is here.

lighttpd, magnet, and more! posted Thu, 10 Dec 2009 11:44:26 CST

Today I was asked to deal with some broken e-mail marketing links we've been publishing for a while now. We previously handled these misprinted URIs via PHP, but since we moved all of our static content to lighttpd servers recently, this wasn't an option.

The solution, it turns out, was fairly straightforward. lighttpd fortunately allows some amount of arbitrary logic during each request using Lua as part of mod_magnet. So after installing lighttpd-mod-magnet on my Debian servers and enabling it, I ended up adding the following to my lighttpd configuration:

$HTTP["url"] =~ "^/logos" {
        magnet.attract-physical-path-to = ( "/etc/lighttpd/strtolower.lua" )
}

and the following Lua script:

-- helper function to check for existence of a file
function file_exists(path)
        local attr = lighty.stat(path)
        if (attr) then
                return true
        else
                return false
        end
end

-- main code block
-- look for the requested file first, then an all-lowercase version of the same
if (file_exists(lighty.env["physical.path"])) then
        lighty.content = { { filename = lighty.env["physical.path"] } }
        return 200
elseif (file_exists(string.lower(lighty.env["physical.path"]))) then
        lighty.content = { { filename = string.lower(lighty.env["physical.path"]) } }
        return 200
else
        -- top level domains to search through
        local tld = { ".com", ".net", ".info", ".biz", ".ws", ".org", ".us" }
        for i,v in ipairs(tld) do
                -- strip the trailing ".gif" and try each TLD in front of it
                local file = lighty.env["physical.path"]
                file = string.sub(file, 1, -5)
                if (file_exists(string.lower(file .. v .. ".gif"))) then
                        lighty.content = { { filename = string.lower(file .. v .. ".gif") } }
                        return 200
                end
        end
        return 404
end

And that was it! The script checks for the existence of the requested file; if that fails, it first tries an all-lowercase version of the path (since someone in our marketing department felt it would be a good idea to use mixed-case URIs in our marketing publications), and failing that, it also looks for the same file with a few top-level domains inserted into the name (again, brilliance by someone in marketing publishing crap with the wrong file names).

Failing all of that, you get a 404. Sorry, we tried.
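For what it's worth, before touching the servers I convinced myself the lookup order was right by mocking the same logic up in plain shell. This is just a throwaway sketch with made-up file names, not anything lighttpd actually runs:

```shell
#!/bin/sh
# mimic the Lua script's search order: exact path, lowercased path, then
# lowercased name with each TLD wedged in before the .gif extension
lookup() {
        path=$1
        [ -f "$path" ] && { echo "$path"; return 0; }

        lower=$(printf '%s' "$path" | tr '[:upper:]' '[:lower:]')
        [ -f "$lower" ] && { echo "$lower"; return 0; }

        base=${lower%.gif}
        for tld in .com .net .info .biz .ws .org .us; do
                [ -f "${base}${tld}.gif" ] && { echo "${base}${tld}.gif"; return 0; }
        done
        return 1   # this is where lighttpd would answer 404
}

# quick demonstration in a throwaway directory
cd "$(mktemp -d)"
touch logo.gif acme.com.gif
lookup Logo.gif   # prints logo.gif (lowercase fallback)
lookup Acme.gif   # prints acme.com.gif (TLD fallback)
```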

OpenSolaris and ZFS for the win! posted Wed, 09 Dec 2009 20:51:57 CST

The whole reason I started writing this little blog system was so I could point out technical crap I run across. I imagine this is why most technical folks start blogging actually. It's more of a reminder than anything else and it's nice to be able to go back and reference things later, long after I would have normally forgotten about them.

Anyway, I recently built a huge software-based SAN using OpenSolaris 2009.06 at work. The hardware was pretty basic stuff and included a SuperMicro MBD-H8Di3+-F-O motherboard, 2 AMD Istanbul processors (12 cores total), 32GB of RAM, and 24 Western Digital RE4-GP 7200RPM 2TB hard drives, all stacked inside a SuperMicro SC846E2-R900B chassis. The total cost was a little over $12,000.

Needless to say, this thing is a beast. Thanks to a little help from this blog post which pointed me to a different LSI firmware available here, I was able to see all 24 drives and boot from the hard drives. One thing to note though was that I did have to disable the boot support of the LSI controller initially to get booting from the OpenSolaris CD to work at all. Once I had installed everything, I simply went back into the controller configuration screen, and re-enabled boot support.

After getting everything up and running initially, it was then a matter of installing and configuring everything. I found the following pages to be rather invaluable in assisting me in this:

  • this will get you up and running almost all the way on the OpenSolaris side using the newer COMSTAR iSCSI target
  • this will get you the rest of the way on OpenSolaris with the all important views needing to be established at the end of it all
  • this will get your Windows machines set up to talk to this new iSCSI target on the network

And that should be all you need! So far things have been running well. My only problem was making the mistake of upgrading to snv_127, which is a development version. The MPT driver broke a bit somewhere between that and snv_111b, which is what the 2009.06 release is based on. The breakage would cause the system to bog down over time and eventually hang completely. Not acceptable behavior, to say the least, on our shiny new SAN. There are a few posts about this issue here and here. I'll just wait until the next stable version to upgrade at this point.

cute HTTP Apache rewrite trick posted Tue, 08 Dec 2009 11:33:17 CST

I ran across a rather neat trick over here recently. So I don't forget about it (since I will probably end up needing to use it at some point), I'm going to copy it here.

The idea is to avoid duplicate rules when handling both HTTP- and HTTPS-based URIs. The trick is as follows:

RewriteCond %{HTTPS} =on
RewriteRule ^(.+)$ - [env=ps:https]
RewriteCond %{HTTPS} !=on
RewriteRule ^(.+)$ - [env=ps:http]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)index\.html\ HTTP/ [NC]
RewriteRule ^.*$ %{ENV:ps}://%{SERVER_NAME}/%1 [R=301,L]

or the even shorter:

RewriteCond %{SERVER_PORT}s ^(443(s)|[0-9]+s)$
RewriteRule ^(.+)$ - [env=askapache:%2]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http%{ENV:askapache}://%{HTTP_HOST}/$1 [R=301,L]

my first real blog post posted Mon, 07 Dec 2009 12:39:14 CST

If you can see this, then my newly coded blog system is accepting posts as intended entirely via the web. Please take a moment to pray and give thanks to the wonders of modern science.