rants, tirades, ruminations
NixOS laptop homelab posted Wed, 13 Nov 2024 02:06:19 UTC
The Why
For years I’ve put off building any kind of homelab setup, mostly due to my unwillingness to “compromise” on non-x86 based devices and, therefore, ultimately due to cost.
I had been keeping an eye out for a while, since the actual price of just about any x86 based system has been fairly affordable for a long time. A lot of that affordability, though, came in the form of older Haswell to Skylake era Intel mini PCs, or newer but much lower end Atom and other similarly low powered, lower performing processors. I almost pulled the trigger on some well priced Lenovo ThinkCentre mini PCs, and I might still end up there at some point.
As I was browsing though, I threw laptops into the mix and happened across an eBay listing with four remaining Lenovo ThinkPad L14 Gen 1 Ryzen based systems (model 20U6S1VW00 specifically) for only $208 each when buying all four. Since they each came with 32GB of DDR4 RAM and a 512GB NVMe drive right out of the gate, as well as a more than competent 6-core/12-thread AMD Ryzen 5 Pro 4650U, I grabbed all four! They’ve got three PCIe 3.0 x4 slots (2x M.2 2280 and 1x M.2 2242) and support up to 64GB of RAM. As a starting point, they seem like a risk-free investment which I can always hand out to folks (or myself!) as replacement laptops if I end up not finding much use for them.
I’ve spent the last few days piecing together all the things which I think will make this an excellent homelab setup for me. And a fair amount of that is being facilitated by NixOS, as we’ll see.
So what did I want to be able to do exactly? The goal was to have gear that functioned essentially the same as enterprise gear that I could easily provision however I wanted anytime. I knew I would need to leverage my existing DHCP server on my NixOS router to pass along PXE information of some kind to netboot into a live Linux environment from which I would then be able to provision each machine however I chose. I also wanted to be able to send Wake-on-LAN requests to wake up the machines so they wouldn’t need to run 24x7, but turning them back on would be a cinch.
The What
A brief note about NixOS before I dive into the technical stuff. This isn’t the place to learn all things Nix. However, as I’ve experienced over the past year on my own Nix journey, the more examples of various styles of Nix incantations there are out in the wild, the more likely you are to see someone doing something that finally makes it click for you how all of this seeming magic actually works.
And NixOS lets you do some really incredible stuff. I’m not sure if it is, in fact, the one and only way to put together any operating system. But the more I fall into it, the more I feel like my mind is opening up to some greater knowledge of how systems are meant to be designed. The modular design and the level of functionality in the tooling are standout, best in class features.
All of which is to say, you could certainly do all of what follows with any number of other pieces of software. But the idea of using Nix to build an enterprise cluster of some kind seems like it should end up being far easier to capture and maintain long term as a NixOS configuration. Then we get all the infrastructure as code benefits that NixOS brings without slogging through the endless morass of Helm charts I see in my professional life. Maybe we’ll still end up there in the end? But even if we do, I want at least an easily reproducible base to start from whenever I want, mostly at the push of a remote button.
Here are just a few of the places I referenced as I worked my way through this:
- https://chengeric.com/homelab/
- https://carlosvaz.com/posts/ipxe-booting-with-nixos/
- https://www.haskellforall.com/2023/01/announcing-nixos-rebuild-new-deployment.html
- https://aldoborrero.com/posts/2023/01/15/setting-up-my-machines-nix-style/
I’m not sure that NixOS has a documentation problem as much as it has a flexibility problem. There are just a lot of different ways to ultimately arrive at the same configuration. But as I said earlier, the more people document their various recipes, the easier I think it will be in the future for people to continue adopting and contributing to Nix.
Having said that, I’m going to start dropping a bunch of Nix configuration snippets at this point. Most of them will be just that, snippets. You’ll need to fit them into however you’re doing things within your own configuration. I am using flakes here, so all the usual qualifiers apply. Feel free to come back later and pick through whichever pieces you might find useful if you’re not yet at a point where any of this makes sense. I’m not going to pretend to be any kind of subject matter expert here. It’s a lot to wrap your head around. But here’s a link to my paltry efforts, should you like to try to piece together everything below with all the rest:
I’m only using a handful of sops-nix secrets, so most everything is in the clear.
The How
DHCP, PXE and iPXE
My NixOS router, darkstar, was already running kea for DHCP services. I saw folks elsewhere using Pixiecore to supply all the other pieces necessary at this point. But I wanted to follow a more familiar design and provide the PXE information myself directly from my DHCP server:
services.kea.dhcp4 = {
  enable = true;
  settings = {
    interfaces-config.interfaces = [ "enp116s0" ];
    lease-database = {
      name = "/var/lib/kea/dhcp4.leases";
      persist = true;
      type = "memfile";
    };
    renew-timer = 900;
    rebind-timer = 1800;
    valid-lifetime = 3600;
This starts off as a fairly normal Nix service enablement block, which quickly morphs into virtually the exact JSON style configuration file syntax kea expects, just with equals signs instead of colons.
We start by binding the service to the router’s internal LAN interface, telling kea where and how to store its lease state, and setting some basic DHCP parameters that determine how often clients need to renew their leases.
We continue:
    option-data = [
      {
        name = "domain-name-servers";
        data = "192.168.1.1";
        always-send = true;
      }
      {
        name = "domain-name";
        data = "bitgnome.net";
        always-send = true;
      }
      {
        name = "ntp-servers";
        data = "192.168.1.1";
        always-send = true;
      }
    ];
Another fairly standard block. I’ve got a single, flat network right now so these are the options I’m handing out in every DHCP offer.
    client-classes = [
      {
        name = "XClient_iPXE";
        test = "substring(option[77].hex,0,4) == 'iPXE'";
        boot-file-name = "http://arrakis.bitgnome.net/boot/netboot.ipxe";
      }
      {
        name = "UEFI-64-1";
        test = "substring(option[60].hex,0,20) == 'PXEClient:Arch:00007'";
        next-server = "192.168.1.1";
        boot-file-name = "/etc/tftp/ipxe.efi";
      }
      {
        name = "UEFI-64-2";
        test = "substring(option[60].hex,0,20) == 'PXEClient:Arch:00008'";
        next-server = "192.168.1.1";
        boot-file-name = "/etc/tftp/ipxe.efi";
      }
      {
        name = "UEFI-64-3";
        test = "substring(option[60].hex,0,20) == 'PXEClient:Arch:00009'";
        next-server = "192.168.1.1";
        boot-file-name = "/etc/tftp/ipxe.efi";
      }
      {
        name = "Legacy";
        test = "substring(option[60].hex,0,20) == 'PXEClient:Arch:00000'";
        next-server = "192.168.1.1";
        boot-file-name = "/etc/tftp/undionly.kpxe";
      }
    ];
Here’s the kea side of the PXE configuration on the local network segment (remember, in Nix language format, not actual kea JSON format!). If any client making a DHCP request matches one of these test cases, then the extra options provided are passed along as part of the DHCP offer. This covers not only the initial PXE boot by any matching client but also the subsequent iPXE boot we chain into from the PXE boot environment. iPXE will use the URL provided in the first block to figure out which files to download, now via HTTP instead of PXE’s much slower TFTP, in order to continue booting into the custom NixOS installer image later.
    subnet4 = [
      {
        id = 1;
        subnet = "192.168.1.0/24";
        pools = [ { pool = "192.168.1.100 - 192.168.1.199"; } ];
        option-data = [
          {
            name = "routers";
            data = "192.168.1.1";
          }
        ];
        reservations = [
          { hw-address = "8c:8c:aa:4e:e9:8c"; ip-address = "192.168.1.11"; } # jupiter
          { hw-address = "38:f3:ab:59:06:e0"; ip-address = "192.168.1.12"; } # saturn
          { hw-address = "8c:8c:aa:4e:fc:aa"; ip-address = "192.168.1.13"; } # uranus
          { hw-address = "38:f3:ab:59:08:10"; ip-address = "192.168.1.14"; } # neptune
        ];
      }
    ];
  };
};
Lastly, again, a fairly standard block to define the subnet range to allocate to DHCP clients as well as the static reservations, including the planet themed bunch of newly acquired Lenovo laptops, jupiter through neptune.
Along with the above, you’ll probably want something like the following nearby to handle the rest of the PXE heavy lifting as well as the initial iPXE work:
environment = {
  etc = {
    "tftp/ipxe.efi".source = "${pkgs.ipxe}/ipxe.efi";
    "tftp/undionly.kpxe".source = "${pkgs.ipxe}/undionly.kpxe";
  };
  systemPackages = with pkgs; [
    ipxe
    tftp-hpa
    wol
  ];
};

networking.firewall.interfaces.enp116s0.allowedUDPPorts = [ 69 ];
systemd.services = {
  tftpd = {
    after = [ "nftables.service" ];
    description = "TFTP server";
    serviceConfig = {
      User = "root";
      Group = "root";
      Restart = "always";
      RestartSec = 5;
      Type = "exec";
      ExecStart = "${pkgs.tftp-hpa}/bin/in.tftpd -l -a 192.168.1.1:69 -P /run/tftpd.pid /etc/tftp";
      TimeoutStopSec = 20;
      PIDFile = "/run/tftpd.pid";
    };
    wantedBy = [ "multi-user.target" ];
  };
};
This adds a few useful packages along with creating a tftpd service using tftp-hpa’s in.tftpd. It also builds a TFTP root in /etc/tftp from which to serve requested files. In retrospect, I can probably point directly into the /nix/store in that ExecStart line, similar to what I do later with the netboot image. I chose tftp-hpa over the stock inbuilt netkittftp (used by the services.tftpd.enable option) as I’m more familiar with it, and I preferred running the service directly rather than through xinetd, which is how netkittftp is configured to work via that services.tftpd option.
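For what it’s worth, here’s a minimal sketch of that idea, assuming pkgs.linkFarm to assemble the TFTP root in the store (untested; the entries just mirror the environment.etc lines above, and it would replace the ExecStart already defined):

# Hypothetical alternative: build the TFTP root directly in the Nix store
# and hand that path to in.tftpd, instead of populating /etc/tftp.
systemd.services.tftpd.serviceConfig.ExecStart =
  let
    tftpRoot = pkgs.linkFarm "tftp-root" [
      { name = "ipxe.efi"; path = "${pkgs.ipxe}/ipxe.efi"; }
      { name = "undionly.kpxe"; path = "${pkgs.ipxe}/undionly.kpxe"; }
    ];
  in "${pkgs.tftp-hpa}/bin/in.tftpd -l -a 192.168.1.1:69 -P /run/tftpd.pid ${tftpRoot}";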
Now, I wasted a significant portion of a day around this point doing what I thought was going to be a really straightforward PXE boot setup. It turns out the machine I’ve named jupiter here happens to have a busted PXE client built into its Realtek NIC. I thought I was losing my mind, because it worked once, which, as I was to discover, is about all you ever get out of it, and then it rarely works again. After much frustration and some really manic deep diving into things like bpftrace to watch for file access and process spawning, I finally broke down and pulled out a different laptop to try which, as it happened, worked flawlessly every time just like the other two. So, if anyone has any suggestions as to why this might happen or how to fix it long term, I’d love to hear about it. I did notice the Realtek PXE client in the Lenovo BIOS mentions that it is beta! But I don’t see any available firmware updates for it through LVFS at least. Maybe I can update them under a running copy of Windows (already licensed for it, conveniently) with some Realtek executable?
Anyway, the short term workaround I subsequently discovered, after finally realizing the actual problem, was to use fwupdmgr to reinstall the latest Lenovo BIOS on jupiter. After a fresh application of the BIOS, the next PXE boot has always worked thus far. It may not work more than once after that, and often breaks again within the first few attempts. But at least I know how to mitigate the issue in a way that doesn’t seem to massively interrupt the rest of this workflow.
I’ve also included my firewall rule here for the tftpd server. I don’t think I’m specifically opening the DHCP ports themselves anywhere, as I’m also using the inbuilt nftables ruleset via networking.nftables.enable = true; which seems to cover that.
So now we’re responding to DHCP requests and providing PXE clients with the data pointing them to download the iPXE image. Once the inbuilt PXE client chainloads into the iPXE client provided, iPXE then performs its own DHCP exchange, where it is now given a URL to load instead.
iPXE and nginx
The URL in question needs to be served from some HTTP server. I’m already running nginx elsewhere on my internal network, so that’s where I’m hosting both the iPXE script that is loaded by each iPXE client and the netboot data itself, to actually boot into a remotely accessible NixOS installer environment. I won’t provide my entire nginx configuration here enabling all of the SSL stuff via Let’s Encrypt, but you can refer to my repo, where all of that probably still lives in this same file:
services.nginx = let
  sys = lib.nixosSystem {
    system = "x86_64-linux";
    modules = [
      ({ config, pkgs, lib, modulesPath, ... }: {
        imports = [
          (modulesPath + "/installer/netboot/netboot-minimal.nix")
          ../common/optional/services/nolid.nix
        ];
        config = {
          environment.systemPackages = with pkgs; [
            git
            rsync
          ];
          nix.settings.experimental-features = [ "nix-command" "flakes" ];
          services.openssh = {
            enable = true;
            openFirewall = true;
            settings = {
              PasswordAuthentication = false;
              KbdInteractiveAuthentication = false;
            };
          };
          users.users = {
            nixos.openssh.authorizedKeys.keys = [ (builtins.readFile ../common/users/nipsy/keys/id_arrakis.pub) ];
            root.openssh.authorizedKeys.keys = [ (builtins.readFile ../common/users/nipsy/keys/id_arrakis.pub) ];
          };
        };
      })
    ];
  };
  build = sys.config.system.build;
in {
Wait, what? Okay, so I started down this path initially by figuring out how to create a custom NixOS ISO image. And you can still find that logic in my repository along with a handy zsh alias (geniso) I created so I wouldn’t have to type the entire command.
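For reference, that ISO route boils down to something roughly like the following flake output (a hedged sketch with hypothetical naming, not my exact code; the real version and the geniso alias are in the repo):

# Hypothetical sketch of a custom installer ISO as a flake output:
nixosConfigurations.iso = nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";
  modules = [
    ({ modulesPath, ... }: {
      imports = [ (modulesPath + "/installer/cd-dvd/installation-cd-minimal.nix") ];
      # ...plus the same SSH keys and extra packages as the netboot image above
    })
  ];
};
# which something like geniso would then build via:
#   nix build .#nixosConfigurations.iso.config.system.build.isoImage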
However, why bother with that crap when I can just inject the custom built netboot artifacts directly into my nginx configuration itself?
That’s what is happening here. You’ll see all the usual options you’d use to configure a normal, running system, along with injecting my own personal SSH keys for both the root and nixos users into the resulting netboot image and installing some handy additional commands which could prove useful in the installer environment. But in this instance, we are doing all that work dynamically under a variable named build, which then has its resulting built artifacts referenced here, where the services.nginx block actually begins:
  appendHttpConfig = ''
    geo $geo {
      default 0;
      127.0.0.1 1;
      ::1 1;
      192.168.1.0/24 1;
    }
    map $scheme $req_ssl {
      default 1;
      http 0;
    }
    map "$geo$req_ssl" $force_enable_ssl {
      default 0;
      00 1;
    }
  '';
  enable = true;
  recommendedGzipSettings = true;
  recommendedOptimisation = true;
  #recommendedProxySettings = true;
  recommendedTlsSettings = true;
  sslCiphers = "AES256+EECDH:AES256+EDH:!aNULL";
  virtualHosts = {
    "arrakis.bitgnome.net" = {
      addSSL = true;
      enableACME = true;
      extraConfig = ''
        if ($force_enable_ssl) {
          return 301 https://$host$request_uri;
        }
      '';
      locations = {
        "= /boot/bzImage" = {
          alias = "${build.kernel}/bzImage";
        };
        "= /boot/initrd" = {
          alias = "${build.netbootRamdisk}/initrd";
        };
        "= /boot/netboot.ipxe" = {
          alias = "${build.netbootIpxeScript}/netboot.ipxe";
        };
        "/" = {
          tryFiles = "$uri $uri/ =404";
        };
      };
      root = "/var/www";
    };
  };
};
As mentioned above, this is where the references to the netboot artifacts get filled in with aliases pointing directly into /nix/store. And the especially great thing about all of this is that the netboot image is kept perpetually up to date, so these references should always be pointing at the latest version. Older versions will be cleaned up automatically by your next scheduled garbage collection once they’re no longer referenced. I discovered the syntax for these exact location definitions (using the “= /…” style for each location name) by cheating and looking at how the cgit module accomplished the same thing. It wasn’t until sometime after looking at that module’s code that I finally understood how the equal sign is being used to build these names.
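If that still seems opaque, here’s a quick illustration of how one of those attribute names ends up rendered (store path abbreviated, so only a sketch):

# The attribute name is pasted into the location directive, so this Nix:
locations."= /boot/bzImage".alias = "${build.kernel}/bzImage";
# ends up in nginx.conf as an exact-match location, roughly:
#   location = /boot/bzImage {
#     alias /nix/store/<hash>-linux-<version>/bzImage;
#   }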
And you don’t even need to learn how to write iPXE scripts, because again, the netboot build process generates one for us which we can then drop in directly as a reference in our nginx configuration. And anytime anything changes, it all gets updated automatically and nginx reloaded accordingly! Neat stuff.
Hopefully we finally have our custom NixOS installer up and running, and we should be able to log in directly as root.
Install
I wrote a shell script for the next part, which I’ve also dropped at the top of my repo under scripts/remote-install-with-disko. Here’s the basic command sequence from that script though:
# 192.168.1.11 is jupiter per the above static reservation
ssh root@192.168.1.11 nix run github:nix-community/disko/latest -- --mode disko --flake https://arrakis.bitgnome.net/nipsy/git/nix/snapshot/nix-master.tar#jupiter
ssh root@192.168.1.11 nixos-install --flake https://arrakis.bitgnome.net/nipsy/git/nix/snapshot/nix-master.tar#jupiter
ssh root@192.168.1.11 reboot
In reality, I’m also using split-horizon DNS with unbound on darkstar to provide full DNS resolution for these LAN based devices. You can find that all in the repo also. But we don’t really need that here.
Two commands (plus a reboot). That’s it. The first leverages the wonderful community created disko project to handle the formatting and mounting of all the drives as defined under hosts/jupiter/disks.nix. That same configuration then gets consumed and referenced during the subsequent nixos-install command to define all the file system mounts in /etc/fstab on the running system.
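To give a flavor of what that file contains, here’s a minimal, hypothetical sketch of a disko layout along the lines of hosts/jupiter/disks.nix (the real one is in the repo); a single GPT NVMe drive with an ESP and an ext4 root:

# A minimal, hypothetical disko layout; adjust device, sizes and filesystems to taste.
{
  disko.devices.disk.main = {
    device = "/dev/nvme0n1";
    type = "disk";
    content = {
      type = "gpt";
      partitions = {
        ESP = {
          size = "512M";
          type = "EF00";
          content = {
            type = "filesystem";
            format = "vfat";
            mountpoint = "/boot";
          };
        };
        root = {
          size = "100%";
          content = {
            type = "filesystem";
            format = "ext4";
            mountpoint = "/";
          };
        };
      };
    };
  };
}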
You of course need to define the system configuration for jupiter and all the rest in your flake.
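The shape of that is the usual flake fare, something like this (a hypothetical layout, not literally my repo’s structure, apart from the disks.nix path mentioned above):

# Hypothetical nixosConfigurations entry matching the #jupiter reference in the install commands:
nixosConfigurations.jupiter = nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";
  modules = [
    inputs.disko.nixosModules.disko
    ./hosts/jupiter/disks.nix
    ./hosts/jupiter/default.nix  # hypothetical path for the main host configuration
  ];
};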
Jupiter and Beyond the Infinite
It’s worth talking a little about the laptop NixOS configurations themselves. I’m not going to drop the entire configuration here for jupiter or any of the rest. You can go look at them easily enough in the repo, and your layout might be sufficiently different from mine that you can’t just drop mine in easily.
But some of the more important pieces include the lid handling (since these are laptops) and the Wake-on-LAN functionality. The lid handling was easy, and the netboot image also includes this bit so as to avoid any nasty surprises, by virtue of the fact that all four laptops are stacked on top of one another with their lids closed:
services.logind = {
  lidSwitch = "ignore";
  lidSwitchDocked = "ignore";
  lidSwitchExternalPower = "ignore";
};
The Wake-on-LAN was even simpler:
networking.interfaces.enp2s0f0.wakeOnLan.enable = true;
Now, this does have a corresponding option in the BIOS, which I have enabled for when the AC adapter is connected. I also enabled the BIOS option to power on automatically whenever AC power is restored.
And since the BIOS came up, it’s also worth mentioning the boot order. I decided to keep only two boot entries active: the first NVMe drive, followed by the Realtek IPv4 PXE client. There’s also an option to define which boot entry to use when booting via Wake-on-LAN, and I set that to the first NVMe drive as well. The thinking here being that all I need to do to wipe and reinstall a machine (by forcing it to PXE boot next time) is:
umount /boot && mkfs.vfat /dev/nvme0n1p1 && reboot
which wipes my EFI boot partition and reboots, forcing the PXE client to boot when the first NVMe option fails to boot correctly. I’ve tested this and it works brilliantly.
And while we’re talking about WoL, you might have already noticed the wol package installed on darkstar earlier. Once the laptops are configured for it in BIOS and you have a working OS on them to set the NIC into the correct mode (as done above), you can run this from the router (or whatever other LAN attached device you want):
wol -vi 192.168.1.255 8c:8c:aa:4e:e9:8c
to wake up jupiter for instance from a power off state.
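If typing MAC addresses ever gets old, something like the following could go into darkstar’s configuration instead (a hypothetical wake helper, not something in my repo; the MACs mirror the kea reservations above):

# Hypothetical wrapper so waking a node is just "wake jupiter" instead of remembering MACs.
environment.systemPackages = [
  (pkgs.writeShellScriptBin "wake" ''
    case "$1" in
      jupiter) mac=8c:8c:aa:4e:e9:8c ;;
      saturn)  mac=38:f3:ab:59:06:e0 ;;
      uranus)  mac=8c:8c:aa:4e:fc:aa ;;
      neptune) mac=38:f3:ab:59:08:10 ;;
      *) echo "unknown host: $1" >&2; exit 1 ;;
    esac
    exec ${pkgs.wol}/bin/wol -vi 192.168.1.255 "$mac"
  '')
];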
Where?
Where to next? I’m definitely going to configure VXLAN on top of these once I get some patch cords for all four laptops to connect them up to the switch sitting right next to them. I’d also like to see how terrible something like OpenStack might be to get up and running in a declarative manner. I’ll probably end up throwing in some extra storage somewhere along the way so I can play with Ceph a bit too. If these machines end up being too limiting, it looks like the Lenovo ThinkCentre M75q’s are available in this same general price range and specification and also include a 2.5” bay for even larger storage options.