An experiment in rebuilding a 6-year-old homelab.
It was beautiful until it wasn't
Organic network growth is beautiful in its chaos and in how much you can build and deploy. It's also wonderful while you are learning and experimenting, but it becomes a nightmare the moment someone new is brought in. And if the network survives long enough, you learn to regret and pay for your mistakes: random things ranging from a poorly allocated VM, to an undocumented configuration, to a small utility you wrote 3 years ago with a glass of wine, have zero idea how it works, and yet it somehow continues to survive.
My architecture was simple...2 x86 nodes internally, 3 ARM nodes (running automation and network monitoring), and one VPS hosted in a datacenter for outbound data requests like my photo library.
Many homelabbers rebuild their network every few years because they realize they made some mistake, or hardware failed and they can't expand anymore, or they just get bored. That never happened here...from day 1 the services on the network were in active use for things like CI/CD, backups, video, and password management. I was also fairly busy, so I never dedicated the time to a rebuild because I was using the services.
About 3 years ago I met a beautiful woman, and a year later I had moved cross country to be with her. She had been living in a hackerhouse and working with networking gear, acquiring random pieces of hardware off of eBay or through trades, and when we moved in together she brought with her a set of her own services and hardware with, of course, completely different design considerations. Over the next 2 years she would gradually start to use more of the services and integrate or build her own. It was a rough-around-the-edges but functional network, and with me busy at a small company writing hypervisor and k8s components, it served as a great test bed.
The move was chaotic and cross country, but that is probably grounds for a separate blog post outside of tech. The short version is we merged our hardware and went from 2 virtualization nodes to 5. I also was not able to dedicate time to correcting the issues in the network. The result was a merge that looked impressive on paper for how few changes it needed...a few (20-30) virtual machines for some of our services, a couple of changed paths to update things, a few new static IPs, a few new network switches, and an overall working system.
Over the next 2 years we continued to build onto the network with new services rather than fix the existing foundation, mostly because of time. In the end the situation deteriorated to 6 virtualization nodes with outdated IP maps and documentation: a functional system, but one that was breaking under its own weight any time it needed to be worked on.
Where things started to break
Security was broken by design
Security was never really a design consideration on the network. You have to understand this was a personal network started between 2019 and 2020. It was created for my own use and testing, given a few tools to keep it up to date, and then left alone.
This meant some not-great security decisions: no authentication on a few services, SSH using passwords rather than keys, shortcuts taken to get things up quickly, no real safeguards against supply chain attacks, and no network isolation because I did not have the hardware. Effectively the only security barriers were the physical network itself and the virtualization in place, which made sense for a single user who lived alone and historically did not have a lot of people over.
Eventually my situation changed. After my transition (I am a trans woman) I ended up spending more time with people and having friends, then partners, over. I also felt more outgoing, so I started running small game servers for close friends. When I finally moved in with my partner this became even more common, and with her being technical it now meant that any mistake made by either of us could compromise our network.
Around this same time supply chain attacks continued to become more common, and eventually Mythos was announced, which highlighted how sophisticated automated attacks were becoming. The realization that a single attack vector onto the network could provide an opportunity for an exploit chain all of a sudden became much more pressing.
You would think that with an offensive security background I would know better, but something something the cobbler's kids have no shoes. Yeah, I did not practice on my home network what I preached in my professional life, because I thought that single external router was enough to keep people from getting in.
Inefficiency was the norm
The resource allocations were also terribly inefficient as an artifact of the system being based almost entirely on VM workflows. This wasn't a big deal at the time, as we had neither the desire nor the plans to deploy a massive number of new services.
What we didn't realize is that we were redlining several nodes on resources, and that some computer hardware would eventually become much more expensive (e.g. RAM for our servers). That forced us to decide whether we were willing to spend the hundreds of dollars needed for the hardware or whether we were going to agree a rebuild would be necessary someday.
Hardware was unreliable, inconsistent, and heterogeneous
Partly for the challenge and partly due to financial constraints, the network was built with a mix of old desktop parts, eBay parts, traded parts, cheap mini PCs, and small business hardware. From a reduce-and-recycle perspective, as well as one of being financially mindful, this made a lot of sense, but you cannot deploy hardware with the expectation that it is 100% reliable at scale or when it is based on older, less stable parts.
Hardware also just does not work consistently on a good day in the enterprise space, and it is likely to be even less consistent on home systems that lack ECC or are not designed for features like PCIe passthrough. The basic workflows work fine, but if you are doing things like writing to the EFI memory on a motherboard it's not likely to be super reliable or dependable. The old architecture with no documentation meant that monitoring the cluster for failures caused by things like this was going to be difficult to impossible.
The hardware being heterogeneous also does not lend itself to making features like live migration work cleanly. If your VM requires a GPU and you have that exact GPU on only one machine, you can't really migrate it, can you? Or if you have a caching VM that is storing 1TB and you have at most a 1 Gbps uplink, you can't really just use block storage for that. So effectively you end up with some VMs that are locked to nodes. The solution to that, of course, is...implement live migrations on what you can and move things to containers where possible.
Uptime was not a consideration
It's easy to not care about downtime when it's just you. But it's very difficult to deal with it the moment someone else is involved.
The homelab was never designed with high availability or high uptime in mind either. For example, updating the NAS would take out DNS for 8 minutes while the system rebooted and restarted all the VMs. Meanwhile, automated upgrades were effectively a 1-2 hour partial outage once a week in the early hours of the morning, even when it wasn't necessary. When it's just you using a service that's fine, but the minute you have things like tunnels into the network for externally accessible services it is no longer acceptable.
A lack of high availability meant that I could not use the homelab for showing projects or exposing things to the internet...and let's be honest here cloud computing bills are expensive.
Documentation was either missing or outdated
I did have the foresight to keep all my deployments as infrastructure as code, but I never bothered to document every service I had on the network, mainly because of time. This made it impossible for anyone (for example my partner) to work on or integrate with it without effectively opening up the console on some random nodes and guessing until they found what they were looking for. It also meant observability was impossible.
The breaking point
My partner eventually wanted to deploy Letta (similar to Open Claw, for those familiar), which is an autonomous AI agent. She wanted to expose it to the internet through things like chats, which effectively meant putting a system with shell access on our network. That carries massive security risks if the network is not properly isolated and segmented, which ours was not.
I was also getting to the point of being recovered from a medical procedure and was starting to put my website and services back up as part of my job search. I wanted to make use of tunnels into the network to offload some of the more expensive compute from the web host I had been using. However, I realized very quickly that the network was not designed for that from a security or reliability perspective.
When we returned home from that medical procedure (I was gone for about a month) we were also greeted with 3 hardware failures, which with proper monitoring we should have been able to catch before we left (or fix very quickly). A cable coming from the back of one machine had degraded to the point that it could not properly transfer UDP packets over the network, effectively rendering me unable to stream application workloads from it (the application on this node was Athena...a custom VDI implementation for streaming low-latency applications like games). A network switch overheated, which cut our network in half. And a firewall had an NVMe drive fail, which cost us hours as we wondered why we had no backups or documentation for it. My partner and I lost days of time to this firestorm of a failure.
While the network had served me well for all these years, it was time for a significant rebuild.
The plan
At a certain point you have to accept that the only option is to fix the mistakes you have made and to stop putting on band-aids.
- Segment out the network so we are not exposing things like SSH to all machines on the network.
- Lock down all virtual machines currently deployed.
- Condense VMs where it makes sense from a security and dependency perspective.
- Move applications we can to a distributed compute platform that will allow us to spread our workloads across multiple machines for things like maintenance and failure.
- Document the network fully so we can figure out what is going on (fairly self-explanatory: just writing down everything done over the years).
Isolating the network
The common approach here is to deploy VLANs, and this works...and is what I would recommend for most cases. That's not what we did here. For context, just one of our virtualization nodes has enough network traffic that it currently uses 6 GB of memory just to process routing correctly. Another uses 2.5-3 GB. And our routing table was large enough prior to breaking out the network that it could take multiple minutes to go through our router admin page (an old Netgear router).
I also wanted hardware isolation in the event a compromise was ever found in our routers. We run so much external software, and have plans for more integrations with services on the internet, that we needed to be able to ensure a breach of one network and router did not impact the entire network.
Considering the network overhead we had and the general processing needed for network traffic, a single dedicated box did not feel like a good financial decision, so we instead opted to purchase 4 mini PCs off of Amazon (3 with 1 Gbps links and 1 with 2.5 Gbps links). This also gave us the hardware isolation I wanted but did not explicitly require.
From this point it was just setting new static IPs on all the respective virtual and physical devices behind each router then forwarding these out onto respective ports. Services where this did not make sense such as DNS or NTP were migrated to a virtualization node that we treat as being in a DMZ.
Afterwards it was as simple as updating our dnsmasq entries with the IP of a newly deployed reverse proxy and then setting the proxy to forward to each respective firewall. From the outside looking in, the whole thing looks like one cohesive network.
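As a rough sketch of what the dnsmasq side of this can look like when the DNS host runs NixOS: the domain `home.example` and the proxy address `192.168.1.10` below are made up for illustration and are not the values from our network, and it assumes a NixOS release recent enough to have `services.dnsmasq.settings`.

```nix
{ ... }:
{
  services.dnsmasq = {
    enable = true;
    settings = {
      # Hypothetical domain and IP. dnsmasq's address=/domain/ip form answers
      # for the domain and all of its subdomains with the reverse proxy's IP,
      # so clients never need to know which firewall a service sits behind.
      address = [ "/home.example/192.168.1.10" ];
    };
  };
}
```

The reverse proxy is then the only place that has to know which firewall and port each service actually lives behind.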
Locking down the systems
The bulk of this was adding SSH keys and deploying SSH jump servers for each firewall. I ended up having to update the code for handling server upgrades as a result of this, but the additional security was worth it.
Word to the wise...set SSH keys and disable password login...it's a few minutes of time and worth it just for the convenience of not having to key in a password.
Oh and where possible deploy an immutable OS with SSH disabled, but I'll get into that later.
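For the hosts that run NixOS, key-only SSH can be sketched roughly like this. The user name `admin` and the public key are placeholders, not real values, and the `settings` form assumes a recent NixOS release (older releases used differently named options).

```nix
{ ... }:
{
  services.openssh = {
    enable = true;
    settings = {
      # Keys only: no password or keyboard-interactive logins.
      PasswordAuthentication = false;
      KbdInteractiveAuthentication = false;
      # No direct root logins; use a normal user plus sudo instead.
      PermitRootLogin = "no";
    };
  };

  # Hypothetical admin user; replace the key with your own public key.
  users.users.admin = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAA... admin@workstation"
    ];
  };
}
```

On a non-NixOS machine the same idea is the `PasswordAuthentication no` line in `sshd_config` plus an `authorized_keys` file.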
Condense VMs
This could be an entire blog entry, so here's the short version. Where possible, high-trust services like DNS, NTP, databases, NFS, Docker repos, and SMB were consolidated onto NixOS. This turned out to be surprisingly easy.
For example, for MariaDB:
services.mysql = {
  enable = true;
  package = pkgs.mariadb;
  settings = {
    mysqld = {
      bind-address = "0.0.0.0";
    };
  };
};
Or to enable the NFS shares:
services.nfs.server = {
enable = true;
exports = ''
/mnt/nfs/nextcloud-config 192.168.1.21(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-data 192.168.1.21(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-config 192.168.1.19(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-data 192.168.1.19(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-config 192.168.1.29(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-data 192.168.1.29(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-config 192.168.1.24(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/nextcloud-data 192.168.1.24(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/gitea 192.168.1.21(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/gitea 192.168.1.19(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/gitea 192.168.1.29(rw,sync,no_subtree_check,no_root_squash)
/mnt/nfs/gitea 192.168.1.24(rw,sync,no_subtree_check,no_root_squash)
'';
};
Effectively, because it's a single configuration file, I felt a lot more comfortable having this as a single VM. Configuring all the respective files on something like Ubuntu Server, while doable, is quite time-consuming, and you also have a wonderful bit of cleanup to address when adding or removing things because it is non-atomic. Here, if I need to make a change, it's as simple as just removing the entries from the file.
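One thing a single Nix file makes possible is generating the repetitive export list instead of typing it out. This is only a sketch of that idea, not what is deployed: it reuses the shares, client IPs, and options from the export list above, but the helper names (`clients`, `shares`, `opts`, `exportLine`) are my own, and it writes one line per share with all clients on it, which the exports format allows.

```nix
{ lib, ... }:
let
  # Client IPs and share paths copied from the hand-written export list.
  clients = [ "192.168.1.21" "192.168.1.19" "192.168.1.29" "192.168.1.24" ];
  shares = [
    "/mnt/nfs/nextcloud-config"
    "/mnt/nfs/nextcloud-data"
    "/mnt/nfs/gitea"
  ];
  opts = "rw,sync,no_subtree_check,no_root_squash";

  # Builds "<share> <client1>(opts) <client2>(opts) ..." for one share.
  exportLine = share:
    "${share} " + lib.concatMapStringsSep " " (c: "${c}(${opts})") clients;
in
{
  services.nfs.server = {
    enable = true;
    # One export line per share, newline-terminated.
    exports = lib.concatMapStringsSep "\n" exportLine shares + "\n";
  };
}
```

Adding a client or a share then becomes a one-line change to the relevant list rather than several new export lines.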
That said, while I overall had a great experience deploying NixOS to migrate a few VMs away from Ubuntu Server, there were some gotchas that are going to make it difficult to replace all of the old VMs. In particular, simple configuration is easy...where things get more complicated is when you need to load from multiple channels or make modifications to things like systemd services.
I think the best way to phrase it is this: if there was an option in stable NixOS with the configuration parameters I needed, then it was a no-brainer to deploy. But the moment I had to write my own, I was deploying something else because of the time and maintenance overhead.
Distributed compute
Ok, so candidly very few homelabs have a use case for k8s. In fact, for most it is a complete waste of time outside of education. However...my application development style leans on stateless architecture because I have worked on distributed systems in the past, so this is useful when I want to deploy personal projects without having to find a machine to put them on.
It also gets me nice-to-have features like automated recovery (restart the service on another machine) and lets me avoid having to set up or find resources on a machine for something like a Docker container.
This is done with k8s running on Talos with a 6-node cluster: 3 control plane nodes and 3 worker nodes. Preferred storage systems are NFS and SMB. Databases are deployed on VMs outside of the cluster because, while I can run a database on k8s, I would rather not deal with that complexity.
A strange catch is that Talos requires all machines to be on the same network, so I ended up passing PCIe cards on the worker nodes through into VMs so that these worker VMs could in turn be on a different network than their hypervisor hosts.
That said, a container is not a solution for everything, so I still have a few VMs that are programmatically orchestrated for things like Linux mirror caches.
Where are we at today
Today there's a very nice 5-node cluster running Proxmox and a NAS running TrueNAS Scale. There are 2 nodes running NixOS: one in the DMZ for infrastructure tasks and the other behind an internal router for data-related tasks. There is a k8s cluster capable of being upgraded without downtime, and a highly secure network for any tests or tasks where I may decide to deploy tunnels into the network. Downtime on the VMs should be about 60 seconds per week at this rate, which is fine for me at this point in time. In SLA terms this is 99.99% uptime, assuming none of the hardware dies and the network connection is not cut at the ISP level, which is fine by me for a home cluster environment.
And above all, there is plenty of compute for my own personal projects, and it is reliable enough for my partner's and my needs. It is also secure enough that I feel comfortable poking holes into the network on a random night to play games with friends, or having an AI go haywire across the network.
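For anyone checking the uptime figure, the arithmetic works out roughly as follows, assuming the 60 seconds per week is the only downtime and using a 7-day week of 604,800 seconds:

```latex
\[
\text{uptime}
  = 1 - \frac{60\ \text{s}}{7 \times 24 \times 3600\ \text{s}}
  = 1 - \frac{60}{604800}
  \approx 0.9999
  = 99.99\%
\]
```

Put the other way around, a 99.99% target allows about 60.5 seconds of downtime per week, so the budget is almost exactly used up by the weekly maintenance window.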