Talos/K8S in a Homelab

I am going to be approaching this from a bit of a different perspective. I already know how to work with k8s (operators, kubelets, etc.), and I understand Linux systems very well, so configuring them is not an issue. That makes Talos an interesting choice of thing to deploy, because I did not pick it for lack of skill. Rather, I did it to see how immutable operating systems geared towards K8S work, their challenges, their strengths...and to save me a small amount of time on the initial setup.

I effectively chose to install Talos on a few VMs in my own homelab because I saw it as an opportunity to move some stateless workloads from VMs/containers to something that was distributed and would have less downtime.

What prompted this

For the curious you can see what inspired the entire homelab rebuild. The short version is that the resource nodes we had were very poorly provisioned and I wanted to deploy more services on the network. The problem is...well, one of them was TrueNAS Scale, so there was no proper provisioning support, and the general process of finding a machine with free capacity was getting very tedious.

A few notes on K8S

I personally believe that most deployments of k8s are insanely overengineered and very brittle due to the expectation that workloads be able to be stopped and restarted at will. Sometimes a bare metal install, VM, or even just the docker container itself is actually the best solution.

That said...I have seen beautiful and brilliantly engineered solutions using K8S as part of a distributed system. The tooling for deploying new workloads is very mature and very powerful when it comes to determining which nodes you want to deploy on and how many resources you want to allocate. The problem you run into when you try to convert things to run in k8s is that you have to be extremely mindful of application state, because things can and will restart during upgrades. A node will fail, causing all of its workloads to restart/migrate.

Like everything, it is a tool with a specific place and use case. From my perspective, if you are running a few stateless workloads with no I/O or limited I/O, then you might as well make use of it. The same goes if you are willing to deploy the correct infra to support more I/O heavy workloads.

For my situation...my partner and I have a collection of consumer hardware from old PCs, eBay, and my partner's old machines. So we can't get away with just sticking everything on one machine, and deploying virtual machines for everything across the network does not make sense.

A few notes on Talos

After running Talos for a while, I can say a few things, both positive and negative.

Documentation exists but has gaps, so it very much expects you to understand k8s very well. The system is insanely opinionated, and there are gaps when it comes to modifying the installation once the system is installed. It is effectively a k8s application, so the minute you need to modify it at a low enough level to do things that are not supported, you really can't.

Performance is excellent due to the stripped down installation and rewritten system components. Installing and reinstalling new nodes is a breeze even without things like PXE booting on the network. If all you need is a basic k8s system it is perfect.

The high level deployment today

We are running Talos with a 3 node control plane (VMs) and 3 worker nodes (2 VMs, 1 bare metal). This is enough for us to run upgrades without taking out network services if replicas are set...or at a minimum having low downtime if they are not.
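The reason a three node control plane can ride through a rolling upgrade comes down to etcd's majority rule, which is quick to check. A minimal sketch in Python, assuming the usual rule that a strict majority of members must be up; the function names are purely illustrative:

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be offline while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 5):
        print(
            f"{size} control plane nodes: quorum {quorum(size)}, "
            f"can lose {tolerated_failures(size)}"
        )
```

With three members one node can be down at a time, which is exactly what an upgrade that reboots nodes one by one needs. Two members tolerate no failures at all, so going from one node to two buys nothing.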

The storage issue

Talos treats nodes with large disks as an antipattern, and because it does not really give you easy access into the filesystem, a backup of data on something like Longhorn would be needlessly complicated. While I understand Longhorn is supported and documented, I would rather not go against the design pattern of the system in use.

I really wanted to deploy iSCSI onto the network so that I would have block storage for my I/O intensive workloads. Configuring this in TrueNAS Scale was very simple and Talos does provide an extension for iSCSI, but it does not really have a clean way to add this to existing nodes (as of this writing). So if you want iSCSI on the network you effectively need to deploy it from the start or redeploy. Candidly speaking, Jellyfin and Kiwix would be infinitely better in a deployment with iSCSI because they are fairly I/O heavy.

Other storage options like an SMB share are just a bad idea for a list of reasons, mainly that the lack of file attributes causes issues.

Inevitably the best solution to the problem was NFS for my home network, but this meant I had to be extremely mindful of what I deployed. And for the record, I forced NFSv4, which is a much more mature implementation with respect to file locking. From my perspective, so long as the application did not have an SQLite database, it was probably going to be fine.
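For anyone wanting to pin the protocol version from the Kubernetes side, one option is the mountOptions field on a PersistentVolume. This is a minimal sketch rather than the manifest used here: the volume name, server name, export path, and size are all placeholders, and nfsvers=4 is just one way to request NFSv4.

```python
import json


def nfs_persistent_volume(name: str, server: str, path: str, size: str) -> dict:
    """Build a PersistentVolume manifest that asks for an NFSv4 mount."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            # NFS allows many pods on many nodes to mount the same export.
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            # Handed to the NFS mount on the node; requests protocol version 4.
            "mountOptions": ["nfsvers=4"],
            "nfs": {"server": server, "path": path},
        },
    }


if __name__ == "__main__":
    # Placeholder values: substitute the real NAS address and export path.
    manifest = nfs_persistent_volume(
        name="example-nfs",
        server="nas.example.internal",
        path="/mnt/tank/example",
        size="50Gi",
    )
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output can be piped straight into kubectl apply -f - once the placeholders are replaced.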

Security concerns

When deploying NFS, what I did was whitelist based upon the IPs of the K8S worker nodes. In theory this is great, but in practice it introduces some concerns: if somehow you managed to compromise a container and escape it, you could make a jump to the NFS share. There are several ways to handle this...either accept it as a known security limitation or address it with more complex configuration. I went with the former.

The K8S cluster is also the home for workloads that are expected to be exposed/forwarded to the internet through a reverse proxy, so that cluster is in a lower trust network. Even if you did compromise the cluster, you would not be able to get into most of the network, as it is fairly isolated.

As a result, extremely sensitive data really can't be deployed onto the cluster via NFS. That is completely fine, as we have Proxmox systems, orchestration for them, and solid tooling/caching for these systems...but it is something that has to be considered.
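To make the limitation concrete: an IP allowlist only ever sees the source address of the mount request, and NFS volumes are mounted by the node on behalf of its pods. A rough sketch of that kind of check, using made-up worker addresses (the real export rules live in the NAS configuration, not in Python):

```python
from ipaddress import ip_address

# Hypothetical worker node addresses, for illustration only.
ALLOWED_CLIENTS = frozenset(
    ip_address(addr) for addr in ("10.0.20.11", "10.0.20.12", "10.0.20.13")
)


def export_allows(client: str) -> bool:
    """Return True when the client's source address is on the allowlist."""
    return ip_address(client) in ALLOWED_CLIENTS


if __name__ == "__main__":
    # A worker node mounting a share for one of its pods.
    print(export_allows("10.0.20.11"))
    # Some other machine on the same network.
    print(export_allows("10.0.20.50"))
```

Anything that manages to run on a worker node, including a process that has escaped its container, presents the same source address as a legitimate mount and passes this check. The allowlist keeps other machines out, but it cannot tell one workload on a node from another.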

Migrating workloads never designed for K8S

To likely no one's surprise, most homelab services are not designed to run in K8S. And a lot of my older software was never designed for it.

Some of these were easy to migrate, for example an ancient wallpaper service I wrote years ago that returns a random wallpaper to a client machine. I had a folder of around 1000 images that I just did not want to have locally on every machine...so I wrote a program for it. Also yes, it's in JS...this was written years ago for a single user (me). While I would like to rewrite it one day, it gets around a lot of performance issues by storing information about the images on the file system. Which also...really complicated things. I could have added NFS mounts for this, but ultimately determined that just storing the computed image information (generated as part of a docker build) and a small number of images in the container image (I have a self-hosted internal docker repo) was easy enough.

Another silly service I wrote years ago keeps track of items in a storage system in Minecraft. Yeah, I'm not joking: I wrote a small amount of Lua for the ComputerCraft mod that just uses the http module to send whatever can be read in game. This service is completely stateless because it just blindly receives whatever inventory is presented by the system at the time. This one was as easy as just giving it a docker container and deploying that.

Migrating Gitea

I have used a self-hosted Gitea instance for years. It was originally deployed back when I was writing algorithms to analyze stock market trends. Let's just say the data was useful. It later grew to include my partner's and my code for the network, so things like provisioning and upgrades. As part of an AI project, my partner had also made an AI agent that was using memfs, so all of a sudden a failure in Gitea meant a failure of the memory filesystem for an AI agent that we use and that is sitting in our Discord...yeah, ok, that was not ideal.

Something you have to know about Gitea...it can use next to no resources or it can use an insane amount of resources, depending largely upon the repos hosted. I wanted to start the process of keeping local mirrors for things like QEMU, the Linux kernel, K8S, Mesa...things I have had to check the source code of over the years. This was partly in the wake of GitHub having reliability issues; some claims online at the time reported availability below 90%. These repos are also massive, so cloning them or viewing them online is suboptimal at best...to fully work with them you really need local copies.

This is not just theoretical btw. I have had to read through the Linux kernel code to narrow down changes to virtual NIC drivers, USB issues (including one issue where a capture card crashed the USB driver), VR headset support (the unofficial Bigscreen Beyond VR patch was out of tree at the time), and sound system issues.

On QEMU I've had to dig into why a system deadlocked entirely or why USB passthrough is so suboptimal (advice: don't do it...just pass in a PCIe USB card and call it a day). Or my favorite: when flaky hardware passthrough just fails to work like you expect.

K8s at one point had a bug known as ghost pods, where resources would remain as pending after they had been deleted. The issue was more common in large-scale clusters, and this was something we did see at NCR during my time there. We ended up writing code to work around it, but it did require my colleague and me to dig into the source code.

And I run a custom Mesa build on my desktop because it's more performant. That's just it...no other reasons. I build the package as a modification of the upstream Arch packaging.

So due to NFS we couldn't really use the SQLite database, and because I wanted local mirrors of the above projects I really needed a dedicated MySQL database, which fortunately was dead simple to deploy on NixOS. Of course this application was never designed for K8S, so if you kill it while it is running the error recovery is not graceful (things like corrupt repos when creating new mirrors). As a result you really can't use a readiness or liveness probe here. And because the workloads can spike to require multiple full CPU cores and multiple gigs of memory (e.g. for a repo the size of the Linux kernel), you really need a maximum resource allowance that is many times higher than the minimum, with the minimum sitting closer to typical use.
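In Kubernetes terms that means a small request, a much larger limit, and no probes. A minimal sketch of what such a container spec could look like, built as a Python dict; the numbers and the image tag are illustrative assumptions, not the values from this deployment:

```python
import json


def bursty_container(name: str, image: str) -> dict:
    """Container spec with a low request, a high limit, and no probes."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # What the scheduler reserves: close to idle usage (assumed values).
            "requests": {"cpu": "100m", "memory": "256Mi"},
            # Ceiling for spikes such as mirroring a very large repo (assumed values).
            "limits": {"cpu": "4", "memory": "8Gi"},
        },
        # livenessProbe and readinessProbe are omitted on purpose: with no
        # liveness probe there are no probe-triggered restarts mid-operation.
    }


if __name__ == "__main__":
    print(json.dumps(bursty_container("gitea", "gitea/gitea:latest"), indent=2))
```

The trade-off is that the scheduler only reserves the request, so a node can end up overcommitted if several bursty workloads spike at once.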

Deploying Nextcloud

My partner and I keep all of our shared documents and things like collaborative editing in Nextcloud. It works insanely well for us for that purpose because it allows local collaborative document editing and allows us to keep financial data understandably private.

While Nextcloud was not designed for this, it works completely fine as long as you do not add readiness and liveness probes. I ended up using the linuxserver.io project. In the case of Nextcloud this works by populating a storage volume with the contents of a sidecar container. Assuming the storage volume has been fully written, this is perfect, but if the process is interrupted...well, then the container has no way to know and the entire install fails.

This is also why we're using NFS: file attributes are properly supported, and there is minimal file locking at play that would impact performance.

For the database we're making use of the same MySQL instance across the network for all services (Gitea, Piwigo, Nextcloud).

Fortunately Collabora, for collaborative document editing, was relatively easy to deploy and very well documented for running in K8S, so adapting their minikube example to a full k8s setup was straightforward.

Future plans

I will likely retire Talos one of these days. I find the issues with getting into the OS to be a very significant problem, and the value I get from it being an appliance to be fairly low. I would like to have Longhorn in place on some nodes, as this would move some application workloads away from being a single point of failure because the filesystem would be distributed.

Granted...I also say I am going to retire my TrueNAS Scale box one of these days, but it is still running. It's largely a...do I have the time/reason to move to something different. Right now what is in place works and the benefits of an appliance seem to be outweighing the downsides of rolling my own.

After proper block storage is on the network I will likely migrate Kiwix to block storage/k8s, as well as Jellyfin.

I may also retire some older VMs from Proxmox and move them to KubeVirt, but that will have to wait until I have better networking, as much of the network caps out at 2.5Gb. I will likely revisit this once I have 10Gb networking.
