Network Shenanigans
One of the downsides of having a large number of virtual machines, containers, and multiple network routers is that you also need to understand networking pretty well. Not just layer one, but your infrastructure and your actual code too. Or perhaps you view this as an enjoyable experience (in which case you are in the minority and a lot like me, thinking this kind of thing is fun).
Background
I make very heavy use of local application streaming through a custom application I have written called Athena. Under the hood, remote system support uses the Sunshine/Moonlight projects, plus things like RDP for legacy systems that I want remote control on (I do support VNC, but it's not exactly the most performant, so I try to avoid it).
Most people are just going to stream at 20-50 Mbps, and candidly speaking you will get very solid performance at that rate. However...I like to stream fairly high fidelity games like Cyberpunk 2077 at 1440p and 120Hz. Uncompressed over the cable this is 11.59 Gbit/s (3.86 Gbit/s compressed with DSC), by the way. So if you stream this video over the wire with HEVC and very light compression you may end up sending 350-500 Mbps over the network. As a tangent, I would like to dabble more with AV1, but it requires a hardware encoder on the sender and a decoder on the receiver, and it's overall a less tested path...long term I can see a migration to it making sense. I worked on a project with video streaming as a core component many years ago...while I may have been primarily a web developer at the time, I was aware of a lot of the design considerations, so this wasn't completely unfamiliar.
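For anyone who wants to check the cable figures quoted above, here is a rough back-of-the-envelope sketch. It assumes 8-bit RGB (24 bits per pixel), a 3:1 DSC ratio, and reduced-blanking totals of 2640x1525 for a 2560x1440 image. Those totals are my assumption for illustration, not something read off the actual display.

```python
# Rough bandwidth arithmetic for a 2560x1440 @ 120 Hz, 8-bit RGB signal.
# The blanking totals below are an assumption (reduced-blanking style timing).

ACTIVE_W, ACTIVE_H = 2560, 1440
TOTAL_W, TOTAL_H = 2640, 1525   # assumed totals including blanking
REFRESH_HZ = 120
BITS_PER_PIXEL = 24             # 8 bits per channel, RGB


def gbit_per_s(width, height, refresh_hz=REFRESH_HZ, bpp=BITS_PER_PIXEL):
    """Raw video data rate in gigabits per second."""
    return width * height * refresh_hz * bpp / 1e9


active = gbit_per_s(ACTIVE_W, ACTIVE_H)   # visible pixels only
on_cable = gbit_per_s(TOTAL_W, TOTAL_H)   # visible pixels plus blanking
with_dsc = on_cable / 3                   # assuming a 3:1 DSC ratio

print(f"active pixels only: {active:.2f} Gbit/s")
print(f"including blanking: {on_cable:.2f} Gbit/s")
print(f"with 3:1 DSC:       {with_dsc:.2f} Gbit/s")
```

Under those assumptions the visible pixels alone come to about 10.62 Gbit/s, and the blanking overhead accounts for the rest of the 11.59 figure.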
Then you have the story of our network isolation and internal routers, done largely for security and to reduce the strain on the main router (which was a Netgear prosumer router at the time...it was not designed to have 30 virtual machines and a K8s cluster on it alongside heavy application streams). While this does reduce strain, it also adds complexity, and complexity can equal instability.
The main streaming client for Athena is my thin client, which is a virtual machine with a passed-through GPU for decoding application streams and hardware acceleration, plus a PCIe card for USB devices. The host system for this virtual machine sits behind a separate router for security reasons, as it also hosts our NFS storage and several network cron tasks.
The problem
Following the homelab rebuild, when I launched a program through Athena I would randomly see a crash. Athena manages a separate thread that kicks off the Moonlight client for the stream, and this client was the component that crashed. Sometimes I could get an hour...sometimes 10-15 minutes. To make the situation worse, as of this writing Athena does not pause the remote application when a crash occurs (though I will probably add this at some point).
Random disconnections are not fun, but if rare enough I was willing to ignore them. Athena does have some measures to allow quick reconnections and keeps track of the remote machine state, so if the client does crash it is not that big of a deal. Unfortunately...5-6 crashes in one night is not a usable system.
The investigation/analysis/solution
Investigating the crash, I observed that the ping to the remote would vary from 0.3-0.5ms under normal operation and spike as high as 5000ms, while my partner would observe no issues on her system. The spikes also seemed to line up at times with syncs/changes from the cluster on the Git and Nextcloud instances (for context, I keep internal mirrors of several open source projects like QEMU, the Linux kernel, and K8s because I have to read their code for additions, debugging, or fixes). And randomly I would completely lose remote connectivity with the box in the rack.
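If you want to catch this kind of spike without staring at a terminal, something like the following sketch works on saved ping output. It assumes the Linux ping reply format with a `time=... ms` field. The sample text, the address in it, and the 5ms threshold are made up for illustration and are not a capture from my network.

```python
import re

# Matches the "time=0.345 ms" field that Linux ping prints for each reply.
RTT_PATTERN = re.compile(r"time=([\d.]+) ms")


def parse_rtts(ping_output):
    """Extract round-trip times in milliseconds from ping output text."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def find_spikes(rtts_ms, threshold_ms=5.0):
    """Return (index, rtt) pairs for every sample above the threshold."""
    return [(i, rtt) for i, rtt in enumerate(rtts_ms) if rtt > threshold_ms]


# Made-up sample output, not a capture from the real network.
SAMPLE = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.312 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.478 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=5000 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=0.401 ms
"""

rtts = parse_rtts(SAMPLE)
spikes = find_spikes(rtts)
print(f"samples: {len(rtts)}, worst: {max(rtts)} ms, spikes: {spikes}")
```

Logging the spike timestamps next to the sync schedule is what makes a correlation like the one above visible.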
Additionally, we had recently reallocated VMs and condensed a few during the rebuild, which meant the CPU allocation was different, and candidly, while we had fewer VMs, those VMs were using more processing. When observing the CPU usage on the node I noticed that in some cases there was CPU pressure, where tasks for the VMs were being delayed in scheduling. This was roughly 5% in the worst case and would randomly spike.
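One way to put a number on this kind of scheduling delay on a Linux host is the kernel's pressure stall information (PSI) file at `/proc/pressure/cpu`. To be clear, that is an assumption about tooling for the sake of an example, not necessarily how the 5% figure above was measured. A minimal sketch of parsing that file's format:

```python
from pathlib import Path


def parse_pressure(text):
    """Parse PSI text into {"some": {"avg10": 5.12, ...}, ...}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Read the live values; needs a Linux kernel built with PSI support."""
    return parse_pressure(Path(path).read_text())


# Made-up sample in the kernel's documented format, not a real capture.
SAMPLE = "some avg10=5.12 avg60=3.40 avg300=1.05 total=123456789\n"

pressure = parse_pressure(SAMPLE)
print(f"time tasks spent stalled on CPU (10s avg): {pressure['some']['avg10']}%")
```

On a real host you would call `read_cpu_pressure()` on an interval and watch whether `avg10` climbs while a stream is running.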
For context, I often provision hardware to the redline/limits because the hardware in the server cluster is often older and I enjoy building new projects fairly regularly. The node for my thin client is also the most powerful always-on node in the cluster, because it is my old gaming PC hardware (typically upgraded every 3-4 years). Reallocating the CPU affinity so that a few cores are held back and not scheduled at all ended up removing this CPU pressure...though in hindsight running this close to the limits might be why the den is 8 degrees hotter than the rest of our home. With the pressure gone, ping spikes dropped from 5000ms to 300ms...good enough assuming the application uses TCP or can properly handle out-of-order packets...Moonlight at the time could not.
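To make "properly handle out-of-order packets" a bit more concrete, here is a minimal sketch of a reorder buffer keyed on sequence numbers. This is purely illustrative and is not how Moonlight or Sunshine handle their streams. The class name and the packet shape are hypothetical.

```python
class ReorderBuffer:
    """Release packets in sequence order, holding back any that arrive early."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq   # the sequence number we are waiting for
        self.pending = {}           # early arrivals, keyed by sequence number

    def push(self, seq, payload):
        """Accept one packet and return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []               # already delivered or duplicate, drop it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
delivered = []
# Packet 2 arrives before packet 1, as it might after a latency spike.
for seq, payload in [(0, "a"), (2, "c"), (1, "b"), (3, "d")]:
    delivered.extend(buf.push(seq, payload))
print(delivered)
```

A real-time stream also needs a deadline for giving up on a missing packet and has to deal with sequence number wraparound, both of which this sketch leaves out.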
As for the connectivity drops...apparently I had enabled some performance logging/monitoring for observing network traffic, and it was logging every single packet. So on random occasions the internal router for this box would just freeze and fail to process traffic correctly. To be fair, this is an old mini PC that was maybe $150 on Amazon at the time, with fairly dated hardware running pfSense, so it was not designed to run this kind of workload. Disabling this monitoring removed the need for randomly restarting this router and breaking critical infrastructure.
Now is probably a good time to mention that our NFS system is on a virtual machine that sits on the same host as my thin client, and at this time we were making use of a paravirtualized network device. What this means, however, is that if our VMs receive a large amount of traffic there will be some contention on the virtual NIC present on the thin client. The cleanest solution is dedicated hardware passed directly into the virtual machine, such as a NIC, and to do what I have done with the K8s cluster, where we connect the passed-through NICs to a separate network. This brought things down from 300ms to 0.2-0.4ms.
You would think this would completely stabilize things...well, here is the embarrassing part. This last part is entirely on past me for not clearing out the reserved IPs on the network and not turning off the dynamic DNS (which I no longer use, in favor of network tunnels). As a result, latency through the router would randomly spike from 0.2-0.4ms to 5ms, which was just enough to result in a few out-of-order packets triggering a crash in Moonlight. Actually cleaning up the routing table and turning off the dynamic DNS resolved this issue. Long term I will likely deploy something more intended for small business, because this router is not really suitable for the type of network I have.
A slight tangent
As of this writing, Moonlight's last stable release was in 2024...going on 2 years ago. This is...not ideal, as there are a fair number of improvements and fixes in later versions. I did have one crash that ultimately could only be resolved by bumping the application version...however, given the latency issues that were occurring, the above changes were still very necessary given the UDP nature of Moonlight.
I would also later replace the router with a Ubiquiti Dream Machine once it hit EOL. But this is its own topic, so here's the info on that if you want.