Lucia Smith: Personal Website

Network Shenanigans

One of the downsides of having numerous virtual machines, containers, and multiple network routers is that you also need to understand networking pretty well. Not just your layer one, but your infrastructure and your actual code. To make it worse, when you add in things like uncompressed game streaming and very data-heavy activities like VR, the network gets pushed even further toward its limits. Perhaps you view this as an enjoyable experience and challenge (in which case you are in the minority and a lot like me, thinking this kinda thing is fun).

Network load and issues due to application streaming

I make very heavy use of local application streaming through a custom application I have written called Athena. Under the hood, for the remote system support I am utilizing the Sunshine \ Moonlight projects in addition to things like RDP for legacy systems that I want remote control on (I do support VNC, but it's not exactly the most performant so I try to avoid it).

Most people are just going to do a 20-50MB/s connection, and candidly speaking you will get very solid performance over this. However...I like to stream fairly high fidelity games like Cyberpunk 2077 at 1440p and 120Hz. Uncompressed over the cable this is 11.59 Gbit/s (3.86 Gbit/s compressed with DSC) btw. So if you stream this raw video over the wire with little to no compression and HEVC you may end up sending 350-500MB/s over the network. As a tangent I would like to dabble more with AV1, but support for this requires a hardware encoder on the sender and a decoder on the receiver, and it's overall a less tested path...long term I can see a migration to it making sense. I worked on a project with video streaming as a core component many years ago...while I may have been primarily a web developer at the time, I was aware of a lot of the design considerations, so this wasn't completely unfamiliar.
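As a rough sanity check on the display numbers above, here is a minimal sketch of the arithmetic. It assumes 1440p means 2560x1440 and 8 bits per color channel (24 bits per pixel), and it counts only the visible pixels. The function name and defaults are made up for illustration and are not part of Athena or Moonlight. The result lands a little under the cable figure because the display link also carries blanking intervals.

```python
def raw_video_gbits_per_second(width, height, refresh_hz, bits_per_pixel=24):
    """Raw data rate of the visible pixels only, in decimal Gbit/s."""
    bits_per_second = width * height * refresh_hz * bits_per_pixel
    return bits_per_second / 1e9


# Assumption: 1440p = 2560x1440, 24 bits per pixel, no blanking overhead.
rate = raw_video_gbits_per_second(2560, 1440, 120)
print(f"{rate:.2f} Gbit/s of visible pixel data")  # 10.62 Gbit/s of visible pixel data
```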

The main streaming client for Athena is my thin client which is a virtual machine with a passed through GPU for decoding application streams as well as hardware acceleration and a PCIe card for USB devices. The host system for this virtual machine sits behind a separate router for security reasons as it is also host to our NFS storage and several network cron tasks. Additionally, this cuts down some of the network load on the edge router. Unfortunately this is yet more complexity and yet another network hop where things can go wrong.

Network load and issues due to VR applications

My partner and I originally met through VR and do on occasion spend time online still. However, this can be a very bandwidth heavy workload on our network. Let's use a couple examples.

Bandwidth my partner and I each use depending on the situation:

50-300 MB on the world
20-100 * (2-100) MB on the avatars
Assume new avatar loads while in the world of 5-30 * (20-100) MB per hour

Assume a formula of 2 * ( wb + ab * ac + al * h ) where:

wb = world bandwidth
ab = average avatar bandwidth
ac = avatar count
al = avatars loading in/out
h = hours in world

So a few examples we have had (each of these is 1 hour):

2 * ( 50 + 20 * 2 + 5 * 1 ) = 190 MB...small group chat
2 * ( 100 + 100 * 30 + 8 * 1 ) = 6216 MB...a small game world
2 * ( 300 + 80 * 50 + 30 * 1 ) = 8660 MB...full VR club
2 * ( 100 + 100 * 50 + 40 * 1 ) = 10280 MB...extra large instance
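The formula above can be sketched in Python as follows. The function and parameter names are made up for illustration, and the `people` argument assumes the leading 2 stands for the two people on the network.

```python
def vr_session_mb(world_mb, avatar_mb, avatar_count, avatars_loading, hours, people=2):
    """Estimate total MB for a VR session: people * (wb + ab * ac + al * h)."""
    return people * (world_mb + avatar_mb * avatar_count + avatars_loading * hours)


# (wb, ab, ac, al, h) taken from the one-hour examples in the text.
examples = {
    "small group chat": (50, 20, 2, 5, 1),
    "a small game world": (100, 100, 30, 8, 1),
    "full VR club": (300, 80, 50, 30, 1),
    "extra large instance": (100, 100, 50, 40, 1),
}

for name, args in examples.items():
    print(f"{name}: {vr_session_mb(*args)} MB")
```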

The problem with this is you are dropped into the world, and the longer it takes you to load everything the less time you can actually spend in it. And the data is being requested as it's needed, typically not very far ahead of time...so every second spent is another second where you can't fully see what is going on in the world. Because of this these numbers look worse in practice, because the traffic is very spiky. For example that extra large instance caused my partner and me to break 1.2 Gbps (it's one of the main reasons we have SFP+ on our outbound network today).
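To put the spikiness in numbers, here is a small sketch using the extra large instance total from the examples. It assumes decimal units (1 MB = 8 Mbit) and the helper names are hypothetical. Spread evenly over the hour the traffic is a modest average, while the same data moved at a 1.2 Gbit/s peak would be done in roughly a minute.

```python
def average_mbit_per_second(total_mb, seconds):
    """Average rate if total_mb were spread evenly over the given time."""
    return total_mb * 8 / seconds


def seconds_at_rate(total_mb, mbit_per_second):
    """Time needed to move total_mb at a fixed rate."""
    return total_mb * 8 / mbit_per_second


total_mb = 10280  # extra large instance, one hour, both of us
print(f"{average_mbit_per_second(total_mb, 3600):.1f} Mbit/s averaged over the hour")  # 22.8
print(f"{seconds_at_rate(total_mb, 1200):.0f} s if moved at 1.2 Gbit/s")  # 69
```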

The problem for application streaming

Following the home lab rebuild, when I would launch a program through Athena I would randomly see a crash. Athena manages a separate thread that kicks off the Moonlight client for the stream, and this client would be the component that crashed. Sometimes I could get an hour...sometimes 10-15 minutes. To make the situation worse, as of this writing Athena does not pause the remote application when a crash occurs (though I will probably add this at some point).

Random disconnections are not fun, but if they were rare enough I was willing to ignore them. Athena does have some measures to allow quick reconnections and keeps track of the remote machine state, so if we do crash the client this is not that big of a deal. Unfortunately...5-6 crashes in one night is not a usable system. And if someone was in VR the frequency could be even higher, to the point of being potentially unusable.

The problem for VR applications

My partner and I, when both in VR, would experience random drops in connectivity, sometimes even completely disconnecting us from worlds. What was interesting is sometimes I would drop, and sometimes she would drop...other times both of us would. If it was truly outbound ISP issues then I would have expected both of our connections to drop, but that was fairly uncommon.

Other times especially in very busy worlds where a lot of data was running over the wire I would experience exceptionally high degrees of instability on the network to the point Windows would tell me I was not connected to the internet. As someone who truly loves music this was a huge issue as I quite enjoy hearing unique takes or spins on songs or even completely new works from artists. It got to the point that listening to music was a gamble on whether the network would just disconnect me.

Tracing the Athena crashes

Investigating the crash I observed the ping on the remote would vary from 0.3-0.5ms under normal operation and spike as high as 5000ms. And my partner would observe no issues on her system. This also seemed to align randomly with syncs\changes from the cluster on the git and Nextcloud instances (for context I keep internal mirrors of several open source projects like QEMU, Linux kernel, and K8s because I have to read the code on these for additions, debugging, or fixes). And randomly I would completely lose remote connectivity with the box in the rack.
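A minimal sketch of how spikes like these could be picked out of ping output, assuming the Linux `ping` line format (`time=0.412 ms`). The sample text, the address in it, the function name, and the 100ms threshold are all made up for illustration and are not real captures from my network.

```python
import re

# Matches the round-trip time field in Linux ping output, e.g. "time=0.412 ms".
PING_TIME = re.compile(r"time[=<]([\d.]+)\s*ms")


def find_latency_spikes(ping_output, threshold_ms=100.0):
    """Return the round-trip times (in ms) that exceed threshold_ms."""
    times = [float(m.group(1)) for m in PING_TIME.finditer(ping_output)]
    return [t for t in times if t > threshold_ms]


# Hypothetical sample output, not a real capture.
sample = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.412 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=5003 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.387 ms
"""
print(find_latency_spikes(sample))  # [5003.0]
```

Feeding this the output of a long-running ping and logging timestamps next to the spikes makes it easier to line them up against things like sync jobs.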

This by itself doesn't mean much as a typical TCP/IP request would just mean a single slow frame right? Well that's the issue, Moonlight at this point in time had uncompressed video as an experimental feature and if there was instability it would crash the binary. And sure patching Moonlight is an option, but a 5-second spike in input latency is still not acceptable so while this would stop the crash I would still need to address the network latency.

Tracing the latency issue on Athena

The original implementation of Athena was done in 2024 and ran stable up until the end of 2025 with very few changes to the deployment. So why did the network latency spike to the point that Moonlight was crashing? Very little had changed in terms of raw hardware...why would things have broken now? What had changed was: VM configurations and a new internal router.

When we updated the home lab we realized that a lot of VMs could be moved or combined. As a result CPU allocation changed pretty significantly, and more CPU-heavy workloads landed on the host node running the thin client for Athena. When observing the CPU usage on this node I noticed that in some cases there was CPU pressure, where tasks for the VMs were being delayed on their scheduling. This was roughly 5% in the worst case and would randomly spike. The solution here was to change the CPU affinity on some VMs so that a few cores are never scheduled by VMs (leaving them for host tasks), which ended up removing this CPU pressure...though in hindsight this might be why the den is 8 degrees hotter than the rest of our home. This caused pings to drop from 5000ms to 300ms...good enough assuming the application is TCP or can properly handle out of order packets...Moonlight at the time would not.
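One way to watch for this kind of scheduling delay is Linux pressure stall information, sketched below. This is an assumption for illustration: it presumes the host is Linux with PSI enabled and exposes `/proc/pressure/cpu`, which may or may not match the tooling used here. The function name and the sample values are hypothetical.

```python
def parse_pressure(text):
    """Parse PSI text (the /proc/pressure/* format) into nested dicts of floats."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {key: float(value) for key, value in (f.split("=") for f in fields)}
    return result


# Hypothetical sample in the PSI format; values are made up.
sample = "some avg10=5.12 avg60=2.40 avg300=0.91 total=123456789\n"
stats = parse_pressure(sample)
print(stats["some"]["avg10"])  # 5.12
```

On a host with PSI available, the same function could be fed `open("/proc/pressure/cpu").read()`, where `avg10` is the percentage of the last 10 seconds in which some task was stalled waiting for CPU.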

Now is probably a good time to mention that our NFS system is on a virtual machine that sits on the same host as my thin client, and at this time we were making use of a paravirtualized network device. What this means, however, is that if our VMs receive a large amount of traffic there will be some contention on the virtual NIC present on the thin client. The cleanest solution to this is dedicated hardware, such as a NIC, passed directly into the virtual machine, doing what I have done with the k8s cluster where we connect the passed-through NICs to a separate network. This brought things down from 300ms to 0.2-0.4ms.

Tracing the latency issues on VR

Some important context here is that my VR box sits in the 12U server rack and dumps a very large amount of heat due to just how demanding VR is as an application. The ambient temperature in the server rack actually would break 130F. When the temperature would break 130F on the heat exhaust of the rack we would observe that networking into this box would sporadically fail.

This kinda tracks as the max ambient operating temperature for the networking gear in the rack is 100F. Logically of course the gear is going to fail at 30F over operating temperatures. This tracks as well with things working correctly for a couple of hours and then failing as the full metal server rack\machine chassis would become heat soaked.

To further validate that this was likely the physical switches: my partner's machine in the next room would never have network stability issues, as the backbone of our network runs through a 10Gb switch that is kept outside the server rack (sitting directly on top of it). I say this because we run internal DNS servers, which means when they fail (and they have in the past) they will take out domain resolution for all systems on the network, effectively causing most applications to think they cannot access the internet. This also isn't fixable by just pointing to an external DNS, because in the interest of minimizing outbound traffic flows and maximizing network cache usage all the nodes on the network use the internal DNS resolvers.

There's no elegant software solution here. I just moved one of the servers that was sitting on a shelf on top of the server rack and removed the metal side panels on the server rack. These two changes dropped the temperature enough that random disconnections stopped occurring. But I figured someone would appreciate the explanation.

Next steps...

Because our network gear is a mix of Ubiquiti and pfSense, the current plan is to introduce software defined networking into the stack. This is so I can automatically deploy networking changes from our CI\CD and even do things like migrate workloads between subnetworks and VMs automatically. For example, long term I would like a DB replica and an NFS replica so that in the event that node fails to start we don't lose core infrastructure. The main reason we can't really make an effective service migration today is that we have multiple subnetworks that were deployed specifically to avoid overloading the router at the top of our network and for security isolation.

A slight tangent

As of this writing Moonlight's last stable release was in 2024...going on 2 years ago. This is...not ideal, as there are a fair number of improvements and fixes in later versions. I did have one crash where ultimately the only way to resolve it was to bump the application version...however, given the latency issues that were occurring, the above changes were still very necessary given the UDP nature of Moonlight.

I also would later replace the router once it hit EOL with a Ubiquiti Dream Machine. But this is its own topic so here's the info on that if you want.
