Troubleshooting my Talos Linux cluster outage

October 31, 2025

A few days ago we had a power outage, and afterwards none of the 4 nodes in my Raspberry Pi Talos Linux cluster came back online.

Will Gorman

@willgorman.net

We lost power while I’m out of town and none of the nodes in my raspberry pi k8s cluster came back up. Now I’m anxious to get back and figure out what happened

Or at least that's what I thought at first. It turned out that the nodes were actually up but not on my Tailscale network (although if something isn't on my Tailscale network then for all practical purposes it's not up at all). So what happened?

In Talos Linux, Tailscale is installed on the hosts by way of an extension, so to figure out what was wrong with Tailscale I needed to check the extension's logs with the Talos CLI.

talosctl logs ext-tailscale

192.168.6.104: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2025-10-31T18:42:09-05:00 is after 1970-01-02T00:00:14Z"

Well that's not good. Talos authenticates all API interactions so if the certificate isn't valid then I wouldn't be able to do anything with the cluster at all. But what's up with the strange 1970-01-02T00:00:14Z timestamp?

I found a Reddit post with a very similar problem that indicated the issue was due to Raspberry Pi clocks resetting on reboot and then not being able to get the correct time from NTP. But why would my nodes be failing to get the time from NTP? According to the documentation, Talos would be using time.cloudflare.com as the default time server and I hadn't changed that.
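As a quick sanity check on the "clock reset" theory, the odd timestamp from the error can be compared against the Unix epoch. This is only an illustrative sketch: the timestamp is copied from the talosctl error above, and the variable names are mine.

```python
from datetime import datetime, timezone

# Expiry timestamp copied from the talosctl error message above.
not_after = datetime.fromisoformat("1970-01-02T00:00:14+00:00")

# The Unix epoch, which is where a clock with no battery backup
# typically starts counting after losing power.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

seconds_after_epoch = int((not_after - epoch).total_seconds())
print(seconds_after_epoch)  # 86414, i.e. 24 hours and 14 seconds
```

A date barely one day past the epoch fits the idea of a certificate issued by a node whose clock had just restarted from zero. The exact validity period of the certificate is my assumption here, not something I confirmed.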

I figured I should try connecting a monitor to one of the nodes and that gave me my next clue.

time query error with server "192.168.4.1"

That's not time.cloudflare.com, that's my eero router. For some reason (that I'm still not sure of) the cluster had been using my router for NTP instead of the default time.cloudflare.com. A quick check from another machine confirmed that the router's NTP server wasn't responding. I don't know why it stopped, but after I restarted the router it began answering NTP queries again, and access to the Talos API recovered as well.
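For anyone who wants to run a similar check, here is a minimal sketch of an SNTP query using only the Python standard library. It is not the exact check I ran; the function names are hypothetical, and it assumes the server listens on the standard NTP port (UDP 123).

```python
import socket
import struct
from typing import Optional

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_DELTA = 2208988800


def build_request() -> bytes:
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client).
    # The remaining 47 bytes of the 48-byte packet can be zero.
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet: bytes) -> float:
    # The transmit timestamp occupies bytes 40-47:
    # 32 bits of seconds since 1900, then 32 bits of fraction.
    if len(packet) < 48:
        raise ValueError("NTP response too short")
    secs, frac = struct.unpack("!II", packet[40:48])
    return secs - NTP_UNIX_DELTA + frac / 2**32


def query_ntp(server: str, timeout: float = 2.0) -> Optional[float]:
    # Returns the server's time as a Unix timestamp, or None if the
    # server does not answer (timeout, unreachable, malformed reply).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, 123))
            packet, _ = sock.recvfrom(512)
            return parse_transmit_time(packet)
        except (OSError, ValueError):
            return None


if __name__ == "__main__":
    # 192.168.4.1 is the router address from the error on the monitor.
    print(query_ntp("192.168.4.1"))
```

A result of None means the server never sent a usable reply within the timeout, which matches what I saw from the router before restarting it.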

The nodes still weren't on the Tailscale network but with talosctl working again I could easily see why. The auth key that they use to join the Tailscale network had expired, so after generating a new key in Tailscale and updating the extension config in Talos the cluster recovered to full health.