RKE2 Cluster Troubleshooting Guide
This document covers common issues encountered with RKE2 clusters and their solutions.
The issues documented here were encountered on multi-node RKE2 clusters deployed on Proxmox via the RKE2 Ansible Playbook and bootstrapped through the Rancher dashboard.
Table of Contents
- Calico Networking Issues
- Longhorn Storage Issues
- CSI Pod Exec Format Errors
- General Troubleshooting Commands
Calico Networking Issues
Symptom: Calico Node Pod Not Ready
Symptoms:
- Calico node pod shows 0/1 Ready status
- Pods on affected node cannot communicate with cluster services
- Longhorn manager and other network-dependent services fail
Example:
kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-node-64rz2 0/1 Running 0 22h
Diagnosis:
Check Calico node pod logs:
kubectl logs -n calico-system <calico-node-pod-name>
Look for errors like:
Failed to connect to typha endpoint 10.9.11.46:5473... i/o timeout
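To confirm the block, test reachability of the Typha endpoint from the affected node (a quick check; the IP and port come from the log line above, and nc must be installed on the node):
nc -vz -w 3 10.9.11.46 5473   # times out if the firewall drops the port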
Root Cause:
Port 5473 (Calico Typha) is not allowed through the firewall. Calico uses Typha as a fan-out proxy between the calico-node agents and the Kubernetes API datastore; with this port blocked, calico-node pods cannot reach Typha and never become Ready.
Solution:
Add port 5473/tcp to the firewall on all nodes:
sudo ufw allow 5473/tcp comment 'Calico Typha'
Verify the rule was added:
sudo ufw status numbered | grep 5473
Restart the affected Calico pod:
kubectl delete pod -n calico-system <calico-node-pod-name>
Verify the new pod becomes Ready:
kubectl get pods -n calico-system -w
Prevention:
Ensure the playbook's group_vars/all.yml includes port 5473 in both controlplane_ports and worker_ports.
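For reference, a minimal sketch of what that could look like (the exact list format is an assumption; match your playbook's existing schema):
controlplane_ports:
  - 5473/tcp   # Calico Typha
worker_ports:
  - 5473/tcp   # Calico Typha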
Longhorn Storage Issues
Symptom: Longhorn Manager Crashes
Symptoms:
- Longhorn manager pods in CrashLoopBackOff
- PVCs stuck in Pending state
- Longhorn admission webhook failures
Diagnosis:
Check longhorn-manager pod logs:
kubectl logs -n longhorn-system <longhorn-manager-pod-name>
Look for errors like:
Failed to initialize Longhorn API client
connection timed out
Check Longhorn backend service endpoints:
kubectl get endpoints -n longhorn-system longhorn-backend
Root Cause:
Typically caused by networking issues (see Calico section above). Longhorn manager requires functional cluster networking to communicate with the Longhorn backend service.
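To test whether pods can reach the backend service at all, a one-off curl pod works (a sketch; assumes the default longhorn-backend service port 9500 and that the node can pull curlimages/curl):
kubectl run net-test -n longhorn-system --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s -m 5 http://longhorn-backend:9500/v1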
Solution:
- Fix underlying networking issues first (ensure Calico is healthy)
- Delete the crashing longhorn-manager pod to force restart:
kubectl delete pod -n longhorn-system <longhorn-manager-pod-name>
- Verify all nodes are registered in Longhorn:
kubectl get nodes.longhorn.io -n longhorn-system
Expected output shows all cluster nodes:
NAME READY ALLOWSCHEDULING SCHEDULABLE AGE
master-01 True true True 22h
master-02 True true True 19m
worker-01 True true True 22h
Symptom: PVC Stuck in Pending
Diagnosis:
Check PVC status:
kubectl get pvc -n <namespace>
kubectl describe pvc -n <namespace> <pvc-name>
Check Longhorn CSI provisioner pods:
kubectl get pods -n longhorn-system -l app=csi-provisioner
Solution:
Ensure all CSI components are running (see the CSI Pod Exec Format Errors section below if they are not).
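Once the CSI components are healthy, a throwaway PVC confirms provisioning end to end (a minimal sketch; assumes the default longhorn StorageClass name):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc longhorn-test-pvc -w   # should reach Bound within a minute or two
kubectl delete pvc longhorn-test-pvc   # clean up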
CSI Pod Exec Format Errors
Symptom: CSI Pods CrashLoopBackOff with Exec Format Error
Symptoms:
- CSI pods (attacher, provisioner, resizer, snapshotter) in CrashLoopBackOff
- longhorn-csi-plugin pods not ready (0/3 or similar)
- Error in logs:
exec /csi-attacher: exec format error
Example:
kubectl get pods -n longhorn-system | grep csi
csi-attacher-858dd64dc4-6w768 0/1 CrashLoopBackOff 9 (2m37s ago) 23m
csi-provisioner-7d9f559dcd-dmvgv 0/1 CrashLoopBackOff 265 (43s ago) 21h
Diagnosis:
Check pod logs:
kubectl logs -n longhorn-system <csi-pod-name>
Error message:
exec /csi-attacher: exec format error
This indicates corrupted or incorrectly cached container images.
Root Cause:
An exec format error means the kernel was asked to run a binary it cannot execute. Here it usually means the container image was pulled or cached with binaries for the wrong CPU architecture (for example, arm64 binaries on an amd64 node), or the image layers were corrupted during the pull or by a registry issue.
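A quick sanity check is to confirm the node's CPU architecture matches the images you expect (run on the affected node):
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
uname -m   # x86_64 on amd64 nodes; an arm64/aarch64 image cannot run here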
Solution:
- Identify which nodes have failing CSI pods:
kubectl get pods -n longhorn-system -l app=csi-attacher -o wide
kubectl get pods -n longhorn-system -l app=csi-provisioner -o wide
kubectl get pods -n longhorn-system -l app=csi-resizer -o wide
kubectl get pods -n longhorn-system -l app=csi-snapshotter -o wide
- SSH to the affected node and remove the corrupted images:
For CSI attacher:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl images | grep csi-attacher
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl rmi <image-id>
For all CSI images at once:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl rmi $(sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl images | grep -E 'csi-(attacher|provisioner|resizer|snapshotter)' | awk '{print $3}' | sort -u)
- Delete the failing pods to force recreation with fresh images:
kubectl get pods -n longhorn-system -l app=csi-attacher -o wide | grep <node-name> | awk '{print $1}' | \
xargs kubectl delete pod -n longhorn-system
Repeat for provisioner, resizer, and snapshotter (or use the loop sketched after this list).
- Verify new pods are Running:
kubectl get pods -n longhorn-system | grep csi
Expected output:
csi-attacher-858dd64dc4-dr9x5 1/1 Running 0 2m
csi-provisioner-7d9f559dcd-4j5x4 1/1 Running 0 1m
csi-resizer-567484d9f7-t828z 1/1 Running 0 1m
csi-snapshotter-858f58fdc5-4hj86 1/1 Running 0 1m
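To avoid repeating the delete step for each component, a loop over the four apps works (a sketch; substitute your node name, and note that xargs -r skips empty input):
for app in csi-attacher csi-provisioner csi-resizer csi-snapshotter; do
  kubectl get pods -n longhorn-system -l app=$app -o wide | \
    grep <node-name> | awk '{print $1}' | \
    xargs -r kubectl delete pod -n longhorn-system
done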
Symptom: longhorn-csi-plugin Pod Not Ready
Diagnosis:
Check pod status:
kubectl get pods -n longhorn-system -l app=longhorn-csi-plugin
Check individual container status:
kubectl describe pod -n longhorn-system <longhorn-csi-plugin-pod-name>
Check logs for each container:
kubectl logs -n longhorn-system <pod-name> -c node-driver-registrar
kubectl logs -n longhorn-system <pod-name> -c livenessprobe
kubectl logs -n longhorn-system <pod-name> -c longhorn-csi-plugin
Solution:
The fix is the same as for the CSI exec format errors above: remove the corrupted images on the node, then delete the pod so it restarts with fresh pulls:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl rmi $(sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl images | grep -E 'csi-node-driver-registrar|livenessprobe' | awk '{print $3}' | sort -u)
kubectl delete pod -n longhorn-system <longhorn-csi-plugin-pod-name>
General Troubleshooting Commands
Cluster Health
Check node status:
kubectl get nodes -o wide
Check all pods across namespaces:
kubectl get pods -A
Check pods not running:
kubectl get pods -A | grep -v Running | grep -v Completed
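A related filter that avoids grep uses a field selector, with one caveat: pods in CrashLoopBackOff report phase Running, so the grep version above catches them while this one does not:
kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded'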
Calico Troubleshooting
List all Calico pods:
kubectl get pods -n calico-system
Check Calico node status on specific node:
kubectl logs -n calico-system <calico-node-pod-name>
Longhorn Troubleshooting
Check all Longhorn pods:
kubectl get pods -n longhorn-system
Check Longhorn nodes:
kubectl get nodes.longhorn.io -n longhorn-system
Check Longhorn volumes:
kubectl get volumes.longhorn.io -n longhorn-system
Check Longhorn backend endpoints:
kubectl get endpoints -n longhorn-system longhorn-backend
Check PVCs using Longhorn:
kubectl get pvc -A | grep longhorn
Container Runtime Commands
List images on a node:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl images
List containers on a node:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl ps -a
Remove an image:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
sudo CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock \
/var/lib/rancher/rke2/bin/crictl rmi <image-id>
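To avoid repeating the endpoint variable on every crictl call, export it once per SSH session and let sudo -E pass it through (a sketch):
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
sudo -E /var/lib/rancher/rke2/bin/crictl images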
Firewall Troubleshooting
Check UFW status:
sudo ufw status numbered
Add a firewall rule:
sudo ufw allow <port>/<protocol> comment '<description>'
Reload UFW:
sudo ufw reload
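If you prefer not to open a port globally, UFW can scope a rule to the cluster subnet (a sketch; 10.9.11.0/24 is a placeholder for your node network):
sudo ufw allow from 10.9.11.0/24 to any port 5473 proto tcp comment 'Calico Typha (cluster only)'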
Log Analysis
Get recent pod logs:
kubectl logs -n <namespace> <pod-name> --tail=100
Get previous pod logs (after crash):
kubectl logs -n <namespace> <pod-name> --previous
Stream pod logs:
kubectl logs -n <namespace> <pod-name> -f
Get logs from specific container in multi-container pod:
kubectl logs -n <namespace> <pod-name> -c <container-name>
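Get logs from every container in a multi-container pod at once:
kubectl logs -n <namespace> <pod-name> --all-containers=true --tail=100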
Common Issue Resolution Checklist
When troubleshooting RKE2 cluster issues, follow this checklist:
- Check node status - Ensure all nodes are Ready
- Check networking - Verify Calico pods are Running and Ready
- Check firewall - Ensure all required ports are open (especially 5473 for Calico)
- Check storage - Verify Longhorn manager pods are Running
- Check CSI - Ensure all CSI component pods are Running
- Check PVCs - Verify PVCs are Bound if storage is required
- Check logs - Review pod logs for specific error messages
Known Issues and Workarounds
Issue: Node hostname resolution errors
Symptom:
sudo: unable to resolve host <hostname>: Name or service not known
Cause: The /etc/hosts file has stale or incorrect entries for the node's hostname.
Solution: This is already handled by the playbook, which ensures /etc/hosts contains the correct entry:
127.0.1.1 <hostname>
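To verify the entry on a node:
ssh ubuntu@<node-ip> -i ~/.ssh/<key>
grep "$(hostname)" /etc/hosts   # should show the 127.0.1.1 line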
Issue: Cloud provider taint preventing pod scheduling
Symptom: Pods stuck in Pending state with taint-related errors.
Cause: RKE2 may add a cloud provider taint even when not using a cloud provider.
Solution: Remove the taint:
kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-
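To check which nodes carry taints before and after removal:
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'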