Troubleshooting Guide
Common issues and solutions for the homelab infrastructure.
Flux GitOps Issues
Flux Not Syncing Changes
Symptoms:
- Changes pushed to Git but not applied to cluster
- Kustomizations stuck in "Unknown" state
Diagnosis:
# Check Flux system status
flux get kustomizations
# Check source controller
flux get sources git
# View source controller logs
kubectl logs -n flux-system -l app=source-controller
Solutions:
- Force reconciliation
- Check Git repository access
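The two steps above can be sketched as shell helpers. This is a minimal sketch, not the exact commands used in this homelab: it assumes the GitRepository and Kustomization both carry the `flux bootstrap` default name `flux-system`, and the function names are hypothetical. Substitute the names reported by `flux get sources git` if yours differ.

```shell
# Assumption: source and Kustomization are both named "flux-system"
# (the flux bootstrap default). Pass another name as $1 to override.
flux_force_sync() {
  # Fetch the latest commit from the Git source
  flux reconcile source git "${1:-flux-system}" -n flux-system
  # Re-apply the Kustomization, refreshing its source first
  flux reconcile kustomization "${1:-flux-system}" -n flux-system --with-source
}

# Show the GitRepository conditions and events; authentication
# or clone errors are reported there.
flux_check_git_access() {
  kubectl describe gitrepository "${1:-flux-system}" -n flux-system
}
```

Run `flux_force_sync` after pushing a change, then `flux_check_git_access` if the source still reports an old revision.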
HelmRelease Failures
Symptoms:
- HelmRelease in "Failed" state
- Applications not deploying
Diagnosis:
# Check HelmRelease status
flux get helmreleases -A
# Get detailed error information
kubectl describe helmrelease <release-name> -n <namespace>
# Check helm controller logs
kubectl logs -n flux-system -l app=helm-controller
Solutions:
- Check chart repository
- Suspend and resume
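A minimal sketch of both steps, using only standard `flux` subcommands. The function names are hypothetical, and the release name and namespace are supplied by the caller.

```shell
# List Helm repositories and the chart artifacts built from them.
# A source that is not READY usually explains a failed release.
helm_check_repos() {
  flux get sources helm -A
  flux get sources chart -A
}

# Suspend, then resume a HelmRelease so the helm-controller
# reconciles it again.
# Usage: helm_release_retry <release-name> <namespace>
helm_release_retry() {
  flux suspend helmrelease "$1" -n "$2"
  flux resume helmrelease "$1" -n "$2"
}
```

If the release fails again after resuming, the cause is more likely in the chart values than in a transient fetch error.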
Database Connection Issues
PostgreSQL Connection Problems
Symptoms:
- Applications unable to connect to database
- Connection timeouts or authentication failures
Diagnosis:
# Check PostgreSQL pod status
kubectl get pods -n postgresql
# View PostgreSQL logs
kubectl logs -n postgresql statefulset/postgresql
# Test connection from within cluster
kubectl run -it --rm debug --image=postgres:16 --restart=Never -- psql -h postgresql.postgresql.svc.cluster.local -U postgres
Solutions:
- Check service and endpoints
- Verify secrets
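One possible way to carry out these checks. The `postgresql` namespace comes from the diagnosis commands above. The secret name and key are not given in this guide, so they are caller-supplied placeholders, and the function names are hypothetical.

```shell
# Confirm the Service exists and has at least one endpoint behind it.
pg_check_service() {
  kubectl get svc,endpoints -n postgresql
}

# Decode a single key of a Secret to compare with what the application uses.
# Secret name and key are placeholders; this guide does not name them.
# Usage: pg_show_secret_key <secret-name> <key> [namespace]
pg_show_secret_key() {
  kubectl get secret "$1" -n "${3:-postgresql}" -o "jsonpath={.data.$2}" | base64 -d
}
```

An empty endpoints list means no ready pod matches the Service selector. The second helper prints a credential in clear text, so use it only in a trusted terminal.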
Redis Connection Issues
Symptoms:
- Cache misses or connection errors
- Applications reporting Redis unavailability
Diagnosis:
# Check Redis pod status
kubectl get pods -n redis
# Test Redis connection
kubectl run -it --rm debug --image=redis:7 --restart=Never -- redis-cli -h redis-master.redis.svc.cluster.local ping
Networking Problems
Cloudflare Tunnel Issues
Symptoms:
- External services not accessible
- Tunnel showing as disconnected
Diagnosis:
# Check cloudflared pod status
kubectl get pods -n cloudflared
# View tunnel logs
kubectl logs -n cloudflared deployment/cloudflared-cloudflare-tunnel
Solutions:
- Restart tunnel
- Check tunnel configuration
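A sketch of the restart and configuration check. The deployment name and namespace are taken from the log command above, and the function names are hypothetical. The ConfigMap and Secret names are not given in this guide, so the second helper only lists what exists.

```shell
# Restart the tunnel pods and wait for the rollout to finish.
cloudflared_restart() {
  kubectl rollout restart deployment/cloudflared-cloudflare-tunnel -n cloudflared
  kubectl rollout status deployment/cloudflared-cloudflare-tunnel -n cloudflared
}

# List the ConfigMaps and Secrets the tunnel may be reading from.
cloudflared_show_config() {
  kubectl get configmap,secret -n cloudflared
}
```

If the tunnel stays disconnected after a restart, check the logs for credential or ingress-rule errors before changing the configuration.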
DNS Resolution Problems
Symptoms:
- Services not resolving by name
- External DNS records not created
Diagnosis:
# Check external-dns logs
kubectl logs -n external-dns deployment/external-dns
# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup auth.kjho.me
Solutions:
- Check external-dns configuration
- Verify Cloudflare credentials
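A minimal sketch of both checks. It assumes external-dns is configured through container arguments on the `external-dns` deployment named in the log command above, which is common but not confirmed here. The function names are hypothetical, and the credentials Secret is not named in this guide, so the second helper only lists candidates.

```shell
# Print the arguments external-dns was started with
# (provider, domain filters, sources).
external_dns_args() {
  kubectl get deployment external-dns -n external-dns \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# List Secrets in the namespace; the Cloudflare API token should be one of them.
external_dns_secrets() {
  kubectl get secret -n external-dns
}
```

A domain filter that does not cover the expected hostname is a frequent reason for records not being created.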
Storage Issues
Longhorn Volume Problems
Symptoms:
- PVCs stuck in "Pending" state
- Application pods failing to start due to volume mount issues
Diagnosis:
# Check PVC status
kubectl get pvc -A
# Check Longhorn system
kubectl get pods -n longhorn-system
# Check storage classes
kubectl get storageclass
Solutions:
- Check Longhorn UI:
  - Access Longhorn dashboard at configured URL
  - Review volume and node status
- Restart Longhorn components
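A hedged sketch of the restart step, preceded by a check that shows why a PVC is pending. It assumes the `longhorn-manager` DaemonSet name used by the upstream Longhorn chart; confirm it with `kubectl get daemonset -n longhorn-system` first. The function names are hypothetical.

```shell
# Show the events of a PVC; a pending claim usually states the reason there.
# Usage: longhorn_describe_pvc <pvc-name> <namespace>
longhorn_describe_pvc() {
  kubectl describe pvc "$1" -n "$2"
}

# Assumption: the manager runs as DaemonSet "longhorn-manager".
# Restart it and wait until every node has a ready pod again.
longhorn_restart_manager() {
  kubectl rollout restart daemonset/longhorn-manager -n longhorn-system
  kubectl rollout status daemonset/longhorn-manager -n longhorn-system
}
```

Restarting the manager briefly interrupts volume management, so check the PVC events first and restart only if they point at Longhorn itself.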
Authentication Issues
Authentik Problems
Symptoms:
- Unable to access Authentik UI
- Authentication flows failing
- Database connection errors
Diagnosis:
# Check Authentik pods
kubectl get pods -n authentik
# View Authentik logs
kubectl logs -n authentik deployment/authentik-server
kubectl logs -n authentik deployment/authentik-worker
# Check database initialization
kubectl get jobs -n authentik
Solutions:
- Restart Authentik components
- Check database connectivity
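One way these two steps might look. The deployment names come from the log commands above, and the function names are hypothetical. The connectivity check assumes Authentik uses the shared PostgreSQL service from the database section; pass a different host as the first argument if it does not.

```shell
# Restart the server and worker deployments.
authentik_restart() {
  kubectl rollout restart deployment/authentik-server deployment/authentik-worker -n authentik
}

# Assumption: Authentik talks to the shared PostgreSQL service.
# pg_isready reports whether the server accepts connections.
authentik_check_db() {
  kubectl run -it --rm debug --image=postgres:16 --restart=Never -- \
    pg_isready -h "${1:-postgresql.postgresql.svc.cluster.local}"
}
```

`pg_isready` only confirms that the server is reachable; authentication failures still have to be read from the Authentik logs.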
1Password Sync Issues
Symptoms:
- Secrets not syncing from 1Password
- OnePasswordItem resources in error state
Diagnosis:
# Check 1Password Connect status
kubectl get pods -n 1password-connect
# Check OnePasswordItem status
kubectl get onepassworditem -A
# View operator logs
kubectl logs -n 1password-connect deployment/onepassword-connect-operator
Solutions:
- Restart 1Password Connect
- Check credentials
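A minimal sketch of both steps. Only the operator deployment is named in this guide, so the restart helper restarts every deployment in the `1password-connect` namespace instead of guessing the Connect server's name. The function names are hypothetical.

```shell
# Restart all deployments in the namespace (Connect server and operator).
op_restart_connect() {
  kubectl rollout restart deployment -n 1password-connect
}

# Show the status and events of one OnePasswordItem; credential or
# vault lookup errors are reported there.
# Usage: op_describe_item <item-name> <namespace>
op_describe_item() {
  kubectl describe onepassworditem "$1" -n "$2"
}
```

To see which credential Secrets exist without printing their values, `kubectl get secret -n 1password-connect` is enough.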
General Debugging Commands
Cluster Health
# Node status
kubectl get nodes -o wide
# Resource usage
kubectl top nodes
kubectl top pods -A
# Events
kubectl get events -A --sort-by=.metadata.creationTimestamp
Pod Debugging
# Pod status and details
kubectl get pods -A -o wide
kubectl describe pod <pod-name> -n <namespace>
# Container logs
kubectl logs <pod-name> -n <namespace> -c <container-name>
kubectl logs <pod-name> -n <namespace> --previous
# Open a shell in the pod (use /bin/sh if the image does not ship bash)
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
Network Debugging
# Test connectivity between pods
kubectl run -it --rm debug --image=busybox --restart=Never -- wget -qO- http://service-name.namespace.svc.cluster.local
# Check service endpoints
kubectl get endpoints -n <namespace>
# Network policies
kubectl get networkpolicies -A
Emergency Procedures
Complete Cluster Reset
Warning: Destructive Operation
Only use in development or when the cluster is completely broken.
# From deployment machine
ansible-playbook -i provisioning/k3s-inventory.ini provisioning/k3s-wipe.yml
ansible-playbook -i provisioning/k3s-inventory.ini provisioning/k3s-bootstrap.yml
Service Rollback
# Rollback HelmRelease to previous version
flux suspend helmrelease <release-name> -n <namespace>
helm rollback <release-name> -n <namespace>
flux resume helmrelease <release-name> -n <namespace>
# Or revert Git commit
git revert <commit-hash>
git push origin main
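`helm rollback` without a revision returns to the previous one. If an older revision is needed, the sketch below lists the history and rolls back to a chosen number. The function names are hypothetical, and it assumes the Helm release shares the HelmRelease's name and namespace, which may not hold if the HelmRelease sets a custom release name or target namespace.

```shell
# List the revisions Helm has recorded for a release.
# Usage: helm_show_history <release-name> <namespace>
helm_show_history() {
  helm history "$1" -n "$2"
}

# Roll back to a specific revision. Run this while the HelmRelease
# is suspended, as in the commands above.
# Usage: helm_rollback_to <release-name> <namespace> <revision>
helm_rollback_to() {
  helm rollback "$1" "$3" -n "$2"
}
```

After resuming, Flux reconciles toward whatever Git declares, so a manual rollback only holds until the offending commit is reverted.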
📁 Related Files:
- Discord Notifications - Alert configuration
- Database Configurations - PostgreSQL and Redis
- Network Configurations - Cloudflare and DNS