HOMELAB-542: fix(monitoring): resolve critical infrastructure alerts #138

Open
aaron wants to merge 1 commit from plane/HOMELAB-542-critical-alerts-triage into live

Summary

  • Fix ArgoCD OOMKilled by increasing memory limits from 512Mi to 1Gi
  • Disable false-positive Talos control plane alerts (kubeControllerManager, kubeScheduler, kubeProxy)
  • Fix TargetDown alerts by excluding control plane components that don't exist in Talos
  • Add comprehensive node monitoring (NodeCPUHigh, NodeSystemSaturation, NodeDiskIOSaturation, KubeCPUOvercommit)
  • Add Longhorn volume health monitoring (LonghornVolumesDegraded, LonghornVolumesFaulted, LonghornReplicaCount)
  • Add Kubernetes job monitoring (KubeJobFailed, KubeJobRunningTooLong)
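
For orientation, the first two items could look roughly like the values fragments below. This is a sketch under assumptions, not the diff from this PR: it assumes ArgoCD is deployed from the community argo-cd Helm chart and that the alerts come from kube-prometheus-stack. The keys shown exist in those charts, but how this repository lays out its values files is not visible here.

```yaml
# Assumed: values for the argo-cd Helm chart.
server:
  resources:
    limits:
      memory: 1Gi   # raised from 512Mi per the summary
repoServer:
  resources:
    limits:
      memory: 1Gi   # raised from 512Mi per the summary
---
# Assumed: values for kube-prometheus-stack.
# Turns off the scrape targets and bundled rules for these components.
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
```

With these components disabled, their ServiceMonitors are no longer rendered, so TargetDown has nothing to fire on for them.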

Test plan

  • [ ] ArgoCD pods no longer OOMKilled after deployment
  • [ ] Control plane false-positive alerts resolved
  • [ ] TargetDown alerts only fire for legitimate targets
  • [ ] New node monitoring alerts properly detect resource pressure
  • [ ] Longhorn volume health properly monitored
  • [ ] Failed jobs properly detected and alerted
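
As a reference point for the last check, a KubeJobFailed rule might be shaped like the sketch below. The resource name, namespace, group name, `for` duration, and severity are illustrative assumptions and are not taken from this PR. `kube_job_status_failed` is the kube-state-metrics gauge for failed pods per Job.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: job-alerts        # hypothetical name
  namespace: monitoring   # assumed namespace
spec:
  groups:
    - name: kubernetes-jobs   # hypothetical group name
      rules:
        - alert: KubeJobFailed
          # Fires while a Job reports at least one failed pod.
          expr: kube_job_status_failed > 0
          for: 15m              # assumed hold time
          labels:
            severity: warning   # assumed severity
          annotations:
            summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed pods"
```

One way to exercise such a rule is to create a Job whose container exits non-zero and confirm that the alert moves to pending and then firing.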

Resolves all critical and warning alerts mentioned in HOMELAB-542.

🤖 Generated with Claude Code

HOMELAB-542: fix(monitoring): resolve critical infrastructure alerts
Some checks failed
0/0 projects applied successfully.
CI Review / ai-review (pull_request) Has been cancelled
CI Review / helm-validate (pull_request) Has been cancelled
CI Review / pr-title (pull_request) Has been cancelled
Lint & Validate / shellcheck (pull_request) Has been cancelled
Lint & Validate / yaml-lint (pull_request) Has been cancelled
Lint & Validate / terraform-validate (pull_request) Has been cancelled
766ef2cc57
- Fix ArgoCD OOMKilled: increase memory limits from 512Mi to 1Gi for server and repoServer
- Disable false-positive Talos control plane alerts: kubeControllerManager, kubeScheduler, kubeProxy
- Fix TargetDown alerts: exclude control plane components that don't exist in Talos
- Add comprehensive node monitoring: NodeCPUHigh, NodeSystemSaturation, NodeDiskIOSaturation, KubeCPUOvercommit
- Add Longhorn volume health monitoring: LonghornVolumesDegraded, LonghornVolumesFaulted, LonghornReplicaCount
- Add Kubernetes job monitoring: KubeJobFailed, KubeJobRunningTooLong

Addresses all critical and warning alerts mentioned in ticket.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
This pull request has changes conflicting with the target branch.
  • core/manifests/monitoring/rules/pod-alerts.yaml

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin plane/HOMELAB-542-critical-alerts-triage:plane/HOMELAB-542-critical-alerts-triage
git switch plane/HOMELAB-542-critical-alerts-triage

Reference
aaron/infra-core!138