feat: add resource limit alerts and fix Loki retention (HOMELAB-180) #19

Merged
claude-agent merged 1 commit from plane/HOMELAB-180-resource-limit-alerts into live 2026-03-22 20:31:24 +00:00
Owner

Summary

  • Longhorn alerts (HOMELAB-219): disk usage at 80%/90%, node storage pressure at 80%
  • Observability alerts (HOMELAB-220): Prometheus TSDB/WAL, Loki error rates/ingester, Tempo compaction/flush
  • Improved PVC alerts (HOMELAB-221): tiered 75%/90% thresholds + predict_linear() filling-fast prediction
  • Loki retention fix (HOMELAB-222): set compactor retention_enabled: true; without it, the 30d retention_period was silently ignored
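
For reviewers who want a concrete picture of the tiered PVC thresholds and the predict_linear() rule, here is a minimal sketch of what such a PrometheusRule could look like. It is not the manifest from this PR: the resource name, group name, alert names, the 1h lookback window and the `for:` durations are all assumptions chosen for illustration. Only the 75%/90% thresholds and the 4h horizon come from the description and commit message.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-alerts-example        # hypothetical name, not from this PR
  namespace: monitoring
spec:
  groups:
    - name: pvc-capacity.example  # hypothetical group name
      rules:
        # Warning tier: volume is more than 75% full.
        - alert: PVCUsageWarning
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.75
          for: 10m                # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 75% full"
        # Critical tier: volume is more than 90% full.
        - alert: PVCUsageCritical
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.90
          for: 5m                 # assumed hold time
          labels:
            severity: critical
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 90% full"
        # Filling fast: a linear fit over the last hour (assumed window)
        # predicts that free space reaches zero within 4 hours.
        - alert: PVCFillingFast
          expr: |
            predict_linear(kubelet_volume_stats_available_bytes[1h], 4 * 3600) < 0
          for: 15m                # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is predicted to fill within 4h"
```

The prediction rule complements the static thresholds: a volume at 60% that is growing quickly can fire PVCFillingFast well before it crosses 75%. If the rules live inside a Helm chart template, the `{{ ... }}` annotation placeholders would need escaping so Helm does not try to render them.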

Total alert count: 16 → 26 (+10 new alerts)

Test plan

  • ArgoCD syncs PrometheusRule resources
  • kubectl get prometheusrules -n monitoring shows longhorn-alerts, observability-alerts
  • Prometheus UI /alerts shows new rules loaded without errors
  • Loki restarts with compactor retention enabled (check logs)
  • No false-positive alerts firing on healthy cluster

🤖 Generated with Claude Code

feat: add resource limit alerts and fix Loki retention (HOMELAB-180)
Some checks failed
CI Review / pr-title (pull_request) Failing after 0s
CI Review / helm-validate (pull_request) Failing after 2s
CI Review / ai-review (pull_request) Failing after 1s
Lint & Validate / terraform-validate (pull_request) Failing after 1s
Lint & Validate / yaml-lint (pull_request) Failing after 1s
Lint & Validate / shellcheck (pull_request) Failing after 1s
f5b472ce74
- Add Longhorn storage capacity alerts (disk 80%/90%, node storage 80%)
- Add observability stack alerts (Prometheus TSDB/WAL, Loki errors,
  Tempo compaction/ingester failures)
- Improve PVC alerts: tiered thresholds (75% warn, 90% crit) and
  predict_linear() for "filling in <4h" prediction
- Fix Loki retention: enable compactor retention_enabled (required for
  retention_period to actually enforce cleanup on 50Gi PVC)
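
To illustrate the retention fix, the following is a minimal sketch of the relevant Loki configuration keys, assuming a plain Loki config file rather than the exact values layout used in this repository. The `delete_request_store` value and the `720h` spelling of the 30-day period are assumptions; in a Helm values file these keys would usually sit under a chart-specific parent key.

```yaml
compactor:
  # Without this flag the compactor only compacts indexes; it does not
  # delete chunks, so retention_period has no effect.
  retention_enabled: true
  # Assumption: recent Loki releases (3.x) also expect a delete request
  # store when retention is enabled; "filesystem" is an illustrative value.
  delete_request_store: filesystem

limits_config:
  # 30 days, written in hours.
  retention_period: 720h
```

After a restart, the compactor logs should indicate that retention is being applied, which is what the "check logs" item in the test plan refers to.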

Sub-tickets: HOMELAB-219, HOMELAB-220, HOMELAB-221, HOMELAB-222

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reference
aaron/infra-core!19