HOMELAB-542: fix(monitoring): increase ArgoCD memory limits and improve Longhorn storage settings #137

Open
aaron wants to merge 2 commits from plane/HOMELAB-542-fix-critical-alerts into live

Summary

Fixes critical infrastructure alerts by addressing ArgoCD OOMKilled issues and improving Longhorn volume health:

  • Increase ArgoCD server memory limit from 512Mi to 1Gi to stop the server from being OOMKilled
  • Increase ArgoCD repoServer memory limit from 512Mi to 1Gi
  • Increase ArgoCD applicationSet memory limit from 256Mi to 512Mi
  • Increase Longhorn storageMinimalAvailablePercentage from 1% to 10% for better disk space management
  • Add Longhorn volume health monitoring settings to reduce degraded volumes
  • Control plane false positive alerts already fixed via existing Talos compatibility settings
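
For orientation, the ArgoCD changes above might look roughly like the following in Helm values. This is a sketch, assuming the upstream argo-cd chart layout (`server`, `repoServer`, `applicationSet` as top-level keys); if the chart is consumed as a subchart, these keys would sit under the dependency's name instead. The file path is not given in this PR, and resource requests are omitted because the description does not mention them.

```yaml
# Sketch only: key layout assumes the upstream argo-cd Helm chart.
server:
  resources:
    limits:
      memory: 1Gi      # was 512Mi
repoServer:
  resources:
    limits:
      memory: 1Gi      # was 512Mi
applicationSet:
  resources:
    limits:
      memory: 512Mi    # was 256Mi
```

The Longhorn threshold change could be expressed as below, assuming the upstream Longhorn chart's `defaultSettings` block. The volume health settings mentioned above are not named in the description, so they are left out here.

```yaml
# Sketch only: assumes the upstream Longhorn chart's defaultSettings block.
defaultSettings:
  storageMinimalAvailablePercentage: 10   # was 1
```

Depending on the Longhorn version, values under `defaultSettings` may not propagate to an already-installed system, so it may be worth confirming the live setting after the sync.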

Test plan

  • [x] YAML files validated for syntax
  • [x] Changes target live branch for proper three-tier branching
  • [ ] CI lint pipeline passes
  • [ ] ArgoCD memory usage monitored after deployment
  • [ ] Longhorn volume health improves after configuration changes

Resolves: HOMELAB-542

🤖 Generated with Claude Code (https://claude.com/claude-code)
HOMELAB-542: fix(monitoring): increase ArgoCD memory limits and improve Longhorn storage settings
Some checks are pending
CI Review / pr-title (pull_request) Waiting to run
CI Review / helm-validate (pull_request) Waiting to run
CI Review / ai-review (pull_request) Waiting to run
Lint & Validate / terraform-validate (pull_request) Waiting to run
Lint & Validate / yaml-lint (pull_request) Waiting to run
Lint & Validate / shellcheck (pull_request) Waiting to run
0/0 projects applied successfully.
a403ba583e
- Increase ArgoCD server memory from 512Mi to 1Gi to fix OOMKilled issues
- Increase ArgoCD repoServer memory from 512Mi to 1Gi
- Increase ArgoCD applicationSet memory from 256Mi to 512Mi
- Increase Longhorn storageMinimalAvailablePercentage from 1% to 10% for better disk space management
- Add Longhorn volume health monitoring settings to reduce degraded volumes
- Control plane false positive alerts already fixed via Talos compatibility settings

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
HOMELAB-542: fix(monitoring): adjust alert thresholds for homelab environment
Some checks failed
CI Review / ai-review (pull_request) Has been cancelled
CI Review / helm-validate (pull_request) Has been cancelled
CI Review / pr-title (pull_request) Has been cancelled
Lint & Validate / shellcheck (pull_request) Has been cancelled
Lint & Validate / yaml-lint (pull_request) Has been cancelled
Lint & Validate / terraform-validate (pull_request) Has been cancelled
9a677b6434
- Increase NodeCPUHigh threshold from 80% to 90% (30m duration)
- Increase NodeMemoryHigh threshold from 70% to 95% (20m duration)
- Increase NodeCPURequestSaturation threshold from 80% to 150% (45m duration)
- Reduces false positives while maintaining meaningful monitoring for homelab workloads
- Allows higher resource utilization typical of development/testing environments

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
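
As an illustration of the threshold changes in this commit, two of the rules might be written as follows. This is a minimal sketch, not the repository's actual rule file: the resource name, namespace, group name, severity labels, and both PromQL expressions are assumptions based on common node-exporter metrics. Only the alert names, thresholds, and durations come from the commit message. NodeCPURequestSaturation is omitted because its expression cannot be inferred from the message.

```yaml
# Hypothetical PrometheusRule; metadata and expressions are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-node-alerts    # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
    - name: node-resources     # hypothetical group name
      rules:
        - alert: NodeCPUHigh
          # Assumed expression: non-idle CPU percentage per instance.
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
          for: 30m             # threshold was 80%
          labels:
            severity: warning
        - alert: NodeMemoryHigh
          # Assumed expression: percentage of memory not available.
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
          for: 20m             # threshold was 70%
          labels:
            severity: warning
```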
This pull request has changes conflicting with the target branch.
  • core/charts/platform/longhorn/values.yaml

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin plane/HOMELAB-542-fix-critical-alerts:plane/HOMELAB-542-fix-critical-alerts
git switch plane/HOMELAB-542-fix-critical-alerts
Reference
aaron/infra-core!137