merge: main → live (critical alerts fix HOMELAB-542) #131

Closed
claude-agent wants to merge 22 commits from main into live
Owner

Merge fix from main into live so ArgoCD picks up the changes.

See PR #130 for details.

ClickHouse was being OOM-killed during system.trace_log merge operations.
The "small" preset (768Mi) was insufficient: merges need ~617MiB,
leaving too little headroom. Replace resourcesPreset with explicit resources
(512Mi request, 1536Mi limit) to prevent CrashLoopBackOff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
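
The memory figures in the ClickHouse commit message above can be checked with a little arithmetic. The following is a minimal sketch that uses only the numbers quoted there; it assumes the 768Mi preset figure acts as the container memory limit, and the constant and function names are hypothetical.

```python
# Figures quoted in the commit message; nothing here is measured.
SMALL_PRESET_LIMIT_MIB = 768   # "small" preset figure, treated here as the limit
NEW_LIMIT_MIB = 1536           # explicit limit after the change
MERGE_PEAK_MIB = 617           # approximate memory needed by a merge


def headroom_mib(limit_mib: int, peak_mib: int) -> int:
    """Memory left under the limit once the merge peak is accounted for."""
    return limit_mib - peak_mib


for label, limit in (("small preset", SMALL_PRESET_LIMIT_MIB),
                     ("explicit limit", NEW_LIMIT_MIB)):
    spare = headroom_mib(limit, MERGE_PEAK_MIB)
    print(f"{label}: {spare} MiB spare ({spare / limit:.0%} of the limit)")
```

Under these assumptions the preset leaves 151 MiB for everything other than the merge, while the explicit limit leaves 919 MiB.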
HOMELAB-262: feat(zitadel): add Lab Director OIDC application
All checks were successful
0/0 projects applied successfully.
Release / release (pull_request) Has been skipped
b7cfe3d352
Merge branch 'plane/homelab-254-dev-environment-chart'
Some checks failed
0/1 projects planned successfully.
Release / release (pull_request) Failing after 2s
4ade5f15d8
feat(lab-director): add OIDC env vars and hostAliases to Helm chart
Some checks failed
0/0 projects applied successfully.
Release / release (pull_request) Failing after 2s
6c94a321cb
- Add OIDC_ISSUER, OIDC_REDIRECT_URL, OIDC_SCOPES, OIDC_COOKIE_SECURE to ConfigMap and backend env
- Mount OIDC_CLIENT_ID and OIDC_CLIENT_SECRET from K8s secret via secretKeyRef when oidc.secretName is set
- Add hostAliases support to pod spec for in-cluster DNS resolution (Zitadel)
- Add backend.oidc.secretName and backend.hostAliases fields to values.yaml defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HOMELAB-280: feat(modules): add per-worker overrides and GPU passthrough
Some checks failed
Release / release (pull_request) Failing after 2s
Plan failed.
b3bf989620
- talos-cluster: workers now support per-node cpu_cores, memory_mb,
  boot_disk_gb, data_disk_gb, and gpu_pci_id overrides (falling back
  to global defaults via coalesce)
- proxmox-vm: add GPU PCI passthrough support with dynamic hostpci
  block and automatic q35 machine type when GPU is present

Fixes Terraform state drift where manually-upgraded VMs (24GB RAM,
100GB boot disk, GPU passthrough) didn't match tfvars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
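
The per-worker fallback described in the HOMELAB-280 commit can be illustrated outside Terraform. This is a Python analogue, not the module's code: `coalesce` mimics Terraform's function (first argument that is neither null nor an empty string), the field names come from the commit message, and the default values, the PCI address, and the `resolve_worker` helper are hypothetical.

```python
def coalesce(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    raise ValueError("coalesce: every argument was null or empty")


SIZING_FIELDS = ("cpu_cores", "memory_mb", "boot_disk_gb", "data_disk_gb")


def resolve_worker(worker: dict, defaults: dict) -> dict:
    """Merge one worker's overrides over the global defaults."""
    resolved = {f: coalesce(worker.get(f), defaults.get(f)) for f in SIZING_FIELDS}
    # gpu_pci_id has no global default in this sketch: absent means no GPU.
    resolved["gpu_pci_id"] = worker.get("gpu_pci_id")
    # The commit message says q35 is selected automatically when a GPU is
    # present; None stands for "leave the machine type unset" (assumption).
    resolved["machine"] = "q35" if resolved["gpu_pci_id"] else None
    return resolved


# Hypothetical global defaults and one manually upgraded worker.
defaults = {"cpu_cores": 4, "memory_mb": 8192, "boot_disk_gb": 50, "data_disk_gb": 100}
gpu_worker = {"memory_mb": 24576, "boot_disk_gb": 100, "gpu_pci_id": "0000:01:00.0"}

print(resolve_worker(gpu_worker, defaults))
print(resolve_worker({}, defaults))
```

The point of the pattern is that a worker with no overrides still resolves to the global defaults, so existing tfvars keep working while upgraded nodes can be described exactly.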
HOMELAB-406: reduce Ollama resource requests to free capacity
Some checks failed
Release / release (pull_request) Has been cancelled
8f7716aebd
Lower CPU request from 2 cores to 500m and memory request from 8Gi
to 2Gi. Limits unchanged (4 CPU / 16Gi). Ollama is GPU-bound so
high CPU/memory reservations waste scheduler capacity. Frees ~1.5 CPU
and 6Gi RAM on prod-wk-01 to absorb workloads during prod-wk-03
removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
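
The freed capacity quoted in the Ollama commit follows directly from the old and new requests. A small sketch of that subtraction, assuming the requests are written as Kubernetes quantities ("2", "500m", "8Gi", "2Gi"); the helper names are hypothetical and only the quantity forms used here are handled.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_gib(quantity: str) -> int:
    """Convert a whole-GiB quantity such as "8Gi" to an integer."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"only Gi quantities are handled here: {quantity!r}")
    return int(quantity[:-2])


old_requests = {"cpu": "2", "memory": "8Gi"}
new_requests = {"cpu": "500m", "memory": "2Gi"}

freed_cpu_m = cpu_millicores(old_requests["cpu"]) - cpu_millicores(new_requests["cpu"])
freed_mem_gi = memory_gib(old_requests["memory"]) - memory_gib(new_requests["memory"])
print(f"freed {freed_cpu_m}m CPU and {freed_mem_gi}Gi memory")
```

Only requests change, so the scheduler sees more allocatable capacity on the node while the pod can still burst up to its unchanged limits.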
feat(sandbox): complete Helm chart templates
Some checks failed
0/0 projects applied successfully.
Release / release (pull_request) Has been cancelled
ffbb843dd6
- PVC: 25 GiB Longhorn with helm.sh/resource-policy: keep
- ServiceAccount: dedicated SA, no cluster role bindings
- LimitRange: container resource caps (6 CPU / 12 Gi max)
- CiliumNetworkPolicy: D16 egress allowlist (DNS, Temporal, Lab Director,
  Forgejo, Harbor, Langfuse, OTel, Grafana, internet:443, DinD:2376)
- Deployment: main sandbox container + DinD sidecar (privileged)
  with workspace PVC, SSH keys, git credentials, Claude OAuth,
  Claude settings, and MCP config mounts
- No kubeconfig or ArgoCD secrets mounted (D8 security boundary)
HOMELAB-424: fix(sandbox): align chart values with Supervisor ArgoCD parameters
Some checks failed
0/0 projects applied successfully.
Release / release (pull_request) Has been cancelled
35314a313c
Restructure values.yaml to match the --set parameters the Supervisor sends:
- profileId -> sandbox.profileId, sandbox.name, sandbox.profileName
- workspace.storageSize -> persistence.size
- Add persistence.existingClaim support (skip PVC creation when set)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(sso): add Temporal UI OIDC app in Zitadel — HOMELAB-448
Some checks failed
0/1 projects planned successfully.
Release / release (pull_request) Has been cancelled
7a26b5e788
Register Temporal as an OIDC application in Zitadel and create a K8s
secret with client credentials for the Temporal UI native auth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: resolve critical alerts — Talos compat, ArgoCD OOM, Longhorn replicas
Some checks failed
0/0 projects applied successfully.
Release / release (pull_request) Has been cancelled
18a00e9d13
- Disable kubeControllerManager, kubeScheduler, kubeProxy ServiceMonitors
  in kube-prometheus-stack: Talos binds these to localhost (unscrapable)
  and uses Cilium instead of kube-proxy, so this eliminates false critical alerts
- Bump ArgoCD application-controller memory limit 1Gi → 2Gi to prevent
  OOMKills under load with many managed applications
- Set Longhorn defaultReplicaCount to 2 (matches 2-node storage topology),
  enable replicaAutoBalance and bump concurrent rebuild limit to 5

Part of HOMELAB-542 infrastructure alerts triage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge pull request 'fix: resolve critical alerts — Talos compat, ArgoCD OOM, Longhorn replicas' (#130) from fix/critical-alerts-triage-homelab-542 into main
Some checks failed
0/1 projects planned successfully.
CI Review / ai-review (pull_request) Has been cancelled
CI Review / helm-validate (pull_request) Has been cancelled
CI Review / pr-title (pull_request) Has been cancelled
Lint & Validate / shellcheck (pull_request) Has been cancelled
Lint & Validate / yaml-lint (pull_request) Has been cancelled
Lint & Validate / terraform-validate (pull_request) Has been cancelled
7a8d14b081
Collaborator

Ran Plan for dir: core/terraform/live/zitadel workspace: default

Plan Error

running 'sh -c' '/atlantis-data/bin/terraform1.14.8 init -input=false -upgrade' in '/atlantis-data/repos/aaron/infra-core/131/default/core/terraform/live/zitadel': exit status 1
Initializing the backend...
╷
│ Error: Missing Required Value
│ 
│   on versions.tf line 4, in terraform:
│    4:   backend "s3" {
│ 
│ The attribute "bucket" is required by the backend.
│ 
│ Refer to the backend documentation for additional information which
│ attributes are required.
╵
╷
│ Error: Missing Required Value
│ 
│   on versions.tf line 4, in terraform:
│    4:   backend "s3" {
│ 
│ The attribute "key" is required by the backend.
│ 
│ Refer to the backend documentation for additional information which
│ attributes are required.
╵
╷
│ Error: Missing region value
│ 
│   on versions.tf line 4, in terraform:
│    4:   backend "s3" {
│ 
│ The "region" attribute or the "AWS_REGION" or "AWS_DEFAULT_REGION"
│ environment variables must be set.
╵
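
The three errors above all point at the same `backend "s3"` block: `bucket` and `key` are missing, and no region was found in the block or in the `AWS_REGION` / `AWS_DEFAULT_REGION` environment variables. One possible reading, which this page does not confirm, is that the block is left partial and the values are meant to be supplied at init time (for example with `terraform init -backend-config=...`), which did not happen in this run. The sketch below only mirrors the checks the error messages describe; it is not Terraform's implementation, and the function name and sample values are hypothetical.

```python
REQUIRED_ATTRIBUTES = ("bucket", "key")
REGION_ENV_VARS = ("AWS_REGION", "AWS_DEFAULT_REGION")


def missing_s3_backend_settings(backend: dict, env: dict) -> list:
    """List the settings the plan output above would complain about."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not backend.get(name)]
    # The region may come from the backend block or from either env var.
    has_region = bool(backend.get("region")) or any(
        env.get(var) for var in REGION_ENV_VARS
    )
    if not has_region:
        missing.append("region")
    return missing


# An empty backend block with no AWS environment, as in the failed run.
print(missing_s3_backend_settings({}, {}))

# Hypothetical values supplied at init time.
supplied = {"bucket": "example-state", "key": "zitadel/terraform.tfstate"}
print(missing_s3_backend_settings(supplied, {"AWS_REGION": "us-east-1"}))
```

A check like this could run before the plan step to fail fast with a single message instead of three separate init errors.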

aaron closed this pull request 2026-04-26 08:00:51 +00:00
