HOMELAB-970: spec — SSD-tier migration design #181

Open
aaron wants to merge 1 commit from plane/HOMELAB-970-ssd-tier-migration-spec into live

Summary

Design doc for moving 39 latency-sensitive PVCs onto a dedicated longhorn-fast StorageClass backed by ssd-mirror. HDD topology changes (re-mirror after sdb detach, add hdd-bulk pool, NFS media server, ARR stack) are explicitly deferred to follow-up tickets and are NOT part of this work stream.

What's in the spec

  • Per-worker topology: 150 GB SSD-fast data disk per worker on ssd-mirror, tagged fast in Longhorn.
  • Keeps existing 1500 GB HDD virtio1 data disks attached on wk-01/02; wk-03 gets no HDD disk at all.
  • longhorn-fast SC (diskSelector=fast, 2 replicas, Retain) + edit longhorn-db to add diskSelector: "fast" so CNPG lands on SSD.
  • Keeps longhorn (default) untouched for backward compat.
  • 4-phase migration: infra → urgent 4 (already-full Prom TSDBs + ClickHouse + runner cache) → CNPG in-place switchovers → remaining 21 stateless → eventual HDD-tier decommission.
  • Fixes the audit finding that wk-02 Longhorn allowScheduling was false, causing all replicas to live on wk-01.
  • Capacity math: ssd-mirror is 75% committed post-migration (3 × 150 GB data disks + 264 GB of boot disks = 714 GB).
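
The capacity and phase figures in the bullets above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not part of the spec: the function and constant names are hypothetical, the pool size is not stated in this PR and is only back-derived from the quoted 75% figure, and the CNPG count assumes the three migration groups partition the 39 PVCs exactly.

```python
# Figures quoted in the PR description.
DATA_DISK_GB = 150         # SSD-fast data disk per worker
WORKERS = 3                # wk-01, wk-02, wk-03
BOOT_DISKS_GB = 264        # total boot-disk footprint on ssd-mirror
QUOTED_COMMITMENT = 0.75   # "75% ssd-mirror commitment post-migration"

TOTAL_PVCS = 39
URGENT_PVCS = 4
REMAINING_STATELESS_PVCS = 21


def committed_gb(data_disk_gb=DATA_DISK_GB, workers=WORKERS, boot_gb=BOOT_DISKS_GB):
    """Space committed on ssd-mirror after the migration, in GB."""
    return data_disk_gb * workers + boot_gb


def implied_pool_gb(committed, commitment=QUOTED_COMMITMENT):
    """Pool size implied by the quoted commitment ratio (an inference, not a stated figure)."""
    return committed / commitment


def implied_cnpg_pvcs(total=TOTAL_PVCS, urgent=URGENT_PVCS, remaining=REMAINING_STATELESS_PVCS):
    """PVCs left for the CNPG phase, assuming the phases partition the total."""
    return total - urgent - remaining


if __name__ == "__main__":
    committed = committed_gb()
    pool = implied_pool_gb(committed)
    print(f"committed on ssd-mirror: {committed} GB")
    print(f"implied pool size:       {pool:.0f} GB")
    print(f"implied headroom:        {pool - committed:.0f} GB")
    print(f"implied CNPG PVC count:  {implied_cnpg_pvcs()}")
```

If the real usable size of ssd-mirror differs from the implied value, the 75% figure in the spec should be rechecked against the actual pool.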

Specialist agent reports (archived)

Spec built with 3 dispatched specialist agents:

  • PVC audit (full per-PVC inventory + tier recommendations)
  • ARR stack architect (the deferred media work, parked but referenced)
  • Longhorn storage architect (disk registration, SC, migration mechanics)

Reports at /tmp/dispatch-storage-audit-pvcs.md, /tmp/dispatch-arr-stack-plan.md, /tmp/dispatch-longhorn-tiered-design.md on the Mac.

Next steps (after this PR merges)

  1. Aaron reviews spec.
  2. Invoke superpowers:writing-plans skill to produce the step-by-step implementation plan.
  3. Implementation plan becomes a follow-up PR that edits the talos-cluster + proxmox-vm modules and the prod tfvars.

Test plan

  • Aaron reads the spec and confirms scope, phases, and capacity math.
  • Follow-up implementation plan ticket / PR tracks actual code changes.

🤖 Generated with Claude Code

HOMELAB-970: spec — homelab SSD-tier migration design
Some checks failed
CI Review / pr-title (pull_request) Has been cancelled
CI Review / helm-validate (pull_request) Has been cancelled
CI Review / ai-review (pull_request) Has been cancelled
Lint & Validate / terraform-validate (pull_request) Has been cancelled
Lint & Validate / yaml-lint (pull_request) Has been cancelled
Lint & Validate / shellcheck (pull_request) Has been cancelled
1146272d8b
Adds a dedicated longhorn-fast StorageClass backed by ssd-mirror (150 GB
SSD-fast data disk per worker) and migrates 39 latency-sensitive PVCs
off the current hdd-mirror-backed longhorn default.

Out of scope (deferred to follow-up tickets): HDD topology changes
(re-mirror, add hdd-bulk), NFS media server, ARR stack.

Designed with dispatched specialist agents (PVC audit, ARR architect,
Longhorn storage architect). Specialist reports archived at
/tmp/dispatch-*.md on the Mac for reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin plane/HOMELAB-970-ssd-tier-migration-spec:plane/HOMELAB-970-ssd-tier-migration-spec
git switch plane/HOMELAB-970-ssd-tier-migration-spec
Reference
aaron/infra-core!181