Complete operational knowledge for deploying and operating production Ceph storage clusters with cephadm on bare metal. Covers RBD, CephFS, and RGW services across Ceph Reef (18.2.x) and Squid (19.2.x).
Install:

```shell
pipx install agentic-stacks   # if you haven't already
agentic-stacks pull agentic-stacks/ceph
```
Skills

- **concepts**: Ceph architecture, CRUSH maps, placement groups, pools, BlueStore
- **hardware-planning**: Disk sizing, CPU/RAM ratios, network bandwidth, node role planning
- **host-preparation**: OS prerequisites, NTP, firewall ports, container runtime, cephadm install
- **bootstrap**: cephadm bootstrap, initial MON/MGR/OSD deployment, dashboard setup
- **networking**: Public vs. cluster network design, VLANs, bonding, MTU configuration
- **services**: RBD pool creation, CephFS/MDS deployment, RGW S3 gateway setup
- **health-check**: Cluster health interpretation, OSD states, PG states, alert triage
- **scaling**: Adding/removing OSDs and hosts, expanding services, rebalancing
- **upgrades**: Rolling upgrades within and across Reef/Squid versions
- **backup-restore**: Pool snapshots, RBD mirroring, RGW multisite, disaster recovery
- **pool-management**: CRUSH rules, erasure-coding profiles, tiering, quotas
- **certificate-mgmt**: Dashboard TLS, RGW TLS, internal messenger encryption
- **troubleshooting**: Symptom-based diagnostic trees for common Ceph failure modes
- **performance**: Benchmarking with rados bench/fio, slow-OSD diagnosis, bottleneck identification
- **known-issues**: Version-specific bugs and workarounds for Reef and Squid
- **compatibility**: Ceph version × kernel × container image × client compatibility
- **decision-guides**: Replicated vs. erasure coding, BlueStore tuning, network topology
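As a taste of what the bootstrap, scaling, and health-check skills cover, a first-cluster workflow with cephadm looks roughly like this sketch (the MON IP, hostnames, and addresses are placeholders, not values from this package):

```shell
# Bootstrap a new cluster on the first host (10.0.0.1 is a placeholder MON IP)
cephadm bootstrap --mon-ip 10.0.0.1

# Add more hosts once the cluster's SSH key has been distributed to them
ceph orch host add node2 10.0.0.2
ceph orch host add node3 10.0.0.3

# Turn all unused, available disks on managed hosts into OSDs
ceph orch apply osd --all-available-devices

# Check overall cluster status and drill into any health warnings
ceph -s
ceph health detail
```

These commands require a live cluster and root/SSH access to the target hosts; the individual skills cover the prerequisites, variations, and failure modes around each step.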