What you'll practice
- Diagnosing high load: top, ps, and finding the process eating the box.
- Disk & memory issues: df -h, du, OOM kills, and full filesystems.
- Service failures: reading logs with journalctl and restarting units cleanly.
- Networking & DNS: ss, dig, and tracing why a connection hangs.
- A repeatable method (observe, hypothesise, narrow down, confirm) instead of guessing.
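As one illustration of the "observe, then narrow down" step for full filesystems, here is a minimal sketch of a filter over `df -P` output. The function name `full_mounts` and the 90% default threshold are assumptions made for this example, not part of the simulator, and the parsing assumes mount points without spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above a threshold (default 90%).
# Assumes the POSIX column layout, where field 5 is the capacity
# and field 6 is the mount point (no spaces in the mount point).
full_mounts() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                  # skip the header row
            use = $5
            sub(/%/, "", use)     # "100%" -> "100"
            if (use + 0 >= limit) {
                print $6, use "%"
            }
        }'
}

# Usage on a live box:
#   df -P | full_mounts 90
```

Once a mount point is flagged, something like `du -xh /var | sort -h | tail` is a common next step to find which directory is responsible.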
The loop you'll run
A node goes amber on the topology map. You SSH in and start narrowing it down:
$ ssh srv-web-01
$ top                                        # load average 18.4: something is pegging CPU
$ ps aux --sort=-%cpu | head
$ df -h                                      # /var at 100%: logs never rotated
$ journalctl -u nginx --since "10 min ago"
$ systemctl restart nginx
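A load average such as 18.4 only means something relative to the number of CPU cores on the box. The sketch below compares the two using a common rule of thumb (load above core count suggests contention). The function name `load_verdict` and the "high"/"ok" labels are assumptions for illustration, and the rule is a rough heuristic rather than a hard limit.

```shell
#!/bin/sh
# Hypothetical helper: is a 1-minute load average high for this core count?
# Prints "high" when the load exceeds the number of cores, otherwise "ok".
load_verdict() {
    load="$1"     # e.g. 18.4, the first field of /proc/loadavg
    cores="$2"    # e.g. the output of nproc
    awk -v l="$load" -v c="$cores" 'BEGIN {
        if (l + 0 > c + 0) {
            print "high"
        } else {
            print "ok"
        }
    }'
}

# Usage on a live Linux box (assumes /proc/loadavg and nproc are available):
#   load_verdict "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

On Linux the load average also counts tasks blocked in uninterruptible I/O, so a "high" verdict can point at a disk problem as well as a CPU one, which is why the walkthrough checks `df -h` after `ps`.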
Why a simulator beats a cheat sheet
Cheat sheets list commands; they don't build the instinct for which one to run next when you don't yet know what's wrong. Because the simulator's boxes break in realistic ways and the SLA is ticking, you practise the actual skill — reasoning your way from a symptom to a root cause — the same thing a troubleshooting interview or a real on-call page is testing.
Keep going
Troubleshooting is the foundation. Put it to work in a full incident response shift, automate the fixes with Ansible, and prep for your SRE interview.