Most production Linux incidents fall into a handful of shapes. Recognise the shape and you're halfway to the fix. Here are the usual suspects and where to look first.

1. The box is pegged (high load)

Start with top and ps aux --sort=-%cpu | head. A runaway process, a retry loop, or a cron job gone wrong is usually the culprit. High load with low CPU usage often means processes are blocked on I/O: look at the wa figure in top, then check the disks.
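
As a rough sketch of that first look, the snippet below compares the load average with the core count, lists the top CPU consumers, and shows tasks stuck in uninterruptible sleep. The function names load_per_core and triage_load are made up for illustration, and it assumes a Linux box with procps and /proc available.

```shell
# Sketch only: helper names are illustrative, not standard tools.

# Divide a load average by a core count. Values well above 1.00 mean
# tasks are queueing for CPU or blocked on I/O.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

triage_load() {
    cores=$(nproc)
    load1=$(cut -d' ' -f1 /proc/loadavg)
    echo "1-min load per core: $(load_per_core "$load1" "$cores")"

    # Header plus the ten hungriest processes by CPU.
    ps aux --sort=-%cpu | head -n 11

    # Tasks in state D (uninterruptible sleep) add to load without
    # burning CPU; a long list here points at I/O wait.
    ps -eo state,pid,comm | awk '$1 ~ /^D/'
}
```

Call triage_load on the affected box. If the load per core is high but the ps output shows little CPU use, the state D list is the next thing to read.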

2. Disk full

df -h shows which filesystem is at 100%, then du -sh /var/* | sort -h finds the offender — usually unrotated logs. Services fail in confusing ways when they can't write, so check this early.
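
A minimal sketch of the same check as a script, assuming GNU coreutils: it filters df -P output for filesystems above a threshold and then sizes the directories under a path. The names over_threshold and triage_disk and the 90% cut-off are arbitrary choices for illustration, and mount points containing spaces are not handled.

```shell
# Sketch only: helper names and the 90% threshold are illustrative.

# Read `df -P` output on stdin and print "mountpoint usage" for every
# filesystem at or above the given percentage.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

triage_disk() {
    dir="${1:-/var}"
    df -P | over_threshold 90

    # Largest entries under the directory, staying on one filesystem (-x).
    du -xsh "$dir"/* 2>/dev/null | sort -h | tail -n 5
}
```

Running triage_disk with no argument looks under /var; pass another path if df points at a different filesystem.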

3. A service won't stay up

systemctl status <service> and journalctl -u <service> --since "10 min ago" tell you why it died. An OOM kill shows up here and in dmesg.
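
Those checks can be strung together roughly as follows on a systemd host. The function names oom_lines and triage_service are hypothetical, the grep pattern is a guess at the usual kernel OOM wording rather than an exhaustive match, and the one-hour window for kernel messages is arbitrary.

```shell
# Sketch only: helper names and the OOM pattern are illustrative.

# Filter stdin for lines that look like kernel OOM-killer messages.
oom_lines() {
    grep -Ei 'out of memory|oom-kill|killed process'
}

triage_service() {
    unit="$1"

    # Current state and the last few log lines for the unit.
    systemctl status "$unit" --no-pager

    # Recent journal entries for the unit only.
    journalctl -u "$unit" --since "10 min ago" --no-pager

    # Kernel messages (the same stream dmesg reads), filtered for OOM kills.
    journalctl -k --since "1 hour ago" --no-pager | oom_lines
}
```

Usage would be something like triage_service nginx.service, where the unit name is whatever is failing on your box. An empty result from the last command suggests the kernel did not kill the process for memory within that window.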

4. "It's slow" / connections hang

Reach for dig (DNS resolving?), ss -tunap (what's listening / connected?), and a quick curl -v to localise where the request stalls.
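
One way to localise the stall is to let curl report how long each phase took. The sketch below assumes dig, ss, and curl are installed. The names url_host and triage_request are made up for illustration, and the host extraction is deliberately simple: it does not handle IPv6 literals or credentials in the URL.

```shell
# Sketch only: helper names are illustrative.

# Strip scheme, path/query/fragment, and port from a URL to get the host.
url_host() {
    printf '%s\n' "$1" |
        sed -E 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||; s|[/?#].*$||; s|:[0-9]+$||'
}

triage_request() {
    url="$1"
    host=$(url_host "$url")

    # Does the name resolve, and to what?
    dig +short "$host"

    # Sockets involving this box (listening and connected), with owning processes.
    ss -tunap | head -n 20

    # Cumulative timings in seconds: a big jump between two fields
    # shows which phase is slow.
    curl -sS -o /dev/null --max-time 10 \
        -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
        "$url"
}
```

Reading the curl line left to right: a large dns value points at resolution, a gap between dns and connect at the network path, and a gap between tls and ttfb at the server taking its time to respond.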

Build the reflex

Knowing the commands isn't the same as reaching for the right one under pressure. Drill these on live, broken boxes with Linux troubleshooting practice and a full incident response shift.
