|
|
@@ -0,0 +1,290 @@ |
|
|
# Troubleshooting |
|
|
|
|
|
## Intro |
|
|
|
|
|
The incident management steps I keep in mind when I am on call and get an alert are: |
|
|
|
|
|
- Verify the issue |
|
|
- Triage |
|
|
- Communicate and escalate if needed |
|
|
- Mitigate |
|
|
- Troubleshoot |
|
|
- Postmortem |
|
|
|
|
|
As a general troubleshooting or debugging technique: |
|
|
|
|
|
- Do not make things worse (e.g. don’t randomly change things you are not familiar with, and know which changes are hard to roll back). |
|
|
- Communicate with the team. Take notes as you go of what you see and what you change. A chat medium like Slack works well because it also keeps a timeline, but have a backup way of communicating as well. Communicate intent and be specific (e.g. “going to restart the x database on host y”). Acknowledge other people’s messages. Make sure everybody knows who has the controls so people don’t step on each other. Usually you want one person leading the troubleshooting and making the changes, with other people supporting by checking things and communicating with Customer Support and other teams. |
|
|
- Try to divide the problem space, ideally in half. You don’t need to start in a strictly systematic way if you have strong historical indicators of where the problems have been. |
|
|
- Test what has worked before. (But if you have to fix the exact same issue more than once or twice, that is a strong indicator of poor engineering practices.) |
|
|
- Run first the tests that are fast and still give relevant information. |
|
|
- If the quick initial tests fail, it’s often a good idea to pause, step back and restart debugging in a more systematic fashion, testing more basic assumptions and validating your mental model of how things are supposed to work with other people. |
|
|
|
|
|
### Linux server overview |
|
|
|
|
|
Review: |
|
|
|
|
|
``` |
|
|
# load: |
|
|
uptime |
|
|
|
|
|
# what the server does (listening sockets, process tree): |
|
|
netstat -tlpn # package net-tools |
|
|
ps auxf |
|
|
|
|
|
# memory: |
|
|
# vmstat |
|
|
# r: runnable (running or waiting to run in queue) |
|
|
# b: uninterruptible sleep (D in ps) |
|
|
vmstat # summary |
|
|
vmstat -w 1 5 # every 1 sec, 5 reports, wide output (first line is averages since boot) |
|
|
vmstat -s # summary memory stats |
|
|
|
|
|
free -m |
|
|
grep -i oom /var/log/messages # or /var/log/syslog on Debian/Ubuntu |
|
|
|
|
|
# CPU: |
|
|
top |
|
|
|
|
|
# package sysstat: |
|
|
mpstat -P ALL # cpu balance |
|
|
|
|
|
lscpu |
|
|
|
|
|
pidstat |
|
|
pidstat 1 |
|
|
pidstat -p $pid |
|
|
|
|
|
# disk: |
|
|
vmstat -d |
|
|
df -h |
|
|
df -i |
|
|
iostat -xz 1 |
|
|
|
|
|
# biggest directories on the / filesystem (sizes in MB, subdirectories counted separately): |
|
|
du -mxS / |sort -n|tail -10 |
|
|
|
|
|
# network: |
|
|
# package sysstat |
|
|
sar -n DEV 1 # network throughput |
|
|
sar -n TCP,ETCP 1 # TCP stats (also: ss -s) |
|
|
|
|
|
# distro: |
|
|
cat /etc/debian_version |
|
|
lsb_release -a # apt-get install lsb-release |
|
|
|
|
|
# boot |
|
|
dmesg |tail |
|
|
last -a |
|
|
``` |
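
The load figures from `uptime` only mean something relative to the number of CPUs. The following is a minimal sketch of that comparison; the function name `load_per_cpu` is hypothetical, and it assumes a Linux host with `/proc/loadavg` and `nproc` available.

```shell
# Hypothetical helper: print the 1-minute load average divided by the CPU count.
# $1: loadavg file (default /proc/loadavg), $2: CPU count (default: nproc).
load_per_cpu() {
    loadavg_file="${1:-/proc/loadavg}"
    cpus="${2:-$(nproc)}"
    awk -v n="$cpus" '{ printf "%.2f\n", $1 / n }' "$loadavg_file"
}
```

Run `load_per_cpu` with no arguments on the affected host. A value that stays well above 1 suggests tasks are queuing, but on Linux the load average also counts tasks in uninterruptible sleep (the `b` column in `vmstat`), so a high value can point at I/O rather than CPU.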
|
|
|
|
|
### Logging |
|
|
|
|
|
``` |
|
|
journalctl |
|
|
journalctl -n 20 --no-pager -u nginx # last lines for a specific unit |
|
|
journalctl --since yesterday --until "1 hour ago" # also accepts dates and times such as 2022-12-24 or 08:00 |
|
|
journalctl -k # kernel messages (like dmesg) |
|
|
journalctl -p err # priority err and more severe; numeric levels 0, 1, 2, 3 ... also work |
|
|
``` |
|
|
|
|
|
``` |
|
|
dmesg | tail |
|
|
tail /var/log/messages # or /var/log/syslog, /var/log/kern.log |
|
|
``` |
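
When the kernel log is long, it helps to pull out only the OOM killer's victims. This is a small sketch, assuming the kernel message contains the usual `Killed process <pid> (<name>)` text; the function name `oom_victims` is hypothetical.

```shell
# Hypothetical helper: list processes killed by the kernel OOM killer.
# Reads the log files given as arguments, or stdin when none are given.
oom_victims() {
    grep -Eio 'killed process [0-9]+ \([^)]*\)' "$@"
}
```

Possible uses are `journalctl -k --no-pager | oom_victims` or `oom_victims /var/log/syslog`, depending on where the distribution keeps kernel messages.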
|
|
|
|
|
|
|
|
### systemd |
|
|
|
|
|
``` |
|
|
systemctl # same as systemctl list-units |
|
|
systemctl list-unit-files # shows each unit file's state, including masked (won't start until you run systemctl unmask) |
|
|
systemctl reload unit # ask the service to reload its configuration (after editing or installing unit files, run systemctl daemon-reload) |
|
|
|
|
|
systemctl --failed |
|
|
systemd-analyze # startup time, append 'blame' for breakdown |
|
|
``` |
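
The output of `systemctl --failed` can be combined with `journalctl -u` to see why each unit failed. The sketch below assumes the plain, legend-free listing format where the unit name is the first column; both function names are hypothetical.

```shell
# Hypothetical helper: extract unit names (first column) from
# `systemctl list-units --failed --no-legend --plain` output on stdin.
failed_unit_names() {
    awk 'NF { print $1 }'
}

# Hypothetical helper: print the last journal lines of every failed unit.
# $1: number of lines per unit (default 20).
failed_unit_logs() {
    systemctl list-units --failed --no-legend --plain | failed_unit_names |
    while read -r unit; do
        echo "== $unit"
        journalctl -u "$unit" -n "${1:-20}" --no-pager
    done
}
```

Calling `failed_unit_logs 50` would print up to 50 recent lines per failed unit.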
|
|
|
|
|
### Filesystems and volumes |
|
|
|
|
|
``` |
|
|
fdisk -l |
|
|
df -lT # -l local, -T type |
|
|
lsblk -f # filesystem |
|
|
file -s /dev/hda1 |
|
|
blkid /dev/hda1 |
|
|
|
|
|
mount |
|
|
cat /etc/fstab |
|
|
|
|
|
fsck.ext4 -p /dev/sda1 # check and auto-fix if marked dirty (run on an unmounted filesystem; -f forces a check) |
|
|
|
|
|
xfs_repair -n /dev/sda # scan only, no modifications |
|
|
xfs_repair /dev/sda # scan and fix |
|
|
``` |
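
A quick way to spot full filesystems is to filter `df` output by a usage threshold. This is a sketch under the assumption that the input is in the POSIX format produced by `df -P` and that mount points contain no spaces; the function name `full_mounts` is hypothetical.

```shell
# Hypothetical helper: print mount points at or above a usage threshold.
# Expects `df -P` output on stdin (column 5: capacity, column 6: mount point).
# $1: threshold in percent (default 90).
full_mounts() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}
```

For example, `df -P | full_mounts 80` checks disk space and `df -Pi | full_mounts 80` checks inode usage, since both listings keep the percentage in the fifth column.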
|
|
|
|
|
### Networking |
|
|
|
|
|
``` |
|
|
ss -s |
|
|
netstat -s |
|
|
netstat -i |
|
|
ip -s link |
|
|
ifconfig |
|
|
lsof -i |
|
|
sar -n DEV |
|
|
|
|
|
ip route |
|
|
netstat -r |
|
|
|
|
|
iptables -L |
|
|
iptables -t nat -L # NAT table; plain -L only lists the filter table |
|
|
``` |
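
`ss -s` gives totals; to see how TCP sockets are spread across states, the per-socket listing can be summarised. The sketch assumes `ss -tan` output, with a header line and the state in the first column; the function name `tcp_states` is hypothetical.

```shell
# Hypothetical helper: count TCP sockets per state, most frequent first.
# Expects `ss -tan` output on stdin.
tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Usage would be `ss -tan | tcp_states`. A large number of sockets in one state (for instance `TIME-WAIT` or `CLOSE-WAIT`) is a hint about where to look next, not a diagnosis by itself.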
|
|
|
|
|
curl options: |
|
|
|
|
|
``` |
|
|
curl -v |
|
|
curl -I # HEAD request, show response headers only |
|
|
curl -L # follow redirects (Location headers) |
|
|
curl -O # save the download under its remote file name |
|
|
``` |
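
Beyond these options, curl's `--write-out` (`-w`) variables can show where the time of a request goes. A minimal sketch; the function name `curl_timing` and the output labels are hypothetical, while the `%{...}` variables are curl's own.

```shell
# Hypothetical helper: print a timing breakdown for a single request.
# All arguments are passed to curl (URL plus any extra options).
curl_timing() {
    curl -s -o /dev/null \
        -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s first_byte=%{time_starttransfer}s total=%{time_total}s http=%{http_code}\n' \
        "$@"
}
```

For example, `curl_timing -L https://example.com/`. Each value is cumulative from the start of the request, so a large gap between `connect` and `first_byte` points at the server side rather than at DNS or the network.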
|
|
|
|
|
NIC configuration: |
|
|
|
|
|
``` |
|
|
# /etc/sysconfig/network-scripts/ifcfg-eth0 (RHEL/CentOS) |
|
|
DEVICE="eth0" |
|
|
BOOTPROTO="dhcp" |
|
|
``` |
|
|
|
|
|
### Kernel |
|
|
|
|
|
``` |
|
|
uname -a |
|
|
sysctl -a |
|
|
``` |
|
|
|
|
|
### strace |
|
|
|
|
|
``` |
|
|
strace -p $pid # running program |
|
|
strace -c $program # run & summary |
|
|
strace -e trace=write -p $pid # only trace write() calls |
|
|
``` |
|
|
|
|
|
More per-process information is available under `/proc/$pid/`. |
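
As an illustration of what `/proc/$pid/` offers, the `status` file holds a readable summary of the process. The sketch below picks a few of its fields; the function name `proc_summary` is hypothetical.

```shell
# Hypothetical helper: print selected fields from a /proc/<pid>/status file.
# $1: path to the status file (default: the current shell's own).
proc_summary() {
    grep -E '^(Name|State|PPid|Threads|VmRSS):' "${1:-/proc/$$/status}"
}
```

Usage: `proc_summary /proc/$pid/status`. Other useful entries in the same directory include `cmdline`, `cwd`, `exe`, `fd/` (open file descriptors) and `limits`.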
|
|
|
|
|
### SSL/TLS |
|
|
|
|
|
``` |
|
|
openssl x509 -in /path/to/server/certificate -text -noout # print certificate details |
|
|
openssl s_client -connect example.com:443 # then type the request below, followed by an empty line |
|
|
HEAD / HTTP/1.1 |
|
|
Host: example.com |
|
|
|
|
|
openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null | openssl x509 -text -noout |
|
|
``` |
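
Expired certificates are a common cause of TLS incidents, and `openssl x509 -checkend` can test for an upcoming expiry. This is a sketch, assuming a PEM certificate file; the function name `cert_expires_soon` and its messages are hypothetical.

```shell
# Hypothetical helper: warn if a certificate expires within N days.
# $1: certificate file (PEM), $2: number of days (default 30).
# -checkend takes seconds and exits non-zero if the certificate expires in that window.
cert_expires_soon() {
    days="${2:-30}"
    if openssl x509 -in "$1" -noout -checkend "$((days * 86400))" >/dev/null; then
        echo "ok: valid for more than $days days"
    else
        echo "warning: expires within $days days"
        return 1
    fi
}
```

Usage: `cert_expires_soon /path/to/server/certificate 14`. Note that an unreadable file also takes the warning branch in this sketch.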
|
|
|
|
|
### cgroups |
|
|
|
|
|
ulimits for login sessions (such as the current Bash session) are set in `/etc/security/limits.conf`. To see the limits that apply to a given user: |
|
|
|
|
|
`su - username -c 'ulimit -a'` |
|
|
|
|
|
`cat /proc/cgroups` |
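
To find which cgroup a process belongs to, `/proc/<pid>/cgroup` can be read directly. The sketch below assumes cgroup v2, where that file holds a single line of the form `0::/some/path`; the function name `cgroup_v2_path` is hypothetical.

```shell
# Hypothetical helper: print the cgroup v2 path of a process.
# $1: a /proc/<pid>/cgroup file (default: /proc/self/cgroup).
cgroup_v2_path() {
    awk -F'::' '$1 == "0" { print $2 }' "${1:-/proc/self/cgroup}"
}
```

Assuming the unified hierarchy is mounted at `/sys/fs/cgroup`, the memory limit of a process could then be read with `cat "/sys/fs/cgroup$(cgroup_v2_path /proc/$pid/cgroup)/memory.max"`.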
|
|
|
|
|
### Docker |
|
|
|
|
|
``` |
|
|
docker ps -a |
|
|
docker stats --all |
|
|
|
|
|
docker logs <container> |
|
|
docker inspect <container> |
|
|
docker diff <container> # files changed |
|
|
docker top <container> |
|
|
|
|
|
# update memory/CPU settings of a running container (see docker update --help): |
|
|
docker update -m 10M -c 2 <container> # -m memory limit, -c CPU shares (relative weight) |
|
|
|
|
|
# override entrypoint or command: |
|
|
docker run -it --entrypoint /bin/bash <image> |
|
|
docker run -it <image> /bin/bash |
|
|
``` |
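
Before changing limits with `docker update`, it can help to list the containers that are close to their memory limit. A sketch, assuming input lines of the form `<name> <mem%>` as produced by the `--format` string shown in the comment; the function name `high_mem_containers` is hypothetical.

```shell
# Hypothetical helper: print containers at or above a memory usage threshold.
# Expects "<name> <mem%>" lines on stdin, e.g. from:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'
# $1: threshold in percent (default 80).
high_mem_containers() {
    awk -v limit="${1:-80}" '{
        pct = $2
        sub(/%/, "", pct)
        if (pct + 0 >= limit) print $1, $2
    }'
}
```

Usage: `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | high_mem_containers 90`.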
|
|
|
|
|
### Kubernetes |
|
|
|
|
|
``` |
|
|
kubectl cluster-info |
|
|
kubectl get events --sort-by=.metadata.creationTimestamp |
|
|
kubectl get pods --show-labels -o wide |
|
|
kubectl top node my-node |
|
|
kubectl api-resources |
|
|
kubectl explain pods |
|
|
|
|
|
kubectl rollout history deployment/frontend |
|
|
kubectl rollout undo deployment/frontend --to-revision=3 |
|
|
kubectl rollout restart deployment/frontend |
|
|
|
|
|
kubectl logs mypod --since 2m |
|
|
kubectl logs mypod --previous |
|
|
|
|
|
# CrashLoopBackOff: the container keeps exiting, e.g. an image with a bad CMD (a failed image pull shows ImagePullBackOff instead). |
|
|
# To debug, deploy the image with a sleep command and exec into the pod. |
|
|
|
|
|
kubectl describe ingress myingress |
|
|
kubectl port-forward svc/my-service 5000 |
|
|
|
|
|
kubeval my-invalid.yaml |
|
|
kubectl diff -f ./my-manifest.yaml |
|
|
|
|
|
# https://kubernetes.io/docs/tasks/debug-application-cluster/debug-running-pod/ |
|
|
|
|
|
kubectl debug -it yourpod --image=busybox:1.28 --target=yourcontainer # --target is the name of a container in the pod |
|
|
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug |
|
|
kubectl debug myapp -it --copy-to=myapp-debug -- sh |
|
|
kubectl debug node/mynode -it --image=ubuntu |
|
|
``` |
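
In a busy namespace, the pod listing can be filtered down to the pods that need attention. This sketch assumes the default column layout of `kubectl get pods --no-headers` for a single namespace (NAME READY STATUS RESTARTS AGE); the function name `unhealthy_pods` is hypothetical.

```shell
# Hypothetical helper: print pods whose status is neither Running nor Completed.
# Expects `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3, "restarts=" $4 }'
}
```

Usage: `kubectl get pods --no-headers | unhealthy_pods`. With `-A` or `-o wide` the columns shift, so the field numbers would need adjusting. A pod can also be `Running` without being ready, which this sketch does not catch.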
|
|
|
|
|
### DNS |
|
|
|
|
|
`host example.com` |
|
|
|
|
|
From `dnsutils` package: |
|
|
|
|
|
`dig +short example.com` (answer only), `dig @ns_ip example.com` (query a specific nameserver) |
|
|
|
|
|
`nslookup example.com` : resolves and tells you what DNS server you are using. |
|
|
|
|
|
Critical files: |
|
|
|
|
|
``` |
|
|
/etc/nsswitch.conf # order of resolving |
|
|
/etc/resolv.conf # nameservers |
|
|
/etc/hosts # hard-coded hostname-ip maps |
|
|
``` |
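
When resolution is flaky, querying each configured nameserver separately shows whether one of them is misbehaving. A sketch, assuming `dig` is installed; the function names `nameservers` and `check_resolvers` are hypothetical.

```shell
# Hypothetical helper: list the nameserver addresses in a resolv.conf style file.
# $1: file to read (default /etc/resolv.conf).
nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: resolve a name against every configured nameserver.
# $1: name to resolve.
check_resolvers() {
    for ns in $(nameservers); do
        answer=$(dig +short +time=2 +tries=1 "@$ns" "$1" | head -n 1)
        echo "$ns: ${answer:-no answer}"
    done
}
```

Usage: `check_resolvers example.com`. On hosts running a local stub resolver, `/etc/resolv.conf` may list only a loopback address, in which case the upstream servers have to be looked up elsewhere.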
|
|
|
|
|
## Applications |
|
|
|
|
|
### nginx |
|
|
|
|
|
Test configuration: `nginx -t` |
|
|
Test and dump config: `nginx -T` |
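
The configuration test is most useful as a gate before reloading. A minimal sketch, assuming nginx runs as a systemd unit named `nginx`; the function name `nginx_safe_reload` is hypothetical.

```shell
# Hypothetical helper: reload nginx only if the configuration test passes.
nginx_safe_reload() {
    if nginx -t; then
        systemctl reload nginx
    else
        echo "nginx configuration test failed; not reloading" >&2
        return 1
    fi
}
```

This keeps a broken configuration from being applied, in line with the "do not make things worse" rule.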
|
|
|
|
|
### etcd |
|
|
|
|
|
``` |
|
|
etcdctl get --prefix --keys-only / |
|
|
etcdctl get "" --prefix=true --keys-only |
|
|
etcdctl endpoint status --write-out=table # json to get version |
|
|
etcdctl member list --write-out=table |
|
|
etcdctl alarm list |
|
|
|
|
|
curl http://127.0.0.1:2379/health |
|
|
|
|
|
grep "[CE] |" etcd.log |
|
|
grep "apply entries took too long" etcd.log |
|
|
|
|
|
etcdctl compact $revision # compact key history up to the given revision |
|
|
``` |
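
Compaction needs a revision number, which can be taken from the JSON endpoint status. The sketch below extracts it with `grep` rather than a JSON parser, assuming the output contains a `"revision":<number>` field; the function name `etcd_revision` is hypothetical.

```shell
# Hypothetical helper: extract the current revision from
# `etcdctl endpoint status --write-out=json` output on stdin.
etcd_revision() {
    grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}
```

A possible sequence when the database is running out of space is `rev=$(etcdctl endpoint status --write-out=json | etcd_revision)`, then `etcdctl compact "$rev"`, then `etcdctl defrag`, and finally `etcdctl alarm disarm` if an alarm was raised. Compaction discards history, so treat it as a change that cannot be rolled back.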