@last-dev
Forked from fduran/Troubleshooting.md
Created October 19, 2024 17:44

Revisions

  1. @fduran fduran revised this gist May 22, 2022. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions Troubleshooting.md
    @@ -98,6 +98,7 @@ tail /var/log/messages ( /var/log/syslog /var/log/kern.log )

    ```
    systemctl # same as systemctl list-units
    systemctl cat <service> # shows location and contents of config file for <service>
    systemctl list-unit-files # lists if they are masked (won't start, use unmask option)
    systemctl reload unit # reload options after changes, install
  2. @fduran fduran created this gist Apr 25, 2022.
    290 changes: 290 additions & 0 deletions Troubleshooting.md
    @@ -0,0 +1,290 @@
    # Troubleshooting

    ## Intro

    The incident management steps I have in mind when being on-call and getting an alert are:

    - Verify the issue
    - Triage
    - Communicate and escalate if needed
    - Mitigate
    - Troubleshoot
    - Postmortem

    As a general troubleshooting or debugging technique:

    - Do not make things worse (e.g., don’t randomly change things you are not familiar with, and know which changes are hard to roll back).
    - Communicate with the team. Take notes as you go of what you see and what you changed. A chat medium like Slack is good since it also keeps a timeline (we also want backup ways of communicating). Communicate intent and be specific (e.g., “going to restart the x database on y host”). Acknowledge other people’s messages. Make sure everybody knows who is at the controls so people don’t step on each other. Usually you want one person leading the troubleshooting and making the changes, with other people supporting by checking things, communicating with Customer Support and other teams, etc.
    - Try to divide the problem space, ideally in half. You don’t need to start in a strictly systematic way if you have strong historical indicators of where the problems have been.
    - Test what has worked before. (But if you have to fix the exact same issue more than once or twice, that is a strong indicator of poor engineering practices.)
    - Run first the tests that are fast and also give relevant information.
    - If the quick initial tests fail, it’s often a good idea to pause, step back and restart debugging in a more systematic fashion, testing more basic assumptions and validating your mental model of how things are supposed to work with other people.
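
    The “divide the problem space” advice can be sketched as a binary search over an ordered list of candidates (deploys, config revisions, hosts in a chain), assuming everything before some point is good and everything from that point on is bad. The names `first_bad` and `is_good` and the sample numbers are hypothetical, only for illustration.

    ```shell
    # first_bad CHECK ITEM...
    # Binary search for the first item where CHECK fails, assuming the items
    # are ordered and all good items come before all bad ones.
    first_bad() {
        check=$1; shift
        lo=1; hi=$#                    # 1-based positions in the item list
        found=""
        while [ "$lo" -le "$hi" ]; do
            mid=$(( (lo + hi) / 2 ))
            eval "item=\${$mid}"       # fetch the item at position $mid
            if "$check" "$item"; then
                lo=$(( mid + 1 ))      # still good: the problem starts later
            else
                found=$item            # bad: remember it, look for an earlier one
                hi=$(( mid - 1 ))
            fi
        done
        [ -n "$found" ] && printf '%s\n' "$found"
    }

    # Hypothetical check: revisions below 6 are known to be good.
    is_good() { [ "$1" -lt 6 ]; }
    ```

    With these assumptions, `first_bad is_good 1 2 3 4 5 6 7 8 9 10` prints `6` after three checks (5, 8, 6). In a real incident the check would be whatever quick test tells good from bad.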

    ### Linux server overview

    Review:

    ```
    # load:
    uptime
    # what is listening and running:
    netstat -tlpn # package net-tools
    ps auxf
    # memory:
    # vmstat columns:
    # r: runnable (running or waiting to run in queue)
    # b: uninterruptible sleep (D in ps)
    vmstat # summary
    vmstat -w 1 5 # every 1 sec, 5 reports, wide output (first line is averages since boot)
    vmstat -s # summary memory stats
    free -m
    grep -i oom /var/log/messages # or /var/log/syslog
    # CPU:
    top
    # package sysstat:
    mpstat -P ALL # cpu balance
    lscpu
    pidstat
    pidstat 1
    pidstat -p $pid
    # disk:
    vmstat -d
    df -h
    df -i
    iostat -xz 1
    # biggest directories on the / filesystem (in MB, not counting subdirectories):
    du -mxS / | sort -n | tail -10
    # network:
    # package sysstat
    sar -n DEV 1 # network throughput
    sar -n TCP,ETCP 1 # TCP stats (also: ss -s)
    # distro:
    cat /etc/debian_version
    lsb_release -a # apt-get install lsb-release
    # boot
    dmesg |tail
    last -a
    ```
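
    The load from `uptime` is easier to read relative to the number of CPUs. A minimal sketch, assuming a Linux host with `/proc/loadavg` and `nproc` (coreutils); the helper name `load_per_core` is made up for this example.

    ```shell
    # load_per_core LOAD NCPU
    # Prints the load average divided by the number of CPUs, two decimals.
    load_per_core() {
        awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
    }

    # On a live Linux host (assumes /proc/loadavg and nproc are available):
    #   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
    ```

    As a rough rule of thumb, a value that stays well above 1 suggests more runnable (or uninterruptible) tasks than CPUs; check the `r` and `b` columns of `vmstat` to see which.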

    ### Logging

    ```
    journalctl
    journalctl -n 20 --no-pager -u nginx # last lines for a specific unit
    journalctl --since yesterday --until "1 hour ago" # also takes dates and times: 2022-12-24, 08:00 ...
    journalctl -k # kernel messages, like dmesg
    journalctl -p err # priority err (3) and more severe; levels go from 0 (emerg) to 7 (debug)
    ```

    ```
    dmesg | tail
    tail /var/log/messages # or /var/log/syslog, /var/log/kern.log
    ```


    ### systemd

    ```
    systemctl # same as systemctl list-units
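    systemctl list-unit-files # shows unit file state, including masked (won't start; use systemctl unmask)
    systemctl reload <unit> # ask the service to reload its own configuration
    systemctl daemon-reload # re-read unit files after changing or installing them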
    systemctl --failed
    systemd-analyze # startup time, append 'blame' for breakdown
    ```
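
    A small sketch that ties `systemctl --failed` to the journal: extract the names of failed units so their recent logs can be pulled in one go. It assumes the `--state=failed --no-legend --plain` output format of `systemctl list-units`, where the unit name is the first column; `failed_units` is a hypothetical helper name.

    ```shell
    # failed_units: read `systemctl list-units --state=failed --no-legend --plain`
    # output on stdin and print only the unit names (first column).
    failed_units() {
        awk 'NF { print $1 }'
    }

    # On a live host, show the last log lines of every failed unit:
    #   systemctl list-units --state=failed --no-legend --plain | failed_units |
    #       while read -r unit; do journalctl -n 20 --no-pager -u "$unit"; done
    ```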

    ### Filesystems and volumes

    ```
    fdisk -l
    df -lT # -l local, -T type
    lsblk -f # filesystem
    file -s /dev/sda1
    blkid /dev/sda1
    mount
    cat /etc/fstab
    fsck.ext4 -p /dev/sda1 # check and fix, if dirty (run on an unmounted filesystem)
    xfs_repair -n /dev/sda # scan only, no changes (run on an unmounted filesystem)
    xfs_repair /dev/sda # scan and fix
    ```
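
    To spot full filesystems quickly, the `df` output can be filtered by usage. A minimal sketch, assuming the POSIX output format of `df -P` (usage in column 5, mount point in column 6) and mount points without spaces; `fs_over` is a made-up helper name.

    ```shell
    # fs_over PCT: read `df -P` output on stdin and print "mountpoint use%"
    # for every filesystem whose usage is at or above PCT.
    fs_over() {
        awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use)
                                    if (use + 0 >= limit + 0) print $6, $5 }'
    }

    # Live usage: df -P | fs_over 90
    ```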

    ### Networking

    ```
    ss -s
    netstat -s
    netstat -i
    ip -s link
    ifconfig
    lsof -i
    sar -n DEV
    ip route
    netstat -r
    iptables -L
    iptables -t nat -L # NAT rules; plain -L only shows the filter table
    ```
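
    Beyond the `ss -s` summary, counting sockets per TCP state is often a fast way to see if connections are piling up. A sketch, assuming the default `ss -tan` layout with a header line and the state in the first column; `tcp_states` is a hypothetical name.

    ```shell
    # tcp_states: read `ss -tan` output on stdin and count sockets per state,
    # most frequent state first.
    tcp_states() {
        awk 'NR > 1 { count[$1]++ }
             END { for (s in count) print count[s], s }' | sort -rn
    }

    # Live usage: ss -tan | tcp_states
    ```

    A large and growing number of `CLOSE-WAIT` sockets usually points at the local application not closing connections, but treat that as a hint, not a diagnosis.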

    curl options:

    ```
    curl -v
    curl -I # header info
    curl -L # follow redirects (Location headers)
    curl -O # save to a file named like the remote file
    ```

    NIC configuration (RHEL/CentOS style):

    ```
    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE="eth0"
    BOOTPROTO="dhcp"
    ```

    ### Kernel

    ```
    uname -a
    sysctl -a
    ```

    ### strace

    ```
    strace -p $pid # running program
    strace -c $program # run & summary
    strace -e trace=write -p $pid # filter by syscall
    ```

    There is also per-process information under `/proc/$pid/`.
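
    As an example of what `/proc/$pid/` offers, `/proc/$pid/status` holds `Name:`, `State:`, `VmRSS:` and similar fields, one per line. A minimal sketch for reading one field; the helper name `proc_field` is made up, and it takes the file path as an argument so it can be tried on a copy.

    ```shell
    # proc_field FILE FIELD: print the value of FIELD from a /proc/PID/status
    # style file (lines of the form "Field:<whitespace>value").
    proc_field() {
        awk -F':[ \t]*' -v key="$2" '$1 == key { print $2 }' "$1"
    }

    # Live usage: proc_field "/proc/$pid/status" State
    ```

    A state of `D (disk sleep)` matches the `b` column in `vmstat` and the `D` state in `ps`.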

    ### SSL/TLS

    ```
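    openssl x509 -in /path/to/server/certificate -text
    openssl s_client -connect example.com:443
    # then type the request into the open session, followed by an empty line:
    #   HEAD / HTTP/1.1
    #   Host: example.com
    # print the server certificate (</dev/null closes the session instead of waiting for input):
    openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null | openssl x509 -text -noout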
    ```

    ### cgroups

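    Per-user limits (ulimits) for new login sessions are set in `/etc/security/limits.conf`. To see the limits a user gets:

    `su - username -c 'ulimit -a'`

    To list the cgroup controllers the kernel supports:

    `cat /proc/cgroups`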
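
    The limits actually in effect for a running process can differ from what `limits.conf` says (for example for services not started through a login session), and they are visible in `/proc/$pid/limits`. A sketch that pulls out the soft limit for open files; `open_files_limit` is a hypothetical helper name.

    ```shell
    # open_files_limit FILE: print the soft "Max open files" limit from a
    # /proc/PID/limits style file (soft limit is the 4th whitespace field).
    open_files_limit() {
        awk '/^Max open files/ { print $4 }' "$1"
    }

    # Live usage: compare the limit with the descriptors currently in use:
    #   open_files_limit "/proc/$pid/limits"
    #   ls "/proc/$pid/fd" | wc -l
    ```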

    ### Docker

    ```
    docker ps -a
    docker stats --all
    docker logs <container>
    docker inspect <container>
    docker diff <container> # files changed
    docker top <container>
    docker update --help
    # update memory/CPU settings of a running container (-m memory limit, -c CPU shares):
    docker update -m 10M -c 2 <container>
    # override entrypoint or command:
    docker run -it --entrypoint /bin/bash <image>
    docker run -it <image> /bin/bash
    ```
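
    To find containers that died, the status column of `docker ps -a` can be filtered for non-zero exit codes. A sketch, assuming lines of the form `name Exited (code) ...` as produced by `docker ps -a --format '{{.Names}} {{.Status}}'`; `crashed_containers` is a made-up name.

    ```shell
    # crashed_containers: read "name status" lines on stdin and print the name
    # and exit code of containers that exited with a non-zero code.
    crashed_containers() {
        awk '$2 == "Exited" { code = $3; gsub(/[()]/, "", code)
                              if (code + 0 != 0) print $1, code }'
    }

    # Live usage: docker ps -a --format '{{.Names}} {{.Status}}' | crashed_containers
    ```

    Exit code 137 is 128 + 9 (SIGKILL), which is often, though not always, a sign of hitting a memory limit; `docker inspect <container>` reports `OOMKilled` under `State`.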

    ### Kubernetes

    ```
    kubectl cluster-info
    kubectl get events --sort-by=.metadata.creationTimestamp
    kubectl get pods --show-labels -o wide
    kubectl top node my-node
    kubectl api-resources
    kubectl explain pods
    kubectl rollout history deployment/frontend
    kubectl rollout undo deployment/frontend --to-revision=3
    kubectl rollout restart deployment/frontend
    kubectl logs mypod --since 2m
    kubectl logs mypod --previous
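    # CrashLoopBackOff: container keeps exiting, e.g. image with a bad CMD (an image that can't be pulled shows as ImagePullBackOff).
    # To debug, deploy the image with a sleep command so the container stays up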
    kubectl describe ingress myingress
    kubectl port-forward svc/my-service 5000
    kubeval my-invalid.yaml
    kubectl diff -f ./my-manifest.yaml
    # https://kubernetes.io/docs/tasks/debug-application-cluster/debug-running-pod/
    kubectl debug -it yourpod --image=busybox:1.28 --target=<container> # --target is a container name in the pod
    kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
    kubectl debug myapp -it --copy-to=myapp-debug -- sh
    kubectl debug node/mynode -it --image=ubuntu
    ```
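
    For a first pass over a namespace, it helps to filter the pod list down to the ones that are not fine. A sketch, assuming the default `kubectl get pods --no-headers` columns (NAME READY STATUS ...); `unhealthy_pods` is a hypothetical helper, and it does not catch pods that are `Running` but not ready.

    ```shell
    # unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
    # print name and status of pods that are neither Running nor Completed.
    unhealthy_pods() {
        awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
    }

    # Live usage: kubectl get pods --no-headers | unhealthy_pods
    ```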

    ### DNS

    `host example.com`

    From `dnsutils` package:

    `dig +short example.com` , `dig @ns_ip example.com`

    `nslookup example.com` : resolves and tells you what DNS server you are using.

    Critical files:

    ```
    /etc/nsswitch.conf # order of resolving
    /etc/resolv.conf # nameservers
    /etc/hosts # hard-coded hostname-ip maps
    ```
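
    When resolution is flaky, querying each configured nameserver on its own shows whether one of them is the problem. A minimal sketch that lists the servers from a `resolv.conf` style file; `nameservers` is a made-up helper name.

    ```shell
    # nameservers FILE: print the nameserver addresses listed in a
    # resolv.conf style file, in order (commented-out lines are skipped).
    nameservers() {
        awk '$1 == "nameserver" { print $2 }' "$1"
    }

    # Live usage: ask each configured resolver directly
    #   for ns in $(nameservers /etc/resolv.conf); do dig +short "@$ns" example.com; done
    ```

    On hosts using systemd-resolved the file usually lists only the local stub (127.0.0.53), so the upstream servers have to be looked up elsewhere.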

    ## Applications

    ### nginx

    Test configuration: `nginx -t`

    Test and dump config: `nginx -T`

    ### etcd

    ```
    etcdctl get --prefix --keys-only /
    etcdctl get "" --prefix=true --keys-only
    etcdctl endpoint status --write-out=table # use --write-out=json to get the current revision
    etcdctl member list --write-out=table
    etcdctl alarm list
    curl http://127.0.0.1:2379/health
    grep "[CE] |" etcd.log # critical and error lines
    grep "apply entries took too long" etcd.log
    etcdctl compact $revision # discard key history older than this revision
    ```
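
    The revision to pass to `etcdctl compact` can be taken from the JSON endpoint status. A sketch, assuming the compact JSON that `etcdctl endpoint status --write-out=json` prints (a `"revision":<number>` pair with no spaces); `current_revision` is a hypothetical helper name.

    ```shell
    # current_revision: read `etcdctl endpoint status --write-out=json` output
    # on stdin and print the first "revision" value found.
    current_revision() {
        grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
    }

    # Live usage (compaction discards history, so double-check before running):
    #   rev=$(etcdctl endpoint status --write-out=json | current_revision)
    #   etcdctl compact "$rev"
    ```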