@jenbam
Forked from ameenkhan07/FB-PE-InterviewTips.md
Created July 30, 2021 23:05

Revisions

  1. @ameenkhan07 ameenkhan07 revised this gist May 6, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion LinuxInternals.md
    Original file line number Diff line number Diff line change
    @@ -55,7 +55,7 @@ Types of loads : CPU Bound load, Memory bound load, IO bound load
    - summary of servers memory utilization statistics, short for virtual memory stat.
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means the server is saturated.
    - **free**: free memory in kilobytes. The *free* command is a more detailed alternative.
    - **si, so**: Page in and page out (paging~swapping). When pages are written into memory from disk, it is pageout. Page in, when data(process data) is brought from disk to memory, in the forms of pages .
    - **si, so**: Page-in and page-out (paging ~ swapping). A page-out is when pages are written from memory to disk. A page-in is when process data is brought from disk into memory in the form of pages.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than running applications (thrashing).
    In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    This might be mistaken for an IO problem, since the disk is being used as memory and swapping requires reads/writes on that device.
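The vmstat rules of thumb above (r greater than the CPU count, constant page-outs) can be sketched as a small check. This is a minimal sketch, not a definitive tool: the sample text is hypothetical and hand-written in the usual `vmstat 1` column layout, and the function names `parse_vmstat` and `vmstat_warnings` are made up for illustration.

```python
# Hypothetical sample in the usual `vmstat 1` layout (values invented for illustration).
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0  10240  51200   1024 204800    0  350    12   900  500 1200 70 25  3  2  0
"""


def parse_vmstat(text):
    """Return one dict per data row, keyed by the column names in the header."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    # The column header is the line whose first field is "r".
    header_idx = next(i for i, fields in enumerate(lines) if fields[0] == "r")
    header = lines[header_idx]
    return [dict(zip(header, map(int, fields))) for fields in lines[header_idx + 1:]]


def vmstat_warnings(row, cpu_count):
    """Apply the two rules of thumb from the notes to a single vmstat row."""
    warnings = []
    if row["r"] > cpu_count:
        warnings.append("CPU saturation: r=%d exceeds %d CPUs" % (row["r"], cpu_count))
    if row["so"] > 0:
        warnings.append("page-outs: so=%d, possible memory pressure" % row["so"])
    return warnings


rows = parse_vmstat(SAMPLE_VMSTAT)
for warning in vmstat_warnings(rows[0], cpu_count=4):
    print(warning)
```

In practice the text would come from running `vmstat 1` for a few seconds, and a single row is not enough: the notes stress *constant* page-outs, so the check should be applied over several samples.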
  2. @ameenkhan07 ameenkhan07 renamed this gist Feb 4, 2019. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  3. @ameenkhan07 ameenkhan07 revised this gist Jan 20, 2019. 3 changed files with 18 additions and 23 deletions.
    9 changes: 9 additions & 0 deletions InterviewTips.md
    @@ -0,0 +1,9 @@
    ### What to Expect and Tips
    • 45-minute systems interview, focus on responding to real world problems with an unhealthy service, such as a web server or database. The interview will start off at a high level troubleshooting a likely scenario, dig deeper to find the cause and some possible solutions for it. The goal is to probe your knowledge of systems at scale and under load, so keep in mind the challenges of the Facebook environment.
    • Focus on things such as tooling, memory management and unix process lifecycle.

    ### Systems
    More specifically, Linux troubleshooting and debugging. Understanding things like memory, IO, CPU, and the shell would be pretty helpful. Knowing how to actually write a Unix shell would also be a good idea. What tools might you use to debug something? On another note, this interview will likely push the boundaries of what you know (and how to implement it).

    ### Design/Architecture 
    The interview is all about taking an ambiguous question of how you might build a system and letting you guide the way. Your interviewer will add constraints when necessary, and the idea is to get a simple, workable solution on the board. Load and monitoring are things you might consider. What you consider is just as important as what you don't, so ask clarifying questions and gather requirements when appropriate.
    16 changes: 9 additions & 7 deletions Resources.md
    @@ -1,12 +1,14 @@
    Links used in my preparation

    - https://www.quora.com/How-should-I-prepare-for-a-production-engineer-interview-at-Facebook
    - http://www.brendangregg.com/linuxperf.html
    Experiences:
    - https://www.quora.com/How-should-I-prepare-for-a-production-engineer-interview-at-Facebook
    - https://shivamkhandelwal.in/production-engineering-internship-interview-process-facebook/

    - https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55
    - http://www.brendangregg.com/blog
    - https://wizardzines.com/

    - Systems Performance by Brendan Gregg and his [blog](http://www.brendangregg.com/blog)
    - http://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
    - http://www.brendangregg.com/blog/2014-11-22/linux-perf-tools-2014.html
    - http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
    - https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55
    - http://www.brendangregg.com/linuxperf.html
    - http://www.brendangregg.com/blog/2014-11-22/linux-perf-tools-2014.html

    - https://wizardzines.com/
    16 changes: 0 additions & 16 deletions SystemsInterview.md
    @@ -1,16 +0,0 @@
    ### What to Expect
    • This 45-minute systems interview will focus on responding to real world problems with an unhealthy service, such as a web server or database. The interview will start off at a high level troubleshooting a likely scenario, dig deeper to find the cause and some possible solutions for it. The goal is to probe your knowledge of systems at scale and under load, so keep in mind the challenges of the Facebook environment.
    • Depending on how your conversation goes, your interviewer may ask to use CoderPad.
    • Some of the questions may be around scalability, so think of solutions that would apply and be effective in our environment.

    ### Helpful Tips
    • Focus on things that might show up in your average Operating Systems class such as tooling, memory management and unix process lifecycle.
    • Spend time on a linux system — maybe even install one from scratch. Run Linux as your primary desktop environment for a while to force yourself to learn how it works, even though servers != desktops.
    • Brendan Gregg's blog & his book "Systems Performance" may help refresh basic OS material
    • What's it like to be a PE at Facebook?

    ### Systems
    More specifically, Linux troubleshooting and debugging. Understanding things like memory, IO, CPU, and the shell would be pretty helpful. Knowing how to actually write a Unix shell would also be a good idea. What tools might you use to debug something? On another note, this interview will likely push the boundaries of what you know (and how to implement it).

    ### Design/Architecture 
    This interview is all about taking an ambiguous question of how you might build a system and letting you guide the way. Your interviewer will add constraints when necessary, and the idea is to get a simple, workable solution on the board. Load and monitoring are things you might consider. What you consider is just as important as what you don't, so ask clarifying questions and gather requirements when appropriate.
  4. @ameenkhan07 ameenkhan07 revised this gist Jan 20, 2019. 3 changed files with 59 additions and 69 deletions.
    60 changes: 59 additions & 1 deletion LinuxInternals.md
    @@ -1,4 +1,34 @@
    ## Troubleshooting and Debugging
    - Analysing Performance, steps involved, and ending.
    - Linux OS performance metrics. A lot of monitoring tools are built on top of these metrics.


    ### Methodologies

    - Provide guidance in choosing the performance tools: a starting point, a process, and an ending point.
    1. Anti-pattern: people tend to run the commands they know instead of trying to understand the problem and attacking that. (Drunk Man Anti-Method) Randomly throwing everything at the problem.
    2. Maybe network, firewall etc

    - Problem Statement Method
    - Why do you think there is a performance problem? Is this a new problem, or has it been there for some time? Has something changed recently? Can the problem be expressed in terms of latency or run time?

    - Workload Characterization Method
    - Who is causing the load? Why is the load called? What is the load? How has the load changed over time?
    - Solve some issues

    - USE Method
    - USE: Utilization, Saturation, Errors
    - Draw a functional diagram of the system (listing all components of the system),
    and for every resource check **utilization** (busy time),
    **saturation** (queue length/time), and **errors** (easy to interpret).
    - Current tools might not look everywhere, so this method poses the questions before the answers and looks at places that are sometimes missed.

    - CPU Analysis
    - Processes get deadlocked/blocked at some point (paging, context switching, network IO)

    - CPU Profile Method
    - Flame graph
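The USE method described above amounts to a checklist over every resource in the functional diagram. A minimal sketch, assuming hypothetical names (`ResourceCheck`, `use_findings`) and hypothetical numbers; the 60 percent utilization threshold is borrowed from the iostat rule of thumb in these notes and is only an assumption for other resources.

```python
from dataclasses import dataclass


@dataclass
class ResourceCheck:
    """One row of a USE checklist for a single resource (names are illustrative)."""
    name: str
    utilization_pct: float  # busy time, as a percentage
    saturation: float       # queue length or wait time; 0 means nothing is queued
    errors: int             # error count for the resource


def use_findings(checks, util_threshold=60.0):
    """Return (resource, metric) pairs worth investigating.

    Errors are checked first because they are the easiest to interpret,
    then saturation, then utilization.
    """
    findings = []
    for check in checks:
        if check.errors > 0:
            findings.append((check.name, "errors"))
        if check.saturation > 0:
            findings.append((check.name, "saturation"))
        if check.utilization_pct >= util_threshold:
            findings.append((check.name, "utilization"))
    return findings


# Hypothetical measurements for three resources.
checks = [
    ResourceCheck("cpu", utilization_pct=95.0, saturation=3, errors=0),
    ResourceCheck("disk", utilization_pct=20.0, saturation=0, errors=0),
    ResourceCheck("network", utilization_pct=10.0, saturation=0, errors=2),
]
print(use_findings(checks))
```

The point of the structure is that the questions (which resource, which of the three metrics) are fixed before any tool is run, so resources that the familiar tools do not cover still get a row in the list.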


    ### Tooling

    @@ -111,4 +141,32 @@ Types of loads : CPU Bound load, Memory bound load, IO bound load
    #### Network Issues

    - ethtool
    - Diagnose network links
    - Diagnose network links


    ### Sample Examples

    Hint: Get a functional diagram of the environment; it makes it easier to create a checklist.

    ----


    Example 1 : *System is slow*
    1. Start with the top command for processes and CPU
    2. Check disk IO (iostat) and network (sar)
    What to do in such a case: quantify the problem (is it latency, etc.). Check system resources with the methodologies and run through the checklist.

    Example 2:
    Application Latency is higher.
    USE METHOD:
    1. **top** command
    - Check the CPU summary, process/kernel time, and CPU utilization (whether it is at 100 percent or not).
    2. Check CPU utilization again with **vmstat** to see patterns. Check memory: whether there is enough left and it is not leaning towards the saturation point.
    3. **mpstat** to check whether any single CPU is maxed out
    ```
    Utilization and saturation metrics: not too much swapping, enough memory left, CPUs not overloaded, CPU time for kernel/application not too high, r not much greater than the number of CPUs present.
    CPU saturation/utilization is flexible in Linux: the kernel manages/moves things around, interrupts threads etc. if needed. The same is not the case with IO.
    ```
    4. Check disk IO utilization with **iostat**. util column: more than 60 percent utilization might be the problem.
    5. Check network IO utilization with **sar -n DEV 1**.
    6. **pidstat** for process-wise usage.
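Step 4 above (the iostat util column and the 60 percent rule of thumb) can be sketched as follows. The sample text is hypothetical and simplified: real `iostat -xz 1` output has more columns and extra header lines, and the column set differs between sysstat versions, so the sketch locates `%util` by name instead of by position. The function name `busy_devices` is made up for illustration.

```python
# Hypothetical, simplified sample; real iostat -x output has more columns.
SAMPLE_IOSTAT = """\
Device            r/s     w/s     rkB/s     wkB/s   await  %util
sda             12.00  340.00    96.00  27200.00   45.20  91.30
sdb              1.00    2.00     8.00     16.00    0.80   1.10
"""


def busy_devices(text, threshold=60.0):
    """Return {device: %util} for devices whose utilization exceeds the threshold."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    util_idx = header.index("%util")  # find the column by name, not position
    busy = {}
    for fields in lines[1:]:
        util = float(fields[util_idx])
        if util > threshold:
            busy[fields[0]] = util
    return busy


print(busy_devices(SAMPLE_IOSTAT))
```

A device over the threshold is only a candidate: the notes pair util with await, so a high util with a normal await is less alarming than both being high.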
    65 changes: 0 additions & 65 deletions LinuxPerformanceMethodoligies.md
    @@ -1,65 +0,0 @@
    - Analysing Performance, steps involved, and ending.
    - Linux OS performance metrics. A lot of monitoring tools are built on top of these metrics.

    Example : *System is slow*
    1. Start with the top command for processes and CPU
    2. Check disk IO (iostat) and network (sar)
    What to do in such a case: quantify the problem (is it latency, etc.). Check system resources with the methodologies and run through the checklist.

    ## Methodologies
    - Provide guidance in choosing the performance tools: a starting point, a process, and an ending point.

    1. Anti-pattern: people tend to run the commands they know instead of trying to understand the problem and attacking that. (Drunk Man Anti-Method) Randomly throwing everything at the problem.
    2. Maybe network, firewall etc

    ### Actual Methodologies
    - Problem Statement Method
    - Why do you think there is a performance problem? Is this a new problem, or has it been there for some time? Has something changed recently? Can the problem be expressed in terms of latency or run time?

    - Workload Characterization Method
    - Who is causing the load? Why is the load called? What is the load? How has the load changed over time?
    - Solve some issues

    - USE Method
    - USE: Utilization, Saturation, Errors
    - Draw a functional diagram of the system (listing all components of the system),
    and for every resource check **utilization** (busy time),
    **saturation** (queue length/time), and **errors** (easy to interpret).
    - Current tools might not look everywhere, so this method poses the questions before the answers and looks at places that are sometimes missed.

    - CPU Analysis
    - Processes get deadlocked/blocked at some point (paging, context switching, network IO)

    - CPU Profile Method
    - Flame graph

    #### Tools
    Categorised:
    | Observability Tools : Watch Activity | Benchmarking : Load Test | Tuning : Changing system parameters | Static : Changing system configs.
    Observability Tools : LinuxInternal.md

    Hint: Get a functional diagram of the environment; it makes it easier to create a checklist.

    ----

    Example 2:
    Application Latency is higher.
    USE METHOD:
    1. **top** command
    - Check the CPU summary, process/kernel time, and CPU utilization (whether it is at 100 percent or not).
    2. Check CPU utilization again with **vmstat** to see patterns. Check memory: whether there is enough left and it is not leaning towards the saturation point.
    3. **mpstat** to check whether any single CPU is maxed out
    ```
    Utilization and saturation metrics: not too much swapping, enough memory left, CPUs not overloaded, CPU time for kernel/application not too high, r not much greater than the number of CPUs present.
    CPU saturation/utilization is flexible in Linux: the kernel manages/moves things around, interrupts threads etc. if needed. The same is not the case with IO.
    ```
    4. Check disk IO utilization with **iostat**. util column: more than 60 percent utilization might be the problem.
    5. Check network IO utilization with **sar -n DEV 1**.
    6. **pidstat** for process-wise usage.


    -----

    Category of problem : Sluggish slow server

    Questions to consider : what is load and when it is high
    3 changes: 0 additions & 3 deletions Misc.md
    @@ -15,7 +15,4 @@ It is a datastructure in linux which stores all the info about a file besides it
    - Out of inodes. Inodes are limited for the filesystem. If all of them are used up, you cannot add more files
    *df -i*
    - Corrupted filesystem blocks



    ### Test
  5. @ameenkhan07 ameenkhan07 revised this gist Jan 17, 2019. 2 changed files with 11 additions and 4 deletions.
    12 changes: 8 additions & 4 deletions LinuxInternals.md
    @@ -25,7 +25,7 @@ Types of loads : CPU Bound load, Memory bound load, IO bound load
    - summary of servers memory utilization statistics, short for virtual memory stat.
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means the server is saturated.
    - **free**: free memory in kilobytes. The *free* command is a more detailed alternative.
    - **si, so**: Page in and page out (paging~swapping). When pages are written into disk from memory, it is pageout. Page in, when data(process data) is brought from disk to memory, in the forms of pages.
    - **si, so**: Page in and page out (paging~swapping). When pages are written into memory from disk, it is pageout. Page in, when data(process data) is brought from disk to memory, in the forms of pages .
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than running applications (thrashing).
    In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    This might be mistaken for an IO problem, since the disk is being used as memory and swapping requires reads/writes on that device.
    @@ -60,11 +60,15 @@ Types of loads : CPU Bound load, Memory bound load, IO bound load
    - buffer/cached = sum of buffer and cache. Buffer used for block device io, cache used by virtual page cache.

    - **sar -n DEV 1**
    - Tool to check network throughput and ensure if it is under the limit. rxKbps and txkBps : measure of workload
    - Tool to check network throughput and ensure it is under the limit. rxkB/s and txkB/s: measure of workload

    - **sar -n TCP,ETCP 1**
    - Overview of tcp metrics.
    - **active** and **passive**: outbound and inbound connections. Used as measure of network load on the server
    - Overview of tcp metrics.
    - **active** and **passive**: outbound and inbound connections. Used as measure of network load on the server

    - **sar**
    - Statistics archive for CPU, memory, IO, and network; stores data for a month. Use the -A option for all the records
    - By default, the sar command shows stats for the day, but -s and -e can be used to select a specific time period of the day

    - **top**
    - System wide summary. All of above (memory, CPU, IO, network)
    3 changes: 3 additions & 0 deletions Misc.md
    @@ -16,3 +16,6 @@ It is a datastructure in linux which stores all the info about a file besides it
    *df -i*
    - Corrupted filesystem blocks



    ### Test
  6. @ameenkhan07 ameenkhan07 revised this gist Jan 16, 2019. 3 changed files with 55 additions and 16 deletions.
    46 changes: 30 additions & 16 deletions LinuxInternals.md
    @@ -4,29 +4,37 @@

    ```
    Tool Categories : observability, benchmarking, tuning, static performance tuning, profiling, and tracing
    Types of resources : CPU, Memory, Block Devices(disk), Network Devices
    Types of resources : CPU, Memory, IO ie Block Devices(disk) and Network Devices
    Types of loads : CPU Bound load, Memory bound load, IO bound load
    ```

    #### Observability Tools : Basics

    - **uptime**
    - Measure of CPU demand, obtained by looking at the system (CPU + disks) load averages (number of processes that are running or waiting to run)
    - Load average: average number of processes that have to wait for CPU time.
    - *High-level* idea of system usage and how the load changes. Three numbers: moving load averages over 1, 5, and 15 minutes.
    - Interpretation: if the 1-minute load average is higher than the 15-minute one, the load is increasing; if it is the reverse, the load is decreasing. If the load is 0.0, the CPU is idle.
    - If load average is greater than CPU, meaning more work than what cpu can dispatch. CPU Saturation
    - If the load average is greater than the number of CPUs, there is more work than the CPUs can dispatch. Indicates CPU saturation
    - Better alternatives: per-CPU utilization using mpstat -P ALL 1; per-process CPU utilization using top or pidstat

    - **dmesg | tail**
    - Lists system messages; error messages related to performance can be looked up here

    - **vmstat**
    - summary of servers memory utilization statistics, short for virtual memory stat
    - **vmstat 1**
    - Summary of the server's memory utilization statistics; short for virtual memory statistics.
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means the server is saturated.
    - **free**: free memory in kilobytes. The *free* command is a more detailed alternative.
    - **si, so**: Page in and page out (paging~swapping). When pages are written into disk from memory, it is pageout. Page in, when data(process data) is brought from disk to memory, in the forms of pages.
    Pageins are fine, application initialization will have page-ins. Too many page-out indicate that kernel might be spending too much time managing than application processing (thrashing). In case of constant pageouts, check process occupying cpu the most using ps command.
    - **us, sy, id, wa, st** : CPU times : user time(application ), system time (kernel), idle, wait I/O, and stolen time.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than running applications (thrashing).
    In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    This might be mistaken for an IO problem, since the disk is being used as memory and swapping requires reads/writes on that device.
    Swapping can cause cascading bad performance, since waiting processes might keep piling up.
    - **us, sy, id, wa, st**: percentage of CPU time: user time (application), system time (kernel and other system processes), idle (CPU is idle; higher is better), wait I/O, and stolen time.
    The first two indicate that the load may be CPU bound; looking into the processes with pidstat/top might help identify the culprit process(es).

    - Options: 1: report every second; -t: timestamp column; -S M: data in megabytes


    - **mpstat -P ALL 1**
    - CPU time breakdowns per CPU (core) using the -P option. One core/CPU being overworked indicates heavy usage by a single-threaded app.
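The load-average interpretation from the **uptime** notes above (compare the 1-minute and 15-minute values, and compare against the CPU count) can be sketched as below. The function name `interpret_load` is hypothetical; `os.getloadavg()` and `os.cpu_count()` are standard-library calls, and the former is only available on Unix-like systems.

```python
import os


def interpret_load(load1, load15, cpu_count):
    """Return (trend, saturated) following the rules of thumb in the notes.

    trend: "increasing" if the 1-minute average is above the 15-minute one,
           "decreasing" if it is below, "steady" otherwise.
    saturated: True if the 1-minute average exceeds the number of CPUs.
    """
    if load1 > load15:
        trend = "increasing"
    elif load1 < load15:
        trend = "decreasing"
    else:
        trend = "steady"
    return trend, load1 > cpu_count


if __name__ == "__main__":
    load1, _load5, load15 = os.getloadavg()  # same three numbers uptime prints
    print(interpret_load(load1, load15, os.cpu_count() or 1))
```

This only gives the high-level picture. As the notes say, mpstat, top, and pidstat are the better follow-ups once the load average looks suspicious.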
    @@ -41,10 +49,10 @@ Types of resources : CPU, Memory, Block Devices(disk), Network Devices
    - **iostat -xz 1**
    - Used for devices (hard disks), to understand the workload applied and resulting performance.
    - Workload metrics
    - **r/s, w/s, rkB/s, wkB/s** : no of reads & writes, no of kB read and written from the atttached devices
    - **r/s, w/s, rkB/s, wkB/s** : no of reads & writes, no of kB read and written from the attached devices
    - Resulting performance metrics
    - **await**: average time for IO, including time queued, time being serviced, and time waiting on a blocked disk. Larger-than-expected times might mean device saturation.
    - **util** : Percentage of time the device is doing work. Interpretation: more than 60 percent indicate device sationation
    - **util** : Percentage of time the device is doing work. Interpretation: more than 60 percent indicate device saturation


    - **free -m**
    @@ -71,26 +79,32 @@ Types of resources : CPU, Memory, Block Devices(disk), Network Devices
    #### Observability Tools : Intermediate

    - **strace**: System call tracer. Translates syscall args. Useful in solving system usage issues.
    - Implementations of strace use ptrace, alternate use perf using perf-trace. Former slows system down
    so have to be cautious.
    - Blocks the target, slows application down, Shouldnt be used in production.
    - Implementations of strace use ptrace; an alternative is perf, using perf trace. The former slows the system down,
    so one has to be cautious.
    - Blocks the target and slows the application down. Shouldn't be used in production.

    - **tcpdump**
    - Trace packets, packet sequences, etc. Doesn't scale well when network IO is in high volumes (gigabits).

    - **netstat**
    - Prints network protocol statistics. Different options provide differ information (interface stats, route table etc)
    - Better command (ip table etc) **ss**
    - Prints network protocol statistics. Different options provide different information (interface stats, route table etc)
    - Better command (ip table etc) **ss**
    - **nicstat**
    - Network interface stats
    - Network interface stats

    - **swapon -s**
    - Shows swap device usage
    - Shows swap device usage

    - **lsof**
    - Debug tool. Understand env, who is connected to who. Which files are connected to which process.
    - Debug tool. Understand the environment: who is connected to whom, and which files are opened by which process.

    - **sar**
    - System activity reporter. Many statistics (TCP, DEV(networking))
    - Complements top by giving statistics from the past.



    #### Network Issues

    - ethtool
    - Diagnose network links
    7 changes: 7 additions & 0 deletions LinuxPerformanceMethodoligies.md
    @@ -56,3 +56,10 @@ USE METHOD:
    4. Check disk IO utilization with **iostat**. util column: more than 60 percent utilization might be the problem.
    5. Check network IO utilization with **sar -n DEV 1**.
    6. **pidstat** for process-wise usage.


    -----

    Category of problem : Sluggish slow server

    Questions to consider : what is load and when it is high
    18 changes: 18 additions & 0 deletions Misc.md
    @@ -0,0 +1,18 @@
    ### What is an inode

    An inode is a data structure in Linux that stores all the information about a file besides its name and its contents.

    ### What happens when "No space left on device"

    - Check available disk space and its usage using the **du** or **df** command
    - du -sh /
    Space used under a particular directory (here, /)
    - df -h
    Per-filesystem split of space usage/availability
    - Reasons:
    - A deleted file is still held open by a process, so its space is not released by the kernel.
    Use the lsof command to find the process causing the problem and restart it.
    - Out of inodes. Inodes are limited for the filesystem. If all of them are used up, you cannot add more files
    *df -i*
    - Corrupted filesystem blocks
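The two most common causes above (out of blocks, out of inodes) can both be read from one `statvfs` call, which is roughly what `df -h` and `df -i` report. A minimal sketch, assuming a Unix-like system; the function name `filesystem_usage` is hypothetical, and the percentages may differ slightly from df because df accounts for reserved blocks differently.

```python
import os


def filesystem_usage(path="/"):
    """Return approximate space and inode usage percentages for the filesystem at path."""
    st = os.statvfs(path)

    def pct_used(total, free):
        # Some filesystems report 0 total inodes; avoid dividing by zero.
        return 0.0 if total == 0 else 100.0 * (total - free) / total

    return {
        "space_used_pct": pct_used(st.f_blocks, st.f_bfree),   # compare with df -h
        "inodes_used_pct": pct_used(st.f_files, st.f_ffree),   # compare with df -i
    }


usage = filesystem_usage("/")
print("space used: %.1f%%, inodes used: %.1f%%"
      % (usage["space_used_pct"], usage["inodes_used_pct"]))
```

If space looks fine but inodes are near 100 percent, the "No space left on device" error is an inode problem. If both look fine, the deleted-but-open-file case from the list is worth checking with lsof.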

  7. @ameenkhan07 ameenkhan07 revised this gist Jan 14, 2019. 3 changed files with 130 additions and 26 deletions.
    97 changes: 72 additions & 25 deletions LinuxInternals.md
    @@ -3,17 +3,20 @@
    ### Tooling

    ```
    observability, benchmarking, tuning, static performance tuning, profiling, and tracing
    Tool Categories : observability, benchmarking, tuning, static performance tuning, profiling, and tracing
    Types of resources : CPU, Memory, Block Devices(disk), Network Devices
    ```

    - uptime
    - Useful for CPU load averages (no of processes running and are waiting to run)
    - *High Level* idea of system usage, moving sum average of 1, 5, 15 minute.
    - "High level" because gives some idea of how the load is changing on a system, i.e. if the load average at 1 min is more than that of 15 min, the load is increasing, or if reverse then load is decreasing. If load is 0.0, then CPU is idle
    - Load averages : CPU demand, ie number of threads which are waiting to run on the CPU
    #### Observability Tools : Basics

    - **uptime**
    - Measure of CPU demand, obtained by looking at the system (CPU + disks) load averages (number of processes that are running or waiting to run)
    - *High-level* idea of system usage and how the load changes. Three numbers: moving load averages over 1, 5, and 15 minutes.
    - Interpretation: if the 1-minute load average is higher than the 15-minute one, the load is increasing; if it is the reverse, the load is decreasing. If the load is 0.0, the CPU is idle.
    - If load average is greater than CPU, meaning more work than what cpu can dispatch. CPU Saturation
    - Better alternatives: per-CPU utilization using mpstat -P ALL 1; per-process CPU utilization using top or pidstat
    - dmesg | tail

    - **dmesg | tail**
    - Lists system messages; error messages related to performance can be looked up here

    - **vmstat**
    @@ -22,28 +25,72 @@ observability, benchmarking, tuning, static performance tuning, profiling, and t
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written into disk from memory, it is pageout. Page in, when data(process data) is brought from disk to memory, in the forms of pages.
    Pageins are fine, application initialization will have page-ins. Too many page-out indicate that kernel might be spending too much time managing than application processing (thrashing). In case of constant pageouts, check process occupying cpu the most using ps command.
    - **us, sy, id, wa, st** : CPU times : user time, system time (kernel), idle, wait I/O, and stolen time.
    - **us, sy, id, wa, st**: CPU times: user time (application), system time (kernel), idle, wait I/O, and stolen time.
    - Options: 1: report every second; -t: timestamp column; -S M: data in megabytes

    - mpstat -P ALL 1
    - CPU time breakdowns per CPU(cores) using the -P option. One of the cores/CPU overworking indicate high usage of a single threaded app.
    - **mpstat -P ALL 1**
    - CPU time breakdowns per CPU (core) using the -P option. One core/CPU being overworked indicates heavy usage by a single-threaded app.
    - **usr**: percentage of CPU utilization while executing at the user level (application)
    - **sys**: percentage of CPU utilization while executing at the system level (kernel)

    - pidstat 1
    - Summary of per process statistics, like top, but doesnt clean the screen. Easy to see patterns over time.

    - iostat -xz 1
    - Used for devices (hard disks), to understand the workload applied and performance.
    - **r/s, w/s, rkB/s, wkB/s** : no of reads & writes, no of kB read and written from the atttached devices
    - **util** : Percentage of time the device is doing work. Interpretation: omre than 60 percent
    - **await** : avg time for io. Time queued or time being serviced. Larger than expected times might mean device saturation.

    - **pidstat 1**
    - Summary of per-process statistics (breakdown), like top, but doesn't clear the screen.
    Rolling output makes it easy to see patterns over time.
    - usr and system for each process.

    - **iostat -xz 1**
    - Used for devices (hard disks), to understand the workload applied and resulting performance.
    - Workload metrics
    - **r/s, w/s, rkB/s, wkB/s** : no of reads & writes, no of kB read and written from the atttached devices
    - Resulting performance metrics
    - **await**: average time for IO, including time queued, time being serviced, and time waiting on a blocked disk. Larger-than-expected times might mean device saturation.
    - **util** : Percentage of time the device is doing work. Interpretation: more than 60 percent indicate device sationation

    - free -m

    - **free -m**
    - alternate cat /proc/meminfo
    - buffer/cached = sum of buffer and cache. Buffer used for deivce io, cache used by filesystem.
    - buffer/cached = sum of buffer and cache. Buffer used for block device io, cache used by virtual page cache.

    - **sar -n DEV 1**
    - Tool to check network throughput and ensure it is under the limit. rxkB/s and txkB/s: measure of workload

    - **sar -n TCP,ETCP 1**
    - Overview of tcp metrics.
    - **active** and **passive**: outbound and inbound connections. Used as measure of network load on the server

    - **top**
    - System wide summary. All of above (memory, CPU, IO, network)
    - Consumes cpu to read /proc.
    - % CPU summed across all CPUs.

    - **ps**
    - Process status listing



    #### Observability Tools : Intermediate

    - **strace**: System call tracer. Translates syscall args. Useful in solving system usage issues.
    - Implementations of strace use ptrace; an alternative is perf, using perf trace. The former slows the system down,
    so one has to be cautious.
    - Blocks the target and slows the application down. Shouldn't be used in production.

    - **tcpdump**
    - Trace packets, packet sequences, etc. Doesn't scale well when network IO is in high volumes (gigabits).

    - **netstat**
    - Prints network protocol statistics. Different options provide different information (interface stats, route table etc)
    - Better command (ip table etc) **ss**
    - **nicstat**
    - Network interface stats

    - **swapon -s**
    - Shows swap device usage

    - **lsof**
    - Debug tool. Understand the environment: who is connected to whom, and which files are opened by which process.

    - **sar**
    - System activity reporter. Many statistics (TCP, DEV(networking))
    - Complements top by giving statistics from the past.

    - sar -n DEV 1
    - sar -n TCP,ETCP 1
    - top
    58 changes: 58 additions & 0 deletions LinuxPerformanceMethodoligies.md
    @@ -0,0 +1,58 @@
    - Analysing Performance, steps involved, and ending.
    - Linux OS performance metrics. A lot of monitoring tools are built on top of these metrics.

    Example : *System is slow*
    1. Start with the top command for processes and CPU
    2. Check disk IO (iostat) and network (sar)
    What to do in such a case: quantify the problem (is it latency, etc.). Check system resources with the methodologies and run through the checklist.

    ## Methodologies
    - provide guidance in choosing the performance tools. Starting point, process and the ending point.

    1. Anti-pattern: people tend to run the commands they know instead of trying to understand what the problem is and working on solving that. (Drunk Man Anti-Method) Randomly throwing everything at the problem.
    2. Maybe network, firewall etc

    ### Actual Methodologies
    - Problem Statement Method
    - Why do you think there is a performance problem? Is this a new problem or has it been there for some time? Did something change recently? Can it be expressed in terms of latency or run-time?

    - Workload Characterization Method
    - Who is causing the load? Why is the load called? What is the load? How is the load changing over time?
    - Solve some issues

    - USE Method
    - USE : Utilization, Saturation, Errors
    - Functional diagram of the system (listing all components of the system),
    and for every resource check **utilization** (busy time),
    **saturation** (queue length/time), **errors** (easy to interpret).
    - Current tools might not look everywhere, so this method poses the questions before the answers, and looks at places which are sometimes missed.

    - CPU Analysis
    - Processes get deadlocked/blocked at some point (paging, context switching, network IO)

    - CPU Profile Method
    - Flame graph
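
The USE method described above can be read as a checklist: one row per resource, three questions per row. The following is a minimal sketch of that idea, not a tool from the notes. The `ResourceSample` type, the `use_findings` helper and the 60 percent default threshold are all assumptions made for illustration, and the numbers in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One row of a USE checklist (field names are illustrative)."""
    name: str
    utilization_pct: float  # busy time, as a percentage
    saturation: float       # queue length or wait time; 0 means no queueing
    errors: int             # error count reported for the resource


def use_findings(samples, util_threshold_pct=60.0):
    """Return a finding for every resource that fails a USE check.

    util_threshold_pct is an assumed cut-off, not a universal rule.
    """
    findings = []
    for s in samples:
        # Errors first: they are usually the quickest to interpret.
        if s.errors > 0:
            findings.append(f"{s.name}: {s.errors} errors")
        if s.saturation > 0:
            findings.append(f"{s.name}: saturated (queue={s.saturation})")
        if s.utilization_pct >= util_threshold_pct:
            findings.append(f"{s.name}: high utilization ({s.utilization_pct:.0f}%)")
    return findings


if __name__ == "__main__":
    # Made-up numbers standing in for values read from top, iostat, sar etc.
    checklist = [
        ResourceSample("cpu", 35.0, 0, 0),
        ResourceSample("disk sda", 85.0, 3, 0),
        ResourceSample("nic eth0", 10.0, 0, 2),
    ]
    for finding in use_findings(checklist):
        print(finding)
```

Building the list of rows from a functional diagram first is what keeps less obvious resources (interconnects, controllers) from being skipped.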

    #### Tools
    Categorised:
    | Observability Tools : Watch Activity | Benchmarking : Load Test | Tuning : Changing system parameters | Static : Checking system configs.
    Observability Tools : LinuxInternal.md

    Hint : Get a functional diagram of the environment; it makes it easier to create a checklist.

    ----

    Example 2:
    Application Latency is higher.
    USE METHOD:
    1. **top** command
    - Check cpu summary, process/kernel time, cpu utilization (if it is 100 percent or not).
    2. CPU utilization again with **vmstat** to see patterns. Check memory: whether there is enough left and it is not leaning towards the saturation point.
    3. **mpstat** to check if any CPU is maxing out
    ```
    Utilization and saturation metrics: swapping not too much, enough memory left, CPUs are not overloaded, CPU time for kernel/application is not too much, r is not a lot more than the number of CPUs present.
    CPU saturation/utilization is flexible in the case of Linux: the kernel manages/moves things around, interrupts threads etc if needed. The same is not the case with IO.
    ```
    4. Check Disk IO utilization with **iostat**. util column: more than 60 percent utilization might be the problem.
    5. Check Network IO utilization **sar -n DEV 1**.
    6. **pidstat** for process-wise usage.
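
The vmstat checks in the walkthrough above (run queue `r` against the CPU count, and swap activity in `si`/`so`) can be sketched in code. This is an illustrative parser only: it assumes the plain `vmstat 1` column layout without `-t`, the helper names `parse_vmstat` and `vmstat_warnings` are hypothetical, and the sample text is hand-written rather than captured from a real server.

```python
def parse_vmstat(text):
    """Parse vmstat text into a list of {column: int} dicts, one per sample."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # The column-name header line (r b swpd free ...).
            columns = fields
        elif (
            columns
            and len(fields) == len(columns)
            and all(f.lstrip("-").isdigit() for f in fields)
        ):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def vmstat_warnings(row, cpu_count):
    """Apply the two saturation checks from the notes to one vmstat sample."""
    warnings = []
    if row["r"] > cpu_count:
        warnings.append("run queue exceeds CPU count (CPU saturation)")
    if row["si"] > 0 or row["so"] > 0:
        warnings.append("swap activity (memory pressure)")
    return warnings


# Hand-written sample in the usual vmstat layout; the values are made up.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     4    12  210  480  6  2 91  1  0
 9  0  20480  15360   2048  51200  120  340   900  1500  950 2100 70 25  0  5  0
"""

if __name__ == "__main__":
    for sample in parse_vmstat(SAMPLE_VMSTAT):
        print(sample["r"], vmstat_warnings(sample, cpu_count=4))
```

In practice the text would come from running `vmstat 1` on the affected host, and `cpu_count` from something like `os.cpu_count()`.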
    1 change: 0 additions & 1 deletion SystemsInterview.md
    @@ -9,7 +9,6 @@
    • Brendan Gregg's blog & his book "Systems Performance" may help refresh basic OS material
    • What's it like to be a PE at Facebook?


    ### Systems
    More specifically, linux troubleshooting and debugging. Understanding things like memory, io, cpu, shell, memory etc. would be pretty helpful. Knowing how to actually write a unix shell would also be a good idea. What tools might you use to debug something? On another note, this interview will likely push your boundaries of what you know (and how to implement it).

  8. @ameenkhan07 ameenkhan07 revised this gist Jan 13, 2019. 1 changed file with 23 additions and 10 deletions.
    33 changes: 23 additions & 10 deletions LinuxInternals.md
    @@ -7,30 +7,43 @@ observability, benchmarking, tuning, static performance tuning, profiling, and t
    ```

    - uptime
    - Useful for CPU load averages (no of processes wanting to run)
    - Useful for CPU load averages (no of processes running or waiting to run)
    - *High Level* idea of system usage: exponentially damped moving averages over 1, 5 and 15 minutes.
    - "High level" because gives some idea of how the load is changing on a system, i.e. if the load average at 1 min is more than that of 15 min, the load is increasing, or if reverse then load is decreasing. If load is 0.0, then CPU is idle
    - Load averages : CPU demand, ie number of threads which are waiting to run on the CPU
    - Better alternatives : per-CPU utilization - using mpstat -P ALL 1, per-process CPU utilization - top, pidstat

    - dmesg | tail
    - Lists system messages, errors messages related to performance measures can be looked from here

    - **vmstat**
    - summary of the server's memory utilization statistics, short for virtual memory stat
    - Columns:
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means saturation of the server.
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written from memory to disk, it is a page-out. A page-in is when data (process data) is brought from disk into memory, in the form of pages.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than on application processing (thrashing). In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    - **us, sy, id, wa, st** : CPU times : user time, system time (kernel), idle, wait I/O, and stolen time.
    - Options: `1`: report every second, `-t`: timestamp column, `-S M`: data in megabytes

    - mpstat -P ALL 1
    - CPU time breakdowns per CPU(cores) using the -P option. One of the cores/CPU overworking indicate high usage of a single threaded app.
    - **usr** : percentage of cpu utilization while executing user level application
    - **sys** : percentage of cpu utilization while executing at the kernel level

    - pidstat 1
    - Summary of per-process statistics, like top, but doesn't clear the screen. Easy to see patterns over time.

    - iostat -xz 1
    - Used for devices (hard disks), to understand the workload applied and performance.
    - **r/s, w/s, rkB/s, wkB/s** : no of reads & writes, no of kB read from and written to the attached devices
    - **util** : Percentage of time the device is doing work. Interpretation: more than 60 percent might be a problem.
    - **await** : avg time for io. Time queued or time being serviced. Larger than expected times might mean device saturation.

    - free -m
    - alternate cat /proc/meminfo
    - buffer/cached = sum of buffer and cache. Buffer used for device io, cache used by filesystem.


    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
    - sar -n DEV 1
    - sar -n TCP,ETCP 1
    - top
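
The load-average reading described for uptime (compare the 1-minute figure with the 15-minute one to see whether load is rising or falling) can be sketched as below. This is only an illustration: the function names and the `tolerance` parameter are hypothetical, the sample line is hand-written, and the regular expression assumes a locale that prints decimals with a dot.

```python
import re


def parse_load_averages(uptime_output):
    """Return the (1-min, 5-min, 15-min) load averages found in uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load averages found in input")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_trend(one_min, fifteen_min, tolerance=0.05):
    """Compare the 1-minute and 15-minute averages.

    tolerance is a hypothetical dead band so tiny differences read as steady.
    """
    if one_min > fifteen_min + tolerance:
        return "increasing"
    if one_min < fifteen_min - tolerance:
        return "decreasing"
    return "steady"


# Illustrative sample line, not captured from a real machine.
SAMPLE_UPTIME = " 23:05:12 up 10 days,  3:04,  2 users,  load average: 3.20, 1.80, 0.90"

if __name__ == "__main__":
    one, five, fifteen = parse_load_averages(SAMPLE_UPTIME)
    print(one, five, fifteen, load_trend(one, fifteen))
```

On a Unix host, Python's `os.getloadavg()` returns the same three numbers without parsing any text; the parser is shown only because the notes work from the `uptime` output.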
  9. @ameenkhan07 ameenkhan07 revised this gist Jan 11, 2019. 1 changed file with 5 additions and 5 deletions.
    10 changes: 5 additions & 5 deletions LinuxInternals.md
    @@ -18,11 +18,11 @@ observability, benchmarking, tuning, static performance tuning, profiling, and t
    - **vmstat**
    - summary of the server's memory utilization statistics, short for virtual memory stat
    - Columns:
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means saturation of the server.
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written from memory to disk, it is a page-out. A page-in is when data (process data) is brought from disk into memory, in the form of pages.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than on application processing (thrashing). In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    - **us, sy, id, wa, st** : CPU times : user time, system time (kernel), idle, wait I/O, and stolen time.
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means saturation of the server.
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written from memory to disk, it is a page-out. A page-in is when data (process data) is brought from disk into memory, in the form of pages.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than on application processing (thrashing). In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    - **us, sy, id, wa, st** : CPU times : user time, system time (kernel), idle, wait I/O, and stolen time.



  10. @ameenkhan07 ameenkhan07 revised this gist Jan 11, 2019. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions LinuxInternals.md
    @@ -16,8 +16,8 @@ observability, benchmarking, tuning, static performance tuning, profiling, and t
    - Lists system messages, errors messages related to performance measures can be looked from here

    - **vmstat**
    - summary of the server's memory utilization statistics, short for virtual memory stat
    - Columns:
    - summary of the server's memory utilization statistics, short for virtual memory stat
    - Columns:
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means saturation of the server.
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written from memory to disk, it is a page-out. A page-in is when data (process data) is brought from disk into memory, in the form of pages.
  11. @ameenkhan07 ameenkhan07 revised this gist Jan 11, 2019. 1 changed file with 15 additions and 5 deletions.
    20 changes: 15 additions & 5 deletions LinuxInternals.md
    @@ -10,12 +10,22 @@ observability, benchmarking, tuning, static performance tuning, profiling, and t
    - Useful for CPU load averages (no of processes wanting to run)
    - *High Level* idea of system usage: exponentially damped moving averages over 1, 5 and 15 minutes.
    - "High level" because gives some idea of how the load is changing on a system, i.e. if the load average at 1 min is more than that of 15 min, the load is increasing, or if reverse then load is decreasing. If load is 0.0, then CPU is idle
    - Load averages : CPU demand
    - Load averages : CPU demand, ie number of threads which are waiting to run on the CPU

    -

    ad averages
    dmesg | tail
    - dmesg | tail
    - Lists system messages, errors messages related to performance measures can be looked from here

    - **vmstat**
    - summary of the server's memory utilization statistics, short for virtual memory stat
    - Columns:
    - **r**: number of processes waiting to run on CPU. A value greater than the CPU count means saturation of the server.
    - **free** : free memory in kilobytes. Alternative, more elaborate, *free*
    - **si, so**: Page in and page out (paging~swapping). When pages are written from memory to disk, it is a page-out. A page-in is when data (process data) is brought from disk into memory, in the form of pages.
    Page-ins are fine; application initialization will have page-ins. Too many page-outs indicate that the kernel might be spending more time managing memory than on application processing (thrashing). In case of constant page-outs, check which process is occupying the CPU the most using the ps command.
    - **us, sy, id, wa, st** : CPU times : user time, system time (kernel), idle, wait I/O, and stolen time.



    vmstat 1
    mpstat -P ALL 1
    pidstat 1
  12. @ameenkhan07 ameenkhan07 revised this gist Jan 10, 2019. 1 changed file with 23 additions and 2 deletions.
    25 changes: 23 additions & 2 deletions LinuxInternals.md
    @@ -1,5 +1,26 @@
    ## Troubleshooting and Debugging

    ### Tooling
    #### observability, benchmarking, tuning, static performance tuning, profiling, and tracing.
    -

    ```
    observability, benchmarking, tuning, static performance tuning, profiling, and tracing
    ```

    - uptime
    - Useful for CPU load averages (no of processes wanting to run)
    - *High Level* idea of system usage: exponentially damped moving averages over 1, 5 and 15 minutes.
    - "High level" because gives some idea of how the load is changing on a system, i.e. if the load average at 1 min is more than that of 15 min, the load is increasing, or if reverse then load is decreasing. If load is 0.0, then CPU is idle
    - Load averages : CPU demand

    -

    ad averages
    dmesg | tail
    vmstat 1
    mpstat -P ALL 1
    pidstat 1
    iostat -xz 1
    free -m
    sar -n DEV 1
    sar -n TCP,ETCP 1
    top
  13. @ameenkhan07 ameenkhan07 revised this gist Jan 10, 2019. 1 changed file with 1 addition and 2 deletions.
    3 changes: 1 addition & 2 deletions LinuxInternals.md
    @@ -1,6 +1,5 @@
    ## Troubleshooting and Debugging

    ### Tooling

    (observability, benchmarking, tuning, static performance tuning, profiling, and tracing.)
    #### observability, benchmarking, tuning, static performance tuning, profiling, and tracing.
    -
  14. @ameenkhan07 ameenkhan07 revised this gist Jan 10, 2019. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions LinuxInternals.md
    @@ -1,5 +1,6 @@
    ## Troubleshooting and Debugging

    ### Tooling

    (observability, benchmarking, tuning, static performance tuning, profiling, and tracing.)
    -
  15. @ameenkhan07 ameenkhan07 created this gist Jan 10, 2019.
    5 changes: 5 additions & 0 deletions LinuxInternals.md
    @@ -0,0 +1,5 @@
    ## Troubleshooting and Debugging

    ### Tooling
    (observability, benchmarking, tuning, static performance tuning, profiling, and tracing.)
    -
    12 changes: 12 additions & 0 deletions Resources.md
    @@ -0,0 +1,12 @@
    Links used in my preparation

    - https://www.quora.com/How-should-I-prepare-for-a-production-engineer-interview-at-Facebook
    - http://www.brendangregg.com/linuxperf.html

    - https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55
    - http://www.brendangregg.com/blog
    - http://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
    - http://www.brendangregg.com/blog/2014-11-22/linux-perf-tools-2014.html
    - http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

    - https://wizardzines.com/
    17 changes: 17 additions & 0 deletions SystemsInterview.md
    @@ -0,0 +1,17 @@
    ### What to Expect
    • This 45-minute systems interview will focus on responding to real world problems with an unhealthy service, such as a web server or database. The interview will start off at a high level troubleshooting a likely scenario, dig deeper to find the cause and some possible solutions for it. The goal is to probe your knowledge of systems at scale and under load, so keep in mind the challenges of the Facebook environment.
    • Depending on how your conversation goes, your interviewer may ask to use CoderPad.
    • Some of the questions may be around scalability, so think of solutions that would apply and be effective in our environment.

    ### Helpful Tips
    • Focus on things that might show up in your average Operating Systems class such as tooling, memory management and unix process lifecycle.
    • Spend time on a linux system — maybe even install one from scratch. Run Linux as your primary desktop environment for a while to force yourself to learn how it works, even though servers != desktops.
    • Brendan Gregg's blog & his book "Systems Performance" may help refresh basic OS material
    • What's it like to be a PE at Facebook?


    ### Systems
    More specifically, linux troubleshooting and debugging. Understanding things like memory, io, cpu, shell, memory etc. would be pretty helpful. Knowing how to actually write a unix shell would also be a good idea. What tools might you use to debug something? On another note, this interview will likely push your boundaries of what you know (and how to implement it).

    ### Design/Architecture 
    This interview is all about taking an ambiguous question of how you might build a system and letting you guide the way. Your interviewer will add in constraints when necessary and the idea is to get a simple, workable solution on the board. Things like load and monitoring are things you might consider. What you consider is just as important as what you don’t. So ask clarifying questions and gather requirements when appropriate.