    1. How to log into slaves
    ---
One can log into the slave nodes using SSH agent forwarding. Suppose that I find the following in the YARN log:
    ```
    java.nio.channels.CancelledKeyException
    14/09/25 23:12:58 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-172-31-xx-xx.ec2.internal,49972)
    java.io.IOException: Connection reset by peer
    ...
    ```
I would like to SSH to the slave `ip-172-31-xx-xx.ec2.internal` to check the logs. To do so, we first add the EMR identity to the agent:
    ```
    you@localhost: ssh-add ~/EC2_KEY.pem
    ```
and then SSH to the master with agent forwarding enabled:
    ```
you@localhost: ssh -A hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com -i ~/EC2_KEY.pem
    ```
and then we are able to log into the slave node:
    ```
hadoop@master: ssh hadoop@ip-172-31-xx-xx.ec2.internal
    ```
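If the second hop fails with a public key error, a quick sanity check (a minimal sketch) is to confirm that the forwarded agent on the master actually carries our identity:
```
hadoop@master: ssh-add -l
```
If nothing is listed, the `-A` flag was likely dropped, or the key was never added on the local machine.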

    2. Find the lost block manager/container
    ---
Sometimes something goes wrong and we end up with a `FetchFailed` error:
    ```
    FetchFailed(BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0), shuffleId=83, mapId=44, reduceId=49)
    ```
First we check the application master in the resource manager:
    ```
    http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/app/application_1413826885383_0001
    ```
and then click on the nodes list:
    ```
    http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/nodes
    ```
Find `ip-172-31-xx-xx.ec2.internal` in the list and click into it:
    ```
    http://ip-172-31-xx-xx.ec2.internal:9035/
    ```
where we can check the `List of Containers` and see that no container is using `BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0)`.
This tells us that our block manager is gone. Now we want to find the container
that was using the killed block manager.

    Go to `Tools>Local logs` on the left:
    ```
    http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/
    ```
We check each of the container folders to find the one that used the block manager,
say `container_1413826885383_0001_01_000086`; the block manager shows up in the following log:
    ```
    http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/container_1413826885383_0001_01_000086/stderr
    ```
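Clicking through every container folder can be tedious. Since these logs also live on the node's local disk (the `userlogs` path appears in the `ps` output later in this note), a hedged shortcut is to grep for something tied to the block manager, e.g., its port `52487`:
```
hadoop@internal: grep -l 52487 /mnt/var/log/hadoop/userlogs/application_1413826885383_0001/*/stderr
```
`grep -l` prints only the names of the matching files, i.e., the container directories of interest.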

Now we would like to know why the container `container_1413826885383_0001_01_000086` was killed.
To do so we investigate the log file at
    ```
    http://ip-172-31-xx-xx.ec2.internal:9035/logs/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log.2014-10-20-18
    ```
There we find that the container was killed because of some problem, e.g., the memory limit was exceeded:
    ```
    Limit=11274289152, current usage = 11443380224
    2014-10-20 18:00:20,624 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=3377,containerID=container_1413826885383_0001_01_000086] is running beyond physical memory limits. Current usage: 10.7 GB of 10.5 GB physical memory used; 11.3 GB of 52.5 GB virtual memory used. Killing container.
    ```
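The same kill messages can be pulled out of the node manager log directly on the node; a small sketch, assuming the log location used later in this note:
```
hadoop@internal: grep 'beyond physical memory' /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log*
```
The trailing `*` also picks up the hourly rotated files such as `...log.2014-10-20-18`.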

    3. Profile a live container
    ---
Once we have identified a container that causes the problem, we can dump the memory
of the process that is running it. After logging into the
internal node, we can check the node manager logs, e.g.,
    ```
    hadoop@internal:tail /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log
    ```
where we can find the status and memory usage of each container. For example, for the container `container_1414015957471_0001_01_000015`:
    ```
    2014-10-22 22:22:26,574 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 3929 for container-id container_1414015957471_0001_01_000015: 7.3 GB of 20.5 GB physical memory used; 21.5 GB of 102.5 GB virtual memory used
    ```
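To watch one container's memory in near real time rather than re-running `tail`, a simple sketch is to follow the log and filter on the container id:
```
hadoop@internal: tail -f /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log | grep container_1414015957471_0001_01_000015
```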

Suppose we want to analyze the memory usage of this container. The first thing we need
is the system process id of the process running it. To find it, we can search
the system process list:
    ```
hadoop@internal:ps aux | grep container_1414015957471_0001_01_000015
    ```
and we find:
    ```
hadoop 3929 0.0 0.0 112916 1216 ? Ss 22:14 0:00 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms20480m -Xmx20480m -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp '-Dspark.akka.timeout=200' '-Dspark.akka.frameSize=100' org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@ip-172-31-xx-xx.ec2.internal:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6 1> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stdout 2> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stderr
hadoop 3940 58.6 14.4 22484956 8887668 ? Sl 22:14 5:59 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms20480m -Xmx20480m -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp -Dspark.akka.timeout=200 -Dspark.akka.frameSize=100 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@ip-172-31-xx-xx.ec2.internal:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6
    hadoop 8006 0.0 0.0 110280 860 pts/0 S+ 22:24 0:00 grep container_1414015957471_0001_01_000015
    ```
where the first entry is the launching bash script and the last one is our own grep. Thus the pid of
the process running the container is `3940`. We can now use the Java utilities, e.g.,
`jstack`, to inspect the current stack:
    ```
    hadoop@internal:jstack 3940 | vim -
    ```
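If we would rather not eyeball the `ps` output, a hedged alternative (assuming the JDK's `jps` is on the `PATH`) is to list only Java processes along with their JVM arguments, which include the container id:
```
hadoop@internal: jps -v | grep container_1414015957471_0001_01_000015
```
Unlike `ps aux | grep`, this matches neither the wrapper bash script nor the grep itself.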
We can also dump the current heap to a file:
    ```
    hadoop@internal:jmap -dump:format=b,file=/mnt/3940.dump 3940
    ```
and then use `jhat` or `VisualVM` to analyze the heap. For example, running
    ```
    hadoop@internal:jhat /mnt/3940.dump
    ```
makes `jhat` analyze the dump file and set up a web server on port `7000`.
Then, through a SOCKS connection (see the dynamic-forwarding sketch at the end of this note), we can access the page at
    ```
    http://ip-172-31-xx-xx.ec2.internal:7000/
    ```
and view the instance counts at:
    ```
    http://ip-172-31-xx-xx.ec2.internal:7000/showInstanceCounts/
    ```
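For completeness, the SOCKS connection mentioned above can be opened from the local machine with SSH dynamic port forwarding; a sketch in which the local port `8157` is an arbitrary choice:
```
you@localhost: ssh -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com -i ~/EC2_KEY.pem
```
The browser (or a proxy extension) is then pointed at the SOCKS proxy `localhost:8157`, after which the `*.ec2.internal` addresses above resolve through the master node.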