I just spent several hours trying to configure a pseudo-distributed Hadoop cluster inside a Docker container. I wanted to post our experience in case someone else makes the mistake of trying to do this themselves.
## Problem
When we tried to save a file to HDFS with the Java client, the NameNode appeared to save the file. Using `hdfs dfs -ls` we could see the file was represented in HDFS, but it had a size of 0, indicating no data had made it into the cluster.
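For reference, this is roughly the kind of write we were attempting (a minimal sketch; the `namenode:9000` address and the file path are placeholders, not our real values):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "namenode:9000" stands in for the NameNode container's address.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
            // create() succeeds and the NameNode records the file, but the data
            // below never reaches a DataNode, so the file stays at 0 bytes.
            out.writeBytes("a,b,c\n");
        }
    }
}
```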

A really unhelpful stack trace was also issued by the client. The error was similar to this:
```
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
File foo.csv could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) excluded in this operation.
```
This is just the stack trace from Hadoop, piped back to the client.

## Investigation
Our first clue was in this stack trace. The DataNode was labelled `127.0.0.1:50010`, or something similar. This is incorrect: our client was in a Docker container with IP address `172.18.0.3`, and the DataNode was in a container with IP `172.18.0.4`. From outside of Docker, HDFS operations worked fine.

This told us the problem was probably a networking issue. At first we thought the DataNode was returning its localhost IP as its routable IP, which would fail inside Docker. After wading through the hopelessly useless Hadoop docs and support, we worked out that this was basically true, but it wasn't the actual problem.

The NameNode determines a DataNode's IP by capturing the address it registers from, which was localhost in our case. By default, it then passes this IP along to the client.

## Solution
The issue ultimately came down to a barely-mentioned configuration option in the HDFS client: `dfs.client.use.datanode.hostname`.

The way that HDFS handles a client write is this:
1. Client connects to the HDFS NameNode and asks to save a file
2. NameNode creates a record of the file and decides where in the cluster to save it
3. NameNode tells the client the address of a DataNode to save the file to
4. Client connects to that DataNode and saves the file there

The problem was happening in step 3. Without the property above set to `true`, the NameNode told the client that the DataNode was accessible on `127.0.0.1`, which obviously failed from a Docker container with a different IP. In a standard installation this wouldn't be a problem, but it broke communication between our Docker containers. With the property set to `true`, the NameNode instead gave the client the DataNode's hostname, which in our case was the Docker container ID. Luckily for us, inside a Docker network, container IDs are routable hostnames from anywhere in the network.
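For completeness, here's roughly what the fixed client looks like; a minimal sketch, assuming the flag is set programmatically (it can equally go in the client's `hdfs-site.xml`), and again with a placeholder NameNode address and path:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDockerWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The fix: ask the NameNode to hand back DataNode hostnames in step 3,
        // instead of the 127.0.0.1 address it captured when the DataNode registered.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);

        // "namenode:9000" stands in for the NameNode container's address.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
            // Step 4 now connects to the DataNode container by hostname, which is
            // routable inside the Docker network, so the data actually lands in HDFS.
            out.writeBytes("a,b,c\n");
        }
    }
}
```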