I just spent several hours trying to configure a pseudo-distributed Hadoop cluster inside a Docker container. I wanted to post our experience in case someone else makes the mistake of trying to do this themselves.
## Problem
When we tried to save a file to HDFS with the Java client, the NameNode appeared to save the file. Using `hdfs dfs -ls` we could see the file was represented in HDFS, but it had a size of 0, indicating no data had made it into the cluster.
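For reference, this is roughly the kind of write we were attempting (a minimal sketch; the `namenode:9000` address and the file path are placeholders, not our real values):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "namenode:9000" stands in for the NameNode container's address.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
            // create() succeeds and the NameNode records the file, but the data
            // below never reaches a DataNode, so the file stays at 0 bytes.
            out.writeBytes("a,b,c\n");
        }
    }
}
```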

A really unhelpful stack trace was also issued by the client. The error was similar to this:
```
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
File foo.csv could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) excluded in this operation.
```
This is just the stack trace from Hadoop, piped back to the client.

## Investigation
Our first clue was in this stack trace. The DataNode was labelled `127.0.0.1:50010`, or something similar. This is incorrect: our client was in a Docker container with IP address `172.18.0.3`, and the DataNode was in a container with IP `172.18.0.4`. From outside of Docker, HDFS operations worked fine.

This told us the problem was probably a networking issue. At first we thought the DataNode was returning its localhost IP as its routable IP, which would fail inside Docker. After wading through the hopelessly useless Hadoop docs and support, we worked out that this was basically true, but it wasn't the actual problem.

The NameNode determines a DataNode's IP by capturing the address it registers from, which was localhost in our case. By default, it then passes this IP along to the client.

## Solution
The issue ultimately came down to a barely-mentioned configuration option in the HDFS client: `dfs.client.use.datanode.hostname`.

The way that HDFS handles a client write is this:
1. Client connects to the HDFS NameNode and asks to save a file
2. NameNode creates a record of the file and decides where in the cluster to save it
3. NameNode tells the client the address of a DataNode to save the file to
4. Client connects to that DataNode and saves the file there

The problem was happening in step 3. Without the property above set to `true`, the NameNode told the client that the DataNode was accessible on `127.0.0.1`, which obviously failed from a Docker container with a different IP. In a standard installation this wouldn't be a problem, but it broke communication between our Docker containers. With the property set to `true`, the NameNode instead gave the client the DataNode's hostname, which in our case was the Docker container ID. Luckily for us, inside a Docker network, container IDs are routable hostnames from anywhere in the network.
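For completeness, here's roughly what the fixed client looks like; a minimal sketch, assuming the flag is set programmatically (it can equally go in the client's `hdfs-site.xml`), and again with a placeholder NameNode address and path:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDockerWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The fix: ask the NameNode to hand back DataNode hostnames in step 3,
        // instead of the 127.0.0.1 address it captured when the DataNode registered.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);

        // "namenode:9000" stands in for the NameNode container's address.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
            // Step 4 now connects to the DataNode container by hostname, which is
            // routable inside the Docker network, so the data actually lands in HDFS.
            out.writeBytes("a,b,c\n");
        }
    }
}
```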