Running Hadoop

  1. Make a working directory somewhere
mkdir running-hadoop
cd running-hadoop
  2. Clone the fork of big-data-europe's docker-hadoop intended for commodity hardware
git clone https://github.com/wxw-matt/docker-hadoop.git
cd docker-hadoop
  3. Test Docker

I didn't have my Docker running and had to start it.

docker run hello-world
  4. Change versions. We need the v3 compose file, but the docker-compose command looks for an unversioned docker-compose.yml, so swap the names:
mv docker-compose.yml docker-compose-v1.yml
mv docker-compose-v3.yml docker-compose.yml

This will draw two warnings - you can fix the files or ignore both.
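If the warnings are the common Compose notices about the obsolete top-level version: key (an assumption on my part; check your own output), deleting that key is the usual fix:

# assumes the warnings complain about the obsolete 'version:' attribute
sed -i '/^version:/d' docker-compose.yml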

  5. Bring the Hadoop containers up (took me 81.6s). I had previously pulled other Hadoop images, though, so your time may vary.
docker-compose up -d
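To confirm the cluster came up, docker-compose can report per-service status; each service should show "Up", eventually with "(healthy)":

docker-compose ps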
  6. In a Hadoop cluster, we mostly work within the namenode.

I used docker ps to list the containers, then picked the one with "namenode" in its name. For me:

docker ps
CONTAINER ID   IMAGE                                                    COMMAND                  CREATED         STATUS                            PORTS      NAMES
de4a8afd3940   wxwmatt/hadoop-historyserver:2.1.1-hadoop3.3.1-java8     "/entrypoint.sh /run…"   6 seconds ago   Up 4 seconds (health: starting)   8188/tcp   docker-hadoop-historyserver-1
a0e551086c0c   wxwmatt/hadoop-namenode:2.1.1-hadoop3.3.1-java8          "/entrypoint.sh /run…"   6 seconds ago   Up 4 seconds (health: starting)   9870/tcp   docker-hadoop-namenode-1
7b70e2a0bbb3   wxwmatt/hadoop-nodemanager:2.1.1-hadoop3.3.1-java8       "/entrypoint.sh /run…"   6 seconds ago   Up 5 seconds (health: starting)   8042/tcp   docker-hadoop-nodemanager-1
46e45d7b9f49   wxwmatt/hadoop-resourcemanager:2.1.1-hadoop3.3.1-java8   "/entrypoint.sh /run…"   6 seconds ago   Up 4 seconds                      8088/tcp   docker-hadoop-resourcemanager-1
a31b6cbaa2d8   wxwmatt/hadoop-datanode:2.1.1-hadoop3.3.1-java8          "/entrypoint.sh /run…"   6 seconds ago   Up 5 seconds (health: starting)   9864/tcp   docker-hadoop-datanode-1

So I used:

docker exec -it docker-hadoop-namenode-1 /bin/bash

I also tested hello world here:

echo hello world
  7. Set up the node

While within the node, we will need directories for data, for compute, and for results. I created those under a new /app directory at the container's root (later steps use absolute /app/... paths):

mkdir /app
mkdir /app/data
mkdir /app/res
mkdir /app/jars
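Equivalently, mkdir -p builds the whole tree in one call:

mkdir -p /app/data /app/res /app/jars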
  8. Fetch data to /app/data

The 'hello world' of 'map reduce' is 'word count', so we need some words to count. I got two of my favorite books, and also Wuthering Heights, from Project Gutenberg in plaintext format, like so:

cd /app/data
curl https://raw.githubusercontent.com/cd-public/books/main/pg1342.txt -o austen.txt
curl https://raw.githubusercontent.com/cd-public/books/main/pg84.txt -o shelley.txt
curl https://raw.githubusercontent.com/cd-public/books/main/pg768.txt -o bronte.txt
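The same three downloads can be written as one loop (same URLs as above; the ID-to-author pairing is just the mapping already shown):

# fetch each Project Gutenberg ID, saving under the author's name
for pair in pg1342:austen pg84:shelley pg768:bronte; do
  curl "https://raw.githubusercontent.com/cd-public/books/main/${pair%%:*}.txt" -o "${pair##*:}.txt"
done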

It should be easy enough to find other text files on the Internet, but I used these three. Before going further, I verified that I had files of some size:

ls -al

I got:

total 1884
drwxr-xr-x 2 root root   4096 May 28 19:42 .
drwxr-xr-x 5 root root   4096 May 28 19:38 ..
-rw-r--r-- 1 root root 772420 May 28 19:41 austen.txt
-rw-r--r-- 1 root root 693877 May 28 19:42 bronte.txt
-rw-r--r-- 1 root root 448937 May 28 19:42 shelley.txt

The byte size is the column between the second "root" and the date - and all three are nonzero and similar in size (as each is a novel).
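For an extra sanity check, wc gives rough per-file word counts to compare against the MapReduce totals later. wc tokenizes on whitespace, so expect its numbers to differ somewhat from WordCount's:

wc -w *.txt    # rough whitespace-delimited word counts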

  9. Fetch compute to /app/jars

We are using WordCount.jar, which somebody helpfully provides, and which, most critically, is in the repo. We can grab it from the local file system (boring), or we can just curl again:

cd /app/jars
curl -L https://raw.githubusercontent.com/wxw-matt/docker-hadoop/master/jobs/jars/WordCount.jar -o WordCount.jar

(Note the raw.githubusercontent.com URL: the github.com .../blob/... URL returns an HTML page, not the jar, and the output name needs to match the WordCount.jar used below.)

In general, you should not just pull executable files off of the internet and run them, but we are running this in a container, which provides a modicum of safety.

If you don't/can't/won't do a curl, you can just take the local jar, though you'll have to drop out of the container to grab it (unless there are some Docker commands I don't know). From the host, in the cloned docker-hadoop directory:

docker cp jobs/jars/WordCount.jar docker-hadoop-namenode-1:/app/jars/WordCount.jar
  10. Load data into HDFS

Hadoop can only read and write to something called the Hadoop Distributed File System (HDFS), a distributed, replicated file system designed to work well across data centers.

We need to move data from the Linux file system into the Hadoop file system. We use the "hdfs" commands.

cd /
hdfs dfs -mkdir /test-1-input
hdfs dfs -copyFromLocal -f /app/data/*.txt /test-1-input/
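A quick listing confirms all three books landed in HDFS:

hdfs dfs -ls /test-1-input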
  11. Run Hadoop/MapReduce. Note the absolute jar path: we are in / now, and the jar lives in /app/jars.
hadoop jar /app/jars/WordCount.jar WordCount /test-1-input /test-1-output
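One gotcha before re-running: MapReduce refuses to start if the output directory already exists, so clear it between attempts:

hdfs dfs -rm -r /test-1-output    # only needed on re-runs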
  12. Copy the results out of HDFS
hdfs dfs -copyToLocal /test-1-output /app/res/
  13. See the results!
cat /app/res/test-1-output/part-r-00000
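Each line of part-r-00000 is a word, a tab, and a count, so a numeric sort on the second field surfaces the most frequent words (plain sort/head, nothing Hadoop-specific):

sort -k2,2nr /app/res/test-1-output/part-r-00000 | head -20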
  14. I did this probably around 20 times, and it worked about twice (so roughly 10%), and I couldn't figure out why it worked sometimes and not others. Here are some tricks: