Source: http://datahugger.org/datascience/setting-up-hadoop-v2-with-spark-v1-on-osx-using-homebrew/

This post builds on the previous Hadoop (v1) setup guide to explain how to set up a single-node Hadoop (v2) cluster with Spark (v1) on OSX (10.9.5).

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The Apache Hadoop framework is composed of the following core modules:

    HDFS (Distributed File System): a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
    YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
MapReduce: a programming model for large-scale data processing. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Although the Hadoop framework is implemented in Java, any programming language can be used with Hadoop Streaming to implement the “map” and “reduce” functions. Apache Pig and Spark expose higher-level user interfaces like Pig Latin and a SQL variant respectively.

Apache Spark is a fast and general processing engine compatible with Hadoop. It can run in Hadoop clusters through YARN or in a standalone mode, and it can process data in HDFS, HBase, Hive, and any Hadoop InputFormat. Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Spark has easy-to-use APIs (e.g. Scala or Python) for operating on large datasets in batch, interactive or streaming modes. Spark provides a unified engine, packaged with higher-level libraries providing support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
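
To make the map/reduce programming model and Spark’s higher-level API a little more concrete, here is a minimal word-count sketch in Scala that can be typed into the spark-shell set up later in this guide (the HDFS path and {username} are placeholders; no such input file is created by this guide):

// minimal word-count sketch for the spark-shell (Spark 1.x RDD API)
val lines = sc.textFile("hdfs://localhost:9000/user/{username}/input.txt")  // placeholder input file
val words = lines.flatMap(line => line.split(" "))   // "map" side: split each line into words
val pairs = words.map(word => (word, 1))             // emit (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)                // "reduce" side: sum the counts per word
counts.take(10).foreach(println)                     // print a small sample of the result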

    Instructions

    STEP 1 – PREPARE ENVIRONMENT

First, remove any old versions of Hadoop

brew cleanup hadoop

Next, update the Homebrew formulae

    brew update
    brew upgrade
    brew cleanup

    Check versions in Homebrew formulae (as of 10/10/14)

    brew info hadoop = 2.5.1
    brew info apache-spark = 1.1.0
    brew info scala = 2.11.2
    brew info sbt = 0.13.6

    STEP 2 – INSTALL ENVIRONMENT

    Install Hadoop

    brew install hadoop

    Install Spark (and dependencies)

    brew install apache-spark scala sbt

    STEP 3 – CONFIGURE ENVIRONMENT

Optionally, set the environment variables in your shell profile; by default they are set in the Hadoop or YARN environment shell scripts. Edit your bash profile (‘nano ~/.bash_profile’), add the lines below, save, and then force the terminal to refresh (‘source ~/.bash_profile’).

# set environment variables
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.5.1
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
# despite being named SCALA_HOME in the original guide, this is the Homebrew apache-spark prefix
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.1.0

# set path variables
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin

# set aliases for the start & stop scripts (quoted so each alias stays a single command)
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-dfs.sh;$HADOOP_HOME/sbin/stop-yarn.sh"
Configure passphraseless SSH to localhost and check that Remote Login is enabled (System Preferences >> Sharing). You can verify the setup afterwards with ‘ssh localhost’.

1. ssh-keygen -t rsa
   (press enter at each prompt)
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. chmod og-wx ~/.ssh/authorized_keys

    STEP 4 – CONFIGURE HADOOP (FOR PSEUDO-DISTRIBUTED MODE)

The following instructions configure Hadoop as a single node in pseudo-distributed mode with MapReduce job execution on YARN. Alternative configurations are pseudo-distributed mode with local MapReduce job execution, local/standalone mode, or fully distributed mode.

    Move to the Hadoop libexec directory and edit the configuration files (e.g. ‘nano {filename}’)

cd /usr/local/Cellar/hadoop/2.5.1/libexec/

    Edit ‘etc/hadoop/hadoop-env.sh’:

    # this fixes the "scdynamicstore" warning
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

    Edit ‘etc/hadoop/core-site.xml’:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/hdfs-site.xml’:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/mapred-site.xml’:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/yarn-site.xml’:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

    STEP 5 – START USING HADOOP (EXECUTE MAPREDUCE JOB)

    Move to the Hadoop root directory

    cd /usr/local/Cellar/hadoop/2.5.1

    Format the Hadoop HDFS filesystem

    ./bin/hdfs namenode -format

    Start the NameNode daemon & DataNode daemon

    ./sbin/start-dfs.sh

    Browse the web interface for the NameNode

    http://localhost:50070/

    Start ResourceManager daemon and NodeManager daemon:

    ./sbin/start-yarn.sh

Check that the daemons are all running; jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself):

    jps

    Browse the web interface for the ResourceManager

    http://localhost:8088/

    Create the HDFS directories required to execute MapReduce jobs:

    ./bin/hdfs dfs -mkdir -p /user/{username}

    Run an example MapReduce job (calculate pi)

    # calculate pi
    ./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar pi 10 100

    Try more examples / experiments, or stop the daemons

    ./sbin/stop-dfs.sh
    ./sbin/stop-yarn.sh

    STEP 6 – START USING SPARK

    Spark has already earned a huge fan base and community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program – hence I want to learn how to use it.

    Move to the Spark directory

    cd /usr/local/Cellar/apache-spark/1.1.0

    Run an example Spark application (calculate pi)

    ./bin/run-example SparkPi

Let’s try working with the Spark (Scala) shell, which provides a simple way to learn the API and a powerful framework for analysing data interactively. A special interpreter-aware SparkContext is automatically created and assigned to the variable ‘sc’.

    Start the Spark shell (local or yarn mode)

    # use the spark shell (local with 1 thread)
    ./bin/spark-shell
    # or ... (local with 4 threads)
    ./bin/spark-shell --master local[4]
    # or ... (yarn)
    ./bin/spark-shell --master yarn
    # or ... (use the help flag for more options)
    ./bin/spark-shell --help

Every Spark context launches a web interface for monitoring

    http://localhost:4040/

Try some basic Scala programming in the shell (use the ‘exit’ command to end the session)

    println("Hello, World!")

    val a = 5
    a + 3

    sc.parallelize(1 to 1000).count()
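
Behind the count() call above is Spark’s transformation/action pattern: transformations (filter, map, ...) are lazy and only describe a computation, while actions (count, reduce, ...) actually run it. A slightly longer sketch along the same lines (the numbers are arbitrary):

// a sketch of the basic transformation / action pattern on an RDD
val nums = sc.parallelize(1 to 1000)          // distribute a local range as an RDD
val evens = nums.filter(_ % 2 == 0)           // transformation: lazy, nothing runs yet
val squares = evens.map(n => n.toLong * n)    // another lazy transformation
squares.reduce(_ + _)                         // action: triggers the job and returns the sum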

    exit

    Let’s try to execute an example Spark application on the Hadoop cluster using YARN. There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

    # pattern to launch an application in yarn-cluster mode
    ./bin/spark-submit --class <path.to.class> --master yarn-cluster [options] <app.jar> [options]

    # run example application (calculate pi)
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster libexec/lib/spark-examples-*.jar
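
When submitting your own application rather than the bundled examples, the --class argument names an object with a main method inside the jar passed to spark-submit. A rough sketch of what such an application looks like for Spark 1.x (the object name, file path and computation are placeholder choices, not the bundled SparkPi source; it can be packaged into a jar with the sbt installed in STEP 2):

// e.g. src/main/scala/SimpleApp.scala, packaged with 'sbt package'
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // the master (e.g. yarn-cluster) is supplied by spark-submit, not hard-coded here
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // a trivial distributed computation
    val count = sc.parallelize(1 to 100000).filter(_ % 7 == 0).count()
    println("Multiples of 7: " + count)
    sc.stop()
  }
}

It would then be launched with the same spark-submit pattern shown above, passing --class SimpleApp and the path to the packaged jar.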

    THE END!