Source: http://datahugger.org/datascience/setting-up-hadoop-v2-with-spark-v1-on-osx-using-homebrew/

This post builds on the previous Hadoop (v1) setup guide to explain how to set up a single-node Hadoop (v2) cluster with Spark (v1) on OSX (10.9.5).

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The Apache Hadoop framework is composed of the following core modules:

    HDFS (Distributed File System): a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
    YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
MapReduce: a programming model for large-scale data processing. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Although the Hadoop framework is implemented in Java, any programming language can be used with Hadoop Streaming to implement the “map” and “reduce” functions. Apache Pig and Spark expose higher-level user interfaces like Pig Latin and a SQL variant respectively.

Apache Spark is a fast and general processing engine compatible with Hadoop. It can run in Hadoop clusters through YARN or in a standalone mode, and it can process data in HDFS, HBase, Hive, and any Hadoop InputFormat. Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Spark has easy-to-use APIs (e.g. Scala or Python) for operating on large datasets in batch, interactive or streaming modes. Spark provides a unified engine, packaged with higher-level libraries providing support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
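
To make the map/reduce programming model and Spark’s higher-level API a little more concrete, here is a minimal word-count sketch in Scala that can be typed into the spark-shell set up later in this guide (the HDFS path and {username} are placeholders; no such input file is created by this guide):

// minimal word-count sketch for the spark-shell (Spark 1.x RDD API)
val lines = sc.textFile("hdfs://localhost:9000/user/{username}/input.txt")  // placeholder input file
val words = lines.flatMap(line => line.split(" "))   // "map" side: split each line into words
val pairs = words.map(word => (word, 1))             // emit (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)                // "reduce" side: sum the counts per word
counts.take(10).foreach(println)                     // print a small sample of the result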

    Instructions

    STEP 1 – PREPARE ENVIRONMENT

First, remove any old versions of Hadoop

brew cleanup hadoop

Next, update the Homebrew formulae

    brew update
    brew upgrade
    brew cleanup

    Check versions in Homebrew formulae (as of 10/10/14)

    brew info hadoop = 2.5.1
    brew info apache-spark = 1.1.0
    brew info scala = 2.11.2
    brew info sbt = 0.13.6

    STEP 2 – INSTALL ENVIRONMENT

    Install Hadoop

    brew install hadoop

    Install Spark (and dependencies)

    brew install apache-spark scala sbt

    STEP 3 – CONFIGURE ENVIRONMENT

Optionally, set the environment variables in your shell profile; by default they are set in the Hadoop or YARN environment shell scripts. Edit your bash profile (‘nano ~/.bash_profile’), add the lines below, save, and then force the terminal to refresh (‘source ~/.bash_profile’).

# set environment variables
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.5.1
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
# despite being named SCALA_HOME in the original guide, this is the Homebrew apache-spark prefix
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.1.0

# set path variables
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin

# set aliases for the start & stop scripts (quoted so each alias stays a single command)
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-dfs.sh;$HADOOP_HOME/sbin/stop-yarn.sh"
Configure passphraseless SSH to localhost and check that Remote Login is enabled (System Preferences >> Sharing). You can verify the setup afterwards with ‘ssh localhost’.

1. ssh-keygen -t rsa
   (press enter at each prompt)
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. chmod og-wx ~/.ssh/authorized_keys

    STEP 4 – CONFIGURE HADOOP (FOR PSEUDO-DISTRIBUTED MODE)

The following instructions configure Hadoop as a single node in pseudo-distributed mode with MapReduce job execution on YARN. Alternative configurations are pseudo-distributed mode with local MapReduce job execution, local/standalone mode, or fully distributed mode.

    Move to the Hadoop libexec directory and edit the configuration files (e.g. ‘nano {filename}’)

cd /usr/local/Cellar/hadoop/2.5.1/libexec/

    Edit ‘etc/hadoop/hadoop-env.sh’:

    # this fixes the "scdynamicstore" warning
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

    Edit ‘etc/hadoop/core-site.xml’:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/hdfs-site.xml’:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/mapred-site.xml’:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

    Edit ‘etc/hadoop/yarn-site.xml’:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

    STEP 5 – START USING HADOOP (EXECUTE MAPREDUCE JOB)

    Move to the Hadoop root directory

    cd /usr/local/Cellar/hadoop/2.5.1

    Format the Hadoop HDFS filesystem

    ./bin/hdfs namenode -format

    Start the NameNode daemon & DataNode daemon

    ./sbin/start-dfs.sh

    Browse the web interface for the NameNode

    http://localhost:50070/

    Start ResourceManager daemon and NodeManager daemon:

    ./sbin/start-yarn.sh

Check that the daemons are all running; jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself):

    jps

    Browse the web interface for the ResourceManager

    http://localhost:8088/

    Create the HDFS directories required to execute MapReduce jobs:

    ./bin/hdfs dfs -mkdir -p /user/{username}

    Run an example MapReduce job (calculate pi)

    # calculate pi
    ./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar pi 10 100

    Try more examples / experiments, or stop the daemons

    ./sbin/stop-dfs.sh
    ./sbin/stop-yarn.sh

    STEP 6 – START USING SPARK

    Spark has already earned a huge fan base and community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program – hence I want to learn how to use it.

    Move to the Spark directory

    cd /usr/local/Cellar/apache-spark/1.1.0

    Run an example Spark application (calculate pi)

    ./bin/run-example SparkPi

Let’s try working with the Spark (Scala) shell, which provides a simple way to learn the API and a powerful framework for analysing data interactively. A special interpreter-aware SparkContext is automatically created and assigned to the variable ‘sc’.

    Start the Spark shell (local or yarn mode)

    # use the spark shell (local with 1 thread)
    ./bin/spark-shell
    # or ... (local with 4 threads)
    ./bin/spark-shell --master local[4]
    # or ... (yarn)
    ./bin/spark-shell --master yarn
    # or ... (use the help flag for more options)
    ./bin/spark-shell --help

Every Spark context launches a web interface for monitoring

    http://localhost:4040/

Try some basic Scala programming in the shell (use the ‘exit’ command to end the session)

    println("Hello, World!")

    val a = 5
    a + 3

    sc.parallelize(1 to 1000).count()
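
Behind the count() call above is Spark’s transformation/action pattern: transformations (filter, map, ...) are lazy and only describe a computation, while actions (count, reduce, ...) actually run it. A slightly longer sketch along the same lines (the numbers are arbitrary):

// a sketch of the basic transformation / action pattern on an RDD
val nums = sc.parallelize(1 to 1000)          // distribute a local range as an RDD
val evens = nums.filter(_ % 2 == 0)           // transformation: lazy, nothing runs yet
val squares = evens.map(n => n.toLong * n)    // another lazy transformation
squares.reduce(_ + _)                         // action: triggers the job and returns the sum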

    exit

    Let’s try to execute an example Spark application on the Hadoop cluster using YARN. There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

    # pattern to launch an application in yarn-cluster mode
    ./bin/spark-submit --class <path.to.class> --master yarn-cluster [options] <app.jar> [options]

    # run example application (calculate pi)
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster libexec/lib/spark-examples-*.jar
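
When submitting your own application rather than the bundled examples, the --class argument names an object with a main method inside the jar passed to spark-submit. A rough sketch of what such an application looks like for Spark 1.x (the object name, file path and computation are placeholder choices, not the bundled SparkPi source; it can be packaged into a jar with the sbt installed in STEP 2):

// e.g. src/main/scala/SimpleApp.scala, packaged with 'sbt package'
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // the master (e.g. yarn-cluster) is supplied by spark-submit, not hard-coded here
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // a trivial distributed computation
    val count = sc.parallelize(1 to 100000).filter(_ % 7 == 0).count()
    println("Multiples of 7: " + count)
    sc.stop()
  }
}

It would then be launched with the same spark-submit pattern shown above, passing --class SimpleApp and the path to the packaged jar.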

    THE END!