@clakech
Last active June 2, 2021 12:38

Revisions

  1. clakech revised this gist Sep 21, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -53,7 +53,7 @@ I just fork his repository and add the fat jar assembly of spark-cassandra-conne

    ```
    # clone the fork
    -git clone git@github.com:clakech/docker-spark.git
    +git clone https://github.com/clakech/docker-spark.git
    cd docker-spark
    # run a master spark node
  2. clakech revised this gist Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -2,7 +2,7 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    -`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)`
    +NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)

    ## Install docker and git
    * https://docs.docker.com/installation/
  3. clakech revised this gist Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -2,7 +2,7 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    -`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily here: https://github.com/clakech/sparkassandra-dockerized`
    +`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)`

    ## Install docker and git
    * https://docs.docker.com/installation/
  4. clakech revised this gist Aug 7, 2015. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion README.md
    @@ -2,6 +2,8 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    +`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily here: https://github.com/clakech/sparkassandra-dockerized`

    ## Install docker and git
    * https://docs.docker.com/installation/
    * https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
    @@ -91,4 +93,4 @@ scala>println(rdd.map(_.getInt("value")).sum)

    # THE END of the boring installation part, now eat and digest data to extract value!

    -> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    +> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned => another (better?) way to install a Spark + C* cluster is described here: https://github.com/clakech/sparkassandra-dockerized
  5. clakech revised this gist Aug 6, 2015. 1 changed file with 0 additions and 2 deletions.
    2 changes: 0 additions & 2 deletions README.md
    @@ -92,5 +92,3 @@ scala>println(rdd.map(_.getInt("value")).sum)
    # THE END of the boring installation part, now eat and digest data to extract value!

    > PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    -> PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  6. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -31,7 +31,7 @@ cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'repli
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
    -cqlsh> select * from test.kv;
    +cqlsh>select * from test.kv;
    key | value
    ------+-------
  7. clakech revised this gist Aug 6, 2015. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions README.md
    @@ -91,4 +91,6 @@ scala>println(rdd.map(_.getInt("value")).sum)

    # THE END of the boring installation part, now eat and digest data to extract value!

    +> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    > PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  8. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -1,6 +1,6 @@
    # How to setup a cluster with Spark + Cassandra using Docker ?

    -Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try
    +Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    ## Install docker and git
    * https://docs.docker.com/installation/
  9. clakech revised this gist Aug 6, 2015. 1 changed file with 7 additions and 6 deletions.
    13 changes: 7 additions & 6 deletions README.md
    @@ -1,10 +1,12 @@
    # How to setup a cluster with Spark + Cassandra using Docker ?

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try

    ## Install docker and git
    * https://docs.docker.com/installation/
    * https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

    -## Run a Cassandra cluster
    +## Run a Cassandra 2.1 cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

    ```
    @@ -41,9 +43,9 @@ cqlsh> select * from test.kv;

    Here you have a running and functional C* cluster! #nice

    -## Run a Spark cluster
    +## Run a Spark 1.3 cluster

    -Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) is blazingly fast too: https://github.com/epahomov/docker-spark
    +Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) 1.3.0-RC1 is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

    @@ -87,7 +89,6 @@ scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```

    -THE END of the boring installation part, now eat and digest data to extract value!
    +# THE END of the boring installation part, now eat and digest data to extract value!

    -PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    -I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
    +> PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  10. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -90,4 +90,4 @@ scala>println(rdd.map(_.getInt("value")).sum)
    THE END of the boring installation part, now eat and digest data to extract value!

    PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    -I "fix" this problem in a really bad way, remove this line of code and deleting tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
    +I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  11. clakech revised this gist Aug 6, 2015. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions README.md
    @@ -43,13 +43,12 @@ Here you have a running and functional C* cluster! #nice

    ## Run a Spark cluster

    -Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark
    +Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) is blazingly fast too: https://github.com/epahomov/docker-spark

    -I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    -Clone my fork of epahomov: https://github.com/clakech/docker-spark
    +I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

    ```
    # clone the fork
    git clone git@github.com:clakech/docker-spark.git
    cd docker-spark
    @@ -88,6 +87,7 @@ scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```

    THE END of the boring installation part, now eat and digest data to extract value!


    -PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts.
    +PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    +I "fix" this problem in a really bad way, remove this line of code and deleting tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  12. clakech revised this gist Aug 6, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions README.md
    @@ -1,4 +1,4 @@
    -# How to setup a Cluster with Spark + Cassandra with Docker ?
    +# How to setup a cluster with Spark + Cassandra using Docker ?

    ## Install docker and git
    * https://docs.docker.com/installation/
    @@ -25,7 +25,6 @@ docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandr

    And now, create some data and retrieve them:
    ```
    -# then create some data and retrieve it
    cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    @@ -45,6 +44,7 @@ Here you have a running and functional C* cluster! #nice
    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    Clone my fork of epahomov: https://github.com/clakech/docker-spark
  13. clakech revised this gist Aug 6, 2015. 1 changed file with 44 additions and 3 deletions.
    47 changes: 44 additions & 3 deletions README.md
    @@ -1,7 +1,8 @@
    # How to setup a Cluster with Spark + Cassandra with Docker ?

    -## Install docker
    -https://docs.docker.com/installation/
    +## Install docker and git
    +* https://docs.docker.com/installation/
    +* https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

    ## Run a Cassandra cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/
    @@ -44,9 +45,49 @@ Here you have a running and functional C* cluster! #nice
    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    Clone my fork of epahomov: https://github.com/clakech/docker-spark

    ```
    git clone git@github.com:clakech/docker-spark.git
    cd docker-spark
    # run a master spark node
    ./start-master.sh
    # run some Spark worker nodes (1 is enough)
    ./start-worker.sh
    # run a spark shell console to test your cluster
    ./spark-shell.sh
    # check you can retrieve your Cassandra data using Spark
    scala>import com.datastax.spark.connector._
    ...
    scala>val rdd = sc.cassandraTable("test", "kv")
    rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
    scala>println(rdd.count)
    2
    scala>println(rdd.first)
    CassandraRow{key: key1, value: 1}
    scala>println(rdd.map(_.getInt("value")).sum)
    3.0
    scala>val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
    collection: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
    scala>collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
    ...
    scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```
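The arithmetic in the spark-shell transcript above can be checked with a plain-Python model of the data (a sketch of the values only, not the Spark API; the dict below stands in for the test.kv table):

```python
# Plain model of test.kv: the two rows inserted earlier via cqlsh.
kv = {"key1": 1, "key2": 2}

print(len(kv))           # rdd.count -> 2
print(sum(kv.values()))  # rdd.map(_.getInt("value")).sum -> 3 (Spark prints 3.0)

# collection.saveToCassandra("test", "kv", ...) adds two more rows.
kv.update({"key3": 3, "key4": 4})
print(sum(kv.values()))  # the same sum re-run -> 10 (Spark prints 10.0)
```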



    PS: Because of strangeness with docker hosts management, epahomov did a workaround that remove container internal name resolution from /etc/hosts.
  14. clakech created this gist Aug 6, 2015.
    52 changes: 52 additions & 0 deletions README.md
    @@ -0,0 +1,52 @@
    # How to setup a Cluster with Spark + Cassandra with Docker ?

    ## Install docker
    https://docs.docker.com/installation/

    ## Run a Cassandra cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

    ```
    # run your first cassandra node
    docker run --name some-cassandra -d cassandra:2.1
    # (optional) run some other nodes if you wish
    docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)" cassandra:2.1
    ```
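For reference, the two `docker run` commands above can be approximated with a Compose file (a hedged sketch: the service names are made up, and it assumes the cassandra:2.1 image accepts a linked hostname in CASSANDRA_SEEDS instead of the inspected IP):

```yaml
# Hypothetical docker-compose.yml mirroring the two commands above (Compose v1 syntax).
cassandra-1:
  image: cassandra:2.1
cassandra-2:
  image: cassandra:2.1
  links:
    - cassandra-1                # gives cassandra-2 an /etc/hosts entry for cassandra-1
  environment:
    CASSANDRA_SEEDS: cassandra-1 # seed by hostname rather than a docker inspect IP
```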

    Here you have a Cassandra cluster running without installing anything but docker.

    To test your cluster, you can run a cqlsh console:
    ```
    # run a cqlsh console to test your cluster
    docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra
    ```

    And now, create some data and retrieve them:
    ```
    # then create some data and retrieve it
    cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
    cqlsh> select * from test.kv;
    key | value
    ------+-------
    key1 | 1
    key2 | 2
    (2 rows)
    ```
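Since `key` is the PRIMARY KEY, test.kv behaves like a keyed map: re-inserting an existing key upserts instead of adding a row. A minimal Python model of the rows above (illustrative only, not driver code):

```python
# test.kv modeled as a dict keyed by the PRIMARY KEY column `key`.
kv = {}
kv["key1"] = 1   # INSERT INTO test.kv(key, value) VALUES ('key1', 1);
kv["key2"] = 2   # INSERT INTO test.kv(key, value) VALUES ('key2', 2);
kv["key1"] = 1   # same key again: upsert, row count stays at 2

# SELECT * FROM test.kv;
for key, value in sorted(kv.items()):
    print(key, "|", value)   # two rows, matching the (2 rows) output above
```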

    Here you have a running and functional C* cluster! #nice

    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.



    PS: Because of strangeness with docker hosts management, epahomov did a workaround that remove container internal name resolution from /etc/hosts.