Last active
June 2, 2021 12:38
Revisions
clakech revised this gist
Sep 21, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -53,7 +53,7 @@
I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

```
# clone the fork
git clone https://github.com/clakech/docker-spark.git
cd docker-spark
# run a master spark node
```
clakech revised this gist
Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -2,7 +2,7 @@
Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

NOTE: Before reading, you should know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)

## Install docker and git
* https://docs.docker.com/installation/
clakech revised this gist
Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -2,7 +2,7 @@
Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

`NOTE: Before reading, you should know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)`

## Install docker and git
* https://docs.docker.com/installation/
clakech revised this gist
Aug 7, 2015. 1 changed file with 3 additions and 1 deletion.
@@ -2,6 +2,8 @@
Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

`NOTE: Before reading, you should know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily here: https://github.com/clakech/sparkassandra-dockerized`

## Install docker and git
* https://docs.docker.com/installation/
* https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

@@ -91,4 +93,4 @@
scala>println(rdd.map(_.getInt("value")).sum)

# THE END of the boring installation part, now eat and digest data to extract value!

> PS: This is not a recommended architecture for Spark & Cassandra, because each Spark worker/slave should run on a Cassandra node in order to get very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned

=> another (better?) way to install a Spark + C* cluster is described here: https://github.com/clakech/sparkassandra-dockerized
clakech revised this gist
Aug 6, 2015. 1 changed file with 0 additions and 2 deletions.
@@ -92,5 +92,3 @@
scala>println(rdd.map(_.getInt("value")).sum)

# THE END of the boring installation part, now eat and digest data to extract value!

> PS: This is not a recommended architecture for Spark & Cassandra, because each Spark worker/slave should run on a Cassandra node in order to get very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
clakech revised this gist
Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -31,7 +31,7 @@
cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
cqlsh>select * from test.kv;

 key | value
------+-------
clakech revised this gist
Aug 6, 2015. 1 changed file with 2 additions and 0 deletions.
@@ -91,4 +91,6 @@
scala>println(rdd.map(_.getInt("value")).sum)

# THE END of the boring installation part, now eat and digest data to extract value!

> PS: This is not a recommended architecture for Spark & Cassandra, because each Spark worker/slave should run on a Cassandra node in order to get very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned

> PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime, because of one line of code in CassandraConnectorConf.scala (line 108 in tag v1.3.0-RC1): InetAddress.getLocalHost.getHostAddress => UnknownHostException. I "fixed" this problem in a really bad way: I removed this line of code and deleted tests so the sbt assembly builds easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
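The failure mode behind that `UnknownHostException` can be checked outside the JVM. A minimal shell sketch (assuming a Linux container where `getent` is available) that tests whether the local hostname resolves, which is roughly the lookup `InetAddress.getLocalHost` performs:

```shell
# Diagnostic sketch: does this machine resolve its own hostname?
# If /etc/hosts (or DNS) has no entry for it, the JVM's
# InetAddress.getLocalHost throws UnknownHostException.
host_name=$(hostname)
if getent hosts "$host_name" > /dev/null 2>&1; then
  status="resolves"
else
  status="unresolved"
fi
echo "hostname ${host_name}: ${status}"
```

On a container started from an image that strips its own entry from /etc/hosts (the workaround described above), this prints "unresolved".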
clakech revised this gist
Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -1,6 +1,6 @@
# How to setup a cluster with Spark + Cassandra using Docker ?

Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

## Install docker and git
* https://docs.docker.com/installation/
clakech revised this gist
Aug 6, 2015. 1 changed file with 7 additions and 6 deletions.
@@ -1,10 +1,12 @@
# How to setup a cluster with Spark + Cassandra using Docker ?

Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try

## Install docker and git
* https://docs.docker.com/installation/
* https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

## Run a Cassandra 2.1 cluster
Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

@@ -41,9 +43,9 @@
cqlsh> select * from test.kv;

Here you have a running and functional C* cluster! #nice

## Run a Spark 1.3 cluster
Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) 1.3.0-RC1 is blazing fast too: https://github.com/epahomov/docker-spark

I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

@@ -87,7 +89,6 @@
scala>println(rdd.map(_.getInt("value")).sum)
10.0

# THE END of the boring installation part, now eat and digest data to extract value!

> PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime, because of one line of code in CassandraConnectorConf.scala (line 108 in tag v1.3.0-RC1): InetAddress.getLocalHost.getHostAddress => UnknownHostException. I "fixed" this problem in a really bad way: I removed this line of code and deleted tests so the sbt assembly builds easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
clakech revised this gist
Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
@@ -90,4 +90,4 @@
scala>println(rdd.map(_.getInt("value")).sum)

THE END of the boring installation part, now eat and digest data to extract value!

PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime, because of one line of code in CassandraConnectorConf.scala (line 108 in tag v1.3.0-RC1): InetAddress.getLocalHost.getHostAddress => UnknownHostException. I "fixed" this problem in a really bad way: I removed this line of code and deleted tests so the sbt assembly builds easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
clakech revised this gist
Aug 6, 2015. 1 changed file with 6 additions and 6 deletions.
@@ -43,13 +43,12 @@
Here you have a running and functional C* cluster! #nice

## Run a Spark cluster
Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) is blazing fast too: https://github.com/epahomov/docker-spark

I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

```
# clone the fork
git clone git@github.com:clakech/docker-spark.git
cd docker-spark
```

@@ -88,6 +87,7 @@
scala>println(rdd.map(_.getInt("value")).sum)
10.0

THE END of the boring installation part, now eat and digest data to extract value!

PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime, because of one line of code in CassandraConnectorConf.scala (line 108 in tag v1.3.0-RC1): InetAddress.getLocalHost.getHostAddress => UnknownHostException. I "fixed" this problem in a really bad way: I removed this line of code and deleted tests so the sbt assembly builds easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
clakech revised this gist
Aug 6, 2015. 1 changed file with 2 additions and 2 deletions.
@@ -1,4 +1,4 @@
# How to setup a cluster with Spark + Cassandra using Docker ?

## Install docker and git
* https://docs.docker.com/installation/

@@ -25,7 +25,6 @@
docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra

And now, create some data and retrieve them:

```
cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
```

@@ -45,6 +44,7 @@
Here you have a running and functional C* cluster! #nice

## Run a Spark cluster
Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazing fast too: https://github.com/epahomov/docker-spark

I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image. Clone my fork of epahomov: https://github.com/clakech/docker-spark
clakech revised this gist
Aug 6, 2015. 1 changed file with 44 additions and 3 deletions.
@@ -1,7 +1,8 @@
# How to setup a Cluster with Spark + Cassandra with Docker ?

## Install docker and git
* https://docs.docker.com/installation/
* https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

## Run a Cassandra cluster
Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

@@ -44,9 +45,49 @@
Here you have a running and functional C* cluster! #nice

## Run a Spark cluster
Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazing fast too: https://github.com/epahomov/docker-spark

I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image. Clone my fork of epahomov: https://github.com/clakech/docker-spark

```
git clone git@github.com:clakech/docker-spark.git
cd docker-spark

# run a master spark node
./start-master.sh

# run some worker spark nodes (1 is enough)
./start-worker.sh

# run a spark shell console to test your cluster
./spark-shell.sh

# check you can retrieve your Cassandra data using Spark
scala>import com.datastax.spark.connector._
...
scala>val rdd = sc.cassandraTable("test", "kv")
rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
scala>println(rdd.count)
2
scala>println(rdd.first)
CassandraRow{key: key1, value: 1}
scala>println(rdd.map(_.getInt("value")).sum)
3.0
scala>val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala>collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
...
scala>println(rdd.map(_.getInt("value")).sum)
10.0
```

PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts.
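As a quick sanity check on the spark-shell sums above: the first sum covers key1=1 and key2=2, and the second also includes the newly saved rows key3=3 and key4=4 (Spark prints them as doubles, 3.0 and 10.0):

```shell
# Row values before the insert: key1=1, key2=2        -> sum 3
# After saving key3=3 and key4=4 back to Cassandra    -> sum 10
before=$((1 + 2))
after=$((before + 3 + 4))
echo "sum before=$before, after=$after"
```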
clakech created this gist
Aug 6, 2015.
@@ -0,0 +1,52 @@
# How to setup a Cluster with Spark + Cassandra with Docker ?

## Install docker
https://docs.docker.com/installation/

## Run a Cassandra cluster
Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

```
# run your first cassandra node
docker run --name some-cassandra -d cassandra:2.1

# (optional) run some other nodes if you wish
docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)" cassandra:2.1
```

Here you have a Cassandra cluster running without installing anything but docker.

To test your cluster, you can run a cqlsh console:

```
# run a cqlsh console to test your cluster
docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra
```

And now, create some data and retrieve them:

```
# then create some data and retrieve it
cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
cqlsh> select * from test.kv;

 key | value
------+-------
 key1 |     1
 key2 |     2

(2 rows)
```

Here you have a running and functional C* cluster! #nice

## Run a Spark cluster
Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazing fast too: https://github.com/epahomov/docker-spark

I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

PS: Because of strangeness with docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts.
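The two-node startup above hinges on passing the first node's IP address as `CASSANDRA_SEEDS`. A minimal sketch of how that command is assembled; the IP below is a hypothetical placeholder, since in a real run it comes from the `docker inspect` call shown in the tutorial:

```shell
# Hypothetical seed IP; in a real run it comes from:
#   SEED_IP=$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)
SEED_IP="172.17.0.2"

# Assemble the command that starts a second node seeded with the first node's IP
cmd="docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS=${SEED_IP} cassandra:2.1"
echo "$cmd"
```

The second node contacts the seed on startup and gossips its way into the ring, which is why no further configuration is needed.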