@clakech
Last active June 2, 2021 12:38

Revisions

  1. clakech revised this gist Sep 21, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -53,7 +53,7 @@ I just fork his repository and add the fat jar assembly of spark-cassandra-conne

    ```
    # clone the fork
    -git clone git@github.com:clakech/docker-spark.git
    +git clone https://github.com/clakech/docker-spark.git
    cd docker-spark
    # run a master spark node
  2. clakech revised this gist Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -2,7 +2,7 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    -`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)`
    +NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)

    ## Install docker and git
    * https://docs.docker.com/installation/
  3. clakech revised this gist Aug 7, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -2,7 +2,7 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    -`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily here: https://github.com/clakech/sparkassandra-dockerized`
    +`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily [here](https://github.com/clakech/sparkassandra-dockerized)`

    ## Install docker and git
    * https://docs.docker.com/installation/
  4. clakech revised this gist Aug 7, 2015. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion README.md
    @@ -2,6 +2,8 @@

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    +`NOTE: Before reading, you need to know this was my first attempt to create this kind of cluster; I created a GitHub project to set up a cluster more easily here: https://github.com/clakech/sparkassandra-dockerized`

    ## Install docker and git
    * https://docs.docker.com/installation/
    * https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
    @@ -91,4 +93,4 @@ scala>println(rdd.map(_.getInt("value")).sum)

    # THE END of the boring installation part, now eat and digest data to extract value!

    -> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    +> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned => another (better?) way to install a Spark + C* cluster is described here: https://github.com/clakech/sparkassandra-dockerized
  5. clakech revised this gist Aug 6, 2015. 1 changed file with 0 additions and 2 deletions.
    2 changes: 0 additions & 2 deletions README.md
    @@ -92,5 +92,3 @@ scala>println(rdd.map(_.getInt("value")).sum)
    # THE END of the boring installation part, now eat and digest data to extract value!

    > PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    -> PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  6. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -31,7 +31,7 @@ cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'repli
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
    -cqlsh> select * from test.kv;
    +cqlsh>select * from test.kv;
    key | value
    ------+-------
  7. clakech revised this gist Aug 6, 2015. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions README.md
    @@ -91,4 +91,6 @@ scala>println(rdd.map(_.getInt("value")).sum)

    # THE END of the boring installation part, now eat and digest data to extract value!

    +> PS: This is not a recommended architecture for using Spark & Cassandra because each Spark worker/slave should be on a Cassandra node in order to have a very reactive behavior when Spark interacts with C*. #notProductionReady #youHaveBeenWarned
    > PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  8. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -1,6 +1,6 @@
    # How to setup a cluster with Spark + Cassandra using Docker ?

    -Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try
    +Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try machine learning without the pain of installing C* and Spark on your computer.

    ## Install docker and git
    * https://docs.docker.com/installation/
  9. clakech revised this gist Aug 6, 2015. 1 changed file with 7 additions and 6 deletions.
    13 changes: 7 additions & 6 deletions README.md
    @@ -1,10 +1,12 @@
    # How to setup a cluster with Spark + Cassandra using Docker ?

    Spark is hype, Cassandra is cool and docker is awesome. Let's have some "fun" with all of this to be able to try

    ## Install docker and git
    * https://docs.docker.com/installation/
    * https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

    -## Run a Cassandra cluster
    +## Run a Cassandra 2.1 cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

    ```
    @@ -41,9 +43,9 @@ cqlsh> select * from test.kv;

    Here you have a running and functional C* cluster! #nice

    -## Run a Spark cluster
    +## Run a Spark 1.3 cluster

    -Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) is blazingly fast too: https://github.com/epahomov/docker-spark
    +Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) 1.3.0-RC1 is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

    @@ -87,7 +89,6 @@ scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```

    -THE END of the boring installation part, now eat and digest data to extract value!
    +# THE END of the boring installation part, now eat and digest data to extract value!

    -PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    -I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
    +> PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception. I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  10. clakech revised this gist Aug 6, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -90,4 +90,4 @@ scala>println(rdd.map(_.getInt("value")).sum)
    THE END of the boring installation part, now eat and digest data to extract value!

    PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    -I "fix" this problem in a really bad way, remove this line of code and deleting tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
    +I "fix" this problem in a really bad way, remove this line of code and delete tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  11. clakech revised this gist Aug 6, 2015. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions README.md
    @@ -43,13 +43,12 @@ Here you have a running and functional C* cluster! #nice

    ## Run a Spark cluster

    -Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark
    +Thanks to [epahomov](https://github.com/epahomov/docker-spark), running a Spark cluster with the [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) is blazingly fast too: https://github.com/epahomov/docker-spark

    -I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    -Clone my fork of epahomov: https://github.com/clakech/docker-spark
    +I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image: https://github.com/clakech/docker-spark

    ```
    # clone the fork
    git clone git@github.com:clakech/docker-spark.git
    cd docker-spark
    @@ -88,6 +87,7 @@ scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```

    THE END of the boring installation part, now eat and digest data to extract value!


    -PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts.
    +PS: Because of strangeness with Docker hosts management, epahomov did a workaround that removes container-internal name resolution from /etc/hosts. This leads to a problem with spark-cassandra-connector at runtime because of one line of code in CassandraConnectorConf.scala => line 108 in tag v1.3.0-RC1 => InetAddress.getLocalHost.getHostAddress => Unknown Host Exception.
    +I "fix" this problem in a really bad way, remove this line of code and deleting tests to create sbt assembly easily... https://github.com/clakech/spark-cassandra-connector/tree/fearFromThisBranch
  12. clakech revised this gist Aug 6, 2015. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions README.md
    @@ -1,4 +1,4 @@
    -# How to setup a Cluster with Spark + Cassandra with Docker ?
    +# How to setup a cluster with Spark + Cassandra using Docker ?

    ## Install docker and git
    * https://docs.docker.com/installation/
    @@ -25,7 +25,6 @@ docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandr

    And now, create some data and retrieve them:
    ```
    -# then create some data and retrieve it
    cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    @@ -45,6 +44,7 @@ Here you have a running and functional C* cluster! #nice
    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    Clone my fork of epahomov: https://github.com/clakech/docker-spark
  13. clakech revised this gist Aug 6, 2015. 1 changed file with 44 additions and 3 deletions.
    47 changes: 44 additions & 3 deletions README.md
    @@ -1,7 +1,8 @@
    # How to setup a Cluster with Spark + Cassandra with Docker ?

    -## Install docker
    -https://docs.docker.com/installation/
    +## Install docker and git
    +* https://docs.docker.com/installation/
    +* https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

    ## Run a Cassandra cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/
    @@ -44,9 +45,49 @@ Here you have a running and functional C* cluster! #nice
    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.

    Clone my fork of epahomov: https://github.com/clakech/docker-spark

    ```
    git clone git@github.com:clakech/docker-spark.git
    cd docker-spark
    # run a master spark node
    ./start-master.sh
    # run some Spark worker nodes (1 is enough)
    ./start-worker.sh
    # run a spark shell console to test your cluster
    ./spark-shell.sh
    # check you can retrieve your Cassandra data using Spark
    scala>import com.datastax.spark.connector._
    ...
    scala>val rdd = sc.cassandraTable("test", "kv")
    rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
    scala>println(rdd.count)
    2
    scala>println(rdd.first)
    CassandraRow{key: key1, value: 1}
    scala>println(rdd.map(_.getInt("value")).sum)
    3.0
    scala>val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
    collection: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
    scala>collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
    ...
    scala>println(rdd.map(_.getInt("value")).sum)
    10.0
    ```
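The arithmetic in the spark-shell transcript above can be checked with a plain-Python model of the data (a sketch of the values only, not the Spark API; the dict below stands in for the test.kv table):

```python
# Plain model of test.kv: the two rows inserted earlier via cqlsh.
kv = {"key1": 1, "key2": 2}

print(len(kv))           # rdd.count -> 2
print(sum(kv.values()))  # rdd.map(_.getInt("value")).sum -> 3 (Spark prints 3.0)

# collection.saveToCassandra("test", "kv", ...) adds two more rows.
kv.update({"key3": 3, "key4": 4})
print(sum(kv.values()))  # the same sum re-run -> 10 (Spark prints 10.0)
```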



    PS: Because of strangeness with docker hosts management, epahomov did a workaround that remove container internal name resolution from /etc/hosts.
  14. clakech created this gist Aug 6, 2015.
    52 changes: 52 additions & 0 deletions README.md
    @@ -0,0 +1,52 @@
    # How to setup a Cluster with Spark + Cassandra with Docker ?

    ## Install docker
    https://docs.docker.com/installation/

    ## Run a Cassandra cluster
    Thanks to this official docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/

    ```
    # run your first cassandra node
    docker run --name some-cassandra -d cassandra:2.1
    # (optional) run some other nodes if you wish
    docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)" cassandra:2.1
    ```
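For reference, the two `docker run` commands above can be approximated with a Compose file (a hedged sketch: the service names are made up, and it assumes the cassandra:2.1 image accepts a linked hostname in CASSANDRA_SEEDS instead of the inspected IP):

```yaml
# Hypothetical docker-compose.yml mirroring the two commands above (Compose v1 syntax).
cassandra-1:
  image: cassandra:2.1
cassandra-2:
  image: cassandra:2.1
  links:
    - cassandra-1                # gives cassandra-2 an /etc/hosts entry for cassandra-1
  environment:
    CASSANDRA_SEEDS: cassandra-1 # seed by hostname rather than a docker inspect IP
```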

    Here you have a Cassandra cluster running without installing anything but docker.

    To test your cluster, you can run a cqlsh console:
    ```
    # run a cqlsh console to test your cluster
    docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra
    ```

    And now, create some data and retrieve them:
    ```
    # then create some data and retrieve it
    cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
    cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
    cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
    cqlsh> select * from test.kv;
    key | value
    ------+-------
    key1 | 1
    key2 | 2
    (2 rows)
    ```
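Since `key` is the PRIMARY KEY, test.kv behaves like a keyed map: re-inserting an existing key upserts instead of adding a row. A minimal Python model of the rows above (illustrative only, not driver code):

```python
# test.kv modeled as a dict keyed by the PRIMARY KEY column `key`.
kv = {}
kv["key1"] = 1   # INSERT INTO test.kv(key, value) VALUES ('key1', 1);
kv["key2"] = 2   # INSERT INTO test.kv(key, value) VALUES ('key2', 2);
kv["key1"] = 1   # same key again: upsert, row count stays at 2

# SELECT * FROM test.kv;
for key, value in sorted(kv.items()):
    print(key, "|", value)   # two rows, matching the (2 rows) output above
```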

    Here you have a running and functional C* cluster! #nice

    ## Run a Spark cluster

    Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazingly fast too: https://github.com/epahomov/docker-spark

    I just fork his repository and add the fat jar assembly of spark-cassandra-connector into the image.



    PS: Because of strangeness with docker hosts management, epahomov did a workaround that remove container internal name resolution from /etc/hosts.