Thanks to the official Docker image of C*, running a Cassandra cluster is really straightforward: https://registry.hub.docker.com/_/cassandra/
# run your first cassandra node
docker run --name some-cassandra -d cassandra:2.1
# (optional) run some other nodes if you wish
docker run --name some-cassandra2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-cassandra)" cassandra:2.1
Here you have a Cassandra cluster running without installing anything but Docker.
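If you want to check that the nodes actually joined the ring, one handy way (not part of the image documentation, just a quick sanity check) is to run nodetool inside the first container; every node should show up as UN (Up/Normal):
# check the cluster status from inside the first node
docker exec -it some-cassandra nodetool status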
To test your cluster, you can run a cqlsh console:
# run a cqlsh console to test your cluster
docker run -it --link some-cassandra:cassandra --rm cassandra:2.1 cqlsh cassandra
And now, create some data and retrieve it:
# then create some data and retrieve it
cqlsh>CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh>CREATE TABLE test.kv(key text PRIMARY KEY, value int);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key1', 1);
cqlsh>INSERT INTO test.kv(key, value) VALUES ('key2', 2);
cqlsh> select * from test.kv;
 key  | value
------+-------
 key1 |     1
 key2 |     2

(2 rows)
Here you have a running and functional C* cluster! #nice
Thanks to epahomov, running a Spark cluster with the spark-cassandra-connector is blazing fast too: https://github.com/epahomov/docker-spark. I just forked his repository and added the fat jar assembly of the spark-cassandra-connector to the image.
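For the record, all that jar needs is to be on the classpath of the Spark shell and of the executors. A minimal sketch of the idea, assuming the assembly is shipped at /spark-cassandra-connector-assembly.jar inside the image (that path is an assumption, the fork may use another one):
# start a spark shell with the connector fat jar on the driver and executor classpaths
spark-shell --master spark://<master-ip>:7077 --jars /spark-cassandra-connector-assembly.jar
The --jars option both adds the jar to the driver classpath and ships it to the executors, which is why a single fat jar is convenient here.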
Clone my fork of epahomov's repository: https://github.com/clakech/docker-spark
git clone git@github.com:clakech/docker-spark.git
cd docker-spark
# run a master spark node
./start-master.sh
# run some spark worker nodes (1 is enough)
./start-worker.sh
# run a spark shell console to test your cluster
./spark-shell.sh
# check you can retrieve your Cassandra data using Spark
scala>import com.datastax.spark.connector._
...
scala>val rdd = sc.cassandraTable("test", "kv")
rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
scala>println(rdd.count)
2
scala>println(rdd.first)
CassandraRow{key: key1, value: 1}
scala>println(rdd.map(_.getInt("value")).sum)
3.0
scala>val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala>collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
...
scala>println(rdd.map(_.getInt("value")).sum)
10.0
PS: Because of some strangeness with Docker hosts management, epahomov added a workaround that removes container-internal name resolution from /etc/hosts.
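I won't reproduce his script here, but the idea of that kind of workaround is roughly the following (a sketch only, the exact commands in docker-spark may differ):
# drop the line that maps the container's own hostname, so Spark advertises an
# address the other containers can reach; /etc/hosts is bind-mounted by docker,
# hence the copy-then-overwrite instead of editing the file in place
grep -v "$(hostname)" /etc/hosts > /tmp/hosts && cat /tmp/hosts > /etc/hosts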