Druid relies on a distributed filesystem or binary object store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment). In this post, I will show you how to configure Apache Cassandra as deep storage for a Druid cluster.
Druid can use Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables: index_storage and descriptor_storage. Under the hood, the Cassandra integration uses Astyanax. The index_storage table is a Chunked Object repository. It contains compressed segments for distribution to historical nodes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread the write to Cassandra and spread the data across all the nodes in a cluster. The descriptor_storage table is a normal C* table that stores the segment metadata.
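To make the chunked-object idea concrete, here is a small Python sketch. It is an illustration only, not Astyanax's actual algorithm: the chunk size, the chunk naming ("0", "1", ...), and the helper names are all assumptions, and a plain dict stands in for the index_storage table with its (key, chunk) primary key.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed chunk size: tiny on purpose so the example is easy to follow.
CHUNK_SIZE = 4


def split_into_chunks(key, payload, chunk_size=CHUNK_SIZE):
    """Return (key, chunk_id, value) rows, one per fixed-size slice of payload."""
    return [
        (key, str(i // chunk_size), payload[i:i + chunk_size])
        for i in range(0, len(payload), chunk_size)
    ]


def write_chunks(rows, store, workers=4):
    """Write rows concurrently into a dict standing in for the table."""
    def put(row):
        key, chunk_id, value = row
        store[(key, chunk_id)] = value

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every write finishes before we return.
        list(pool.map(put, rows))


def read_object(key, store):
    """Reassemble the payload by concatenating chunks in numeric order."""
    chunk_ids = sorted((c for (k, c) in store if k == key), key=int)
    return b"".join(store[(key, c)] for c in chunk_ids)


store = {}
rows = split_into_chunks("segment-1", b"compressed-bytes")
write_chunks(rows, store)
restored = read_object("segment-1", store)
```

Because each chunk is an independent row, the writes can proceed in parallel, and in a real cluster the rows can land on different nodes.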
I’m assuming you already have Cassandra installed. If it is not installed yet, follow this post to install Apache Cassandra.
Schema
Open a terminal, go to the Cassandra installation directory, and run:
./bin/cqlsh
This opens the Cassandra command-line interface. Now create a new keyspace named druid and switch to it:
CREATE KEYSPACE IF NOT EXISTS druid
  WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE druid;
Now create the schema. Below are the create statements for the two tables:
CREATE TABLE index_storage(key text,
                           chunk text,
                           value blob,
                           PRIMARY KEY (key, chunk)) WITH COMPACT STORAGE;

CREATE TABLE descriptor_storage(key varchar,
                                lastModified timestamp,
                                descriptor varchar,
                                PRIMARY KEY (key)) WITH COMPACT STORAGE;
Extension
druid-cassandra-storage is a community extension and does not ship with the distribution by default. You need to either download the extension jar files from Maven or build it from source.
Here, we’re going to use pull-deps to download the extension from Maven. Go to the dist/druid/ directory in a terminal and run:
java \
  -cp "lib/*" \
  -Ddruid.extensions.directory="extensions" \
  -Ddruid.extensions.hadoopDependenciesDir="hadoop-dependencies" \
  io.druid.cli.Main tools pull-deps \
  --no-default-hadoop \
  -c "io.druid.extensions.contrib:druid-cassandra-storage:0.12.0"
This will download the druid-cassandra-storage extension from Maven into the extensions directory.
Now, in conf/druid/_common/common.runtime.properties, add “druid-cassandra-storage” to druid.extensions.loadList. If, for example, the list already contains “druid-parser-route”, the final property should look like:
druid.extensions.loadList=["druid-parser-route", "druid-cassandra-storage"]
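If you prefer to script this edit rather than make it by hand, the following Python sketch shows one way to do it. It assumes the loadList value is a JSON-style array on a single line, as in the example above. The add_extension helper is hypothetical and is not part of Druid.

```python
import json


def add_extension(properties_text, extension):
    """Return properties_text with extension present in druid.extensions.loadList."""
    prefix = "druid.extensions.loadList="
    out, found = [], False
    for line in properties_text.splitlines():
        if line.strip().startswith(prefix):
            found = True
            # Assumption: the value is a one-line JSON array of strings.
            current = json.loads(line.strip()[len(prefix):])
            if extension not in current:
                current.append(extension)
            line = prefix + json.dumps(current)
        out.append(line)
    if not found:
        # No loadList yet: append a new one containing only this extension.
        out.append(prefix + json.dumps([extension]))
    return "\n".join(out) + "\n"
```

To use it, read common.runtime.properties into a string, pass it through add_extension(text, "druid-cassandra-storage"), and write the result back. Running it twice leaves the list unchanged, so it is safe to repeat.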
Comment out the local storage configuration under the “Deep Storage” section and add the appropriate values for Cassandra. Note that 9160 is Cassandra's default Thrift port. After this, the “Deep Storage” section should look like:
#
# Deep storage
#

# For local disk (only viable in a cluster if this is a network mount):
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

# For HDFS:
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments

# For S3:
#druid.storage.type=s3
#druid.storage.bucket=your-bucket
#druid.storage.baseKey=druid/segments
#druid.s3.accessKey=...
#druid.s3.secretKey=...

# For Cassandra
druid.storage.type=c*
druid.storage.host=localhost:9160
druid.storage.keyspace=druid
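Before restarting Druid, it can be worth checking that the address you put in druid.storage.host actually accepts connections. The Python sketch below does a plain TCP connect using only the standard library. It says nothing about whether the keyspace or tables exist, and the helper names are made up for this post.

```python
import socket


def parse_host_port(value):
    """Split a 'host:port' string such as 'localhost:9160'."""
    host, _, port = value.rpartition(":")
    return host, int(port)


def is_reachable(value, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    host, port = parse_host_port(value)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False
```

For example, is_reachable("localhost:9160") should return True if Cassandra is listening on that port on the local machine.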
You’re done. Now restart the Druid servers for the changes to take effect. To test that it is working, load the sample data into Druid and look at the segment data in the index_storage and descriptor_storage tables using cqlsh.