Cassandra.Link

6/10/2020

Reading time: 8 min

New in Cassandra 3.0: Materialized Views

by John Doe

Basic rules of data modeling in Cassandra involve manually denormalizing data into separate tables based on the queries that will be run against them. Currently, the only way to query a column without specifying the partition key is to use a secondary index, but secondary indexes are not a substitute for denormalizing data into new tables because they are not fit for high-cardinality data. High-cardinality secondary index queries often require responses from all of the nodes in the ring, which adds latency to each request. Instead, client-side denormalization into multiple independent tables is used, which means the same denormalization code is rewritten by many different users. In 3.0, Cassandra will introduce a new feature called Materialized Views. Materialized views handle automated server-side denormalization, removing the need for client-side handling of this denormalization and ensuring eventual consistency between the base and view data. This denormalization allows for very fast lookups of data in each view using the normal Cassandra read path.
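To see what this replaces, here is a small sketch (not from the original post; the table and index names are hypothetical) of the secondary index approach and the query it enables:

CREATE TABLE plays (
  user TEXT,
  game TEXT,
  score INT,
  PRIMARY KEY (user, game)
)

-- A secondary index lets us query by game without the partition key,
-- but a high-cardinality index query may fan out to many nodes in the ring
CREATE INDEX plays_by_game ON plays (game)

SELECT user, score FROM plays WHERE game = 'Coup'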

An Example

As an example of how materialized views can be used, suppose we want to track the high scores for players of several games. We have a number of queries that we would like to be able to answer:

  1. Given a game, who has the highest score, and what is it?
  2. Given a game and a day, who had the highest score, and what was it?
  3. Given a game and a month, who had the highest score, and what was it?

Materialized views maintain a correspondence of one CQL row each in the base and the view, so we need to ensure that each CQL row which is required for the views will be reflected in the base table's primary keys. For the first query, we will need the game, the player, and their highest score. For the second, we will need the game, the player, their high score, as well as the day, the month, and the year of that high score. For the final query, we need everything from the second except the day. The second query will be the most restrictive, so it determines the primary key we will use. A user can update their high score over the course of a day, so we only need to track the highest score for a particular day.

CREATE TABLE scores

(

  user TEXT,

  game TEXT,

  year INT,

  month INT,

  day INT,

  score INT,

  PRIMARY KEY (user, game, year, month, day)

)

Next, we'll create the view which presents the all-time high scores. To create the materialized view, we provide a simple SELECT statement and the primary key to use for this view. Specifying CLUSTERING ORDER BY allows us to reverse-sort by score so we can get the highest score by simply selecting the first item in the partition.

CREATE MATERIALIZED VIEW alltimehigh AS

       SELECT user FROM scores

       WHERE game IS NOT NULL AND score IS NOT NULL AND user IS NOT NULL AND year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL

       PRIMARY KEY (game, score, user, year, month, day)

       WITH CLUSTERING ORDER BY (score DESC)

To query the daily high scores, we create a materialized view that groups the game title and date together so a single partition contains the values for that date. We do the same for the monthly high scores.

CREATE MATERIALIZED VIEW dailyhigh AS

       SELECT user FROM scores

       WHERE game IS NOT NULL AND year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL AND score IS NOT NULL AND user IS NOT NULL

       PRIMARY KEY ((game, year, month, day), score, user)

       WITH CLUSTERING ORDER BY (score DESC)

CREATE MATERIALIZED VIEW monthlyhigh AS

       SELECT user FROM scores

       WHERE game IS NOT NULL AND year IS NOT NULL AND month IS NOT NULL AND score IS NOT NULL AND user IS NOT NULL AND day IS NOT NULL

       PRIMARY KEY ((game, year, month), score, user, day)

       WITH CLUSTERING ORDER BY (score DESC)

We prime our materialized views with some data. We just insert the data into the scores table, and Cassandra will populate the materialized views accordingly.

INSERT INTO scores (user, game, year, month, day, score) VALUES ('pcmanus', 'Coup', 2015, 05, 01, 4000)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('jbellis', 'Coup', 2015, 05, 03, 1750)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('yukim', 'Coup', 2015, 05, 03, 2250)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('tjake', 'Coup', 2015, 05, 03, 500)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('jmckenzie', 'Coup', 2015, 06, 01, 2000)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('iamaleksey', 'Coup', 2015, 06, 01, 2500)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('tjake', 'Coup', 2015, 06, 02, 1000)

INSERT INTO scores (user, game, year, month, day, score) VALUES ('pcmanus', 'Coup', 2015, 06, 02, 2000)

We can now search for users who have scored the highest ever on our games:

SELECT user, score FROM alltimehigh WHERE game = 'Coup' LIMIT 1

user       | score

-----------+-------

   pcmanus |  4000

And the daily high score:

SELECT user, score FROM dailyhigh WHERE game = 'Coup' AND year = 2015 AND month = 06 AND day = 01 LIMIT 1

user       | score

-----------+-------

iamaleksey |  2500

All of the entries have been copied into the all time high materialized view:

SELECT user, score FROM alltimehigh WHERE game = 'Coup'

user       | score

-----------+-------

   pcmanus |  4000

iamaleksey |  2500

     yukim |  2250

 jmckenzie |  2000

   pcmanus |  2000

   jbellis |  1750

     tjake |  1000

     tjake |   500

Because we have a CQL row in the view for each CQL row in the base table, 'pcmanus' and 'tjake' appear multiple times in the all-time high view, once for each date in the base table.

We can also delete rows from the base table and the materialized view's records will be deleted. We'll delete the tjake rows from the scores table:

DELETE FROM scores WHERE user = 'tjake'

Now, looking at all of the top scores, we don't find the tjake entries anymore:

SELECT user, score FROM alltimehigh WHERE game = 'Coup'

user       | score

-----------+-------

   pcmanus |  4000

iamaleksey |  2500

     yukim |  2250

 jmckenzie |  2000

   pcmanus |  2000

   jbellis |  1750

When a deletion occurs, the materialized view will query the base table for all of the deleted values and generate tombstones for each of the corresponding materialized view rows, because the values that need to be tombstoned in the view are not included in the base table's tombstone. For the single base tombstone, two view tombstones were generated: one for (tjake, 1000) and one for (tjake, 500).
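As a quick check (this query is not in the original post, but the result follows from the data above), the daily view for 2015-06-02 should now hold only the remaining row, pcmanus with 2000, since the tjake row for that day was removed along with its base row:

SELECT user, score FROM dailyhigh WHERE game = 'Coup' AND year = 2015 AND month = 06 AND day = 02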

How it works

  1. Write the base mutation to the batchlog on the coordinator.
  2. Push the base mutations to the base replicas.
  3. Read the local base values.
  4. Write the view mutations to the batchlog.
  5. Push the view mutations to the view replicas.
  6. Apply the base mutation on the base replica and the view mutation on the view replica.
  7. When enough base replicas reply successfully for the consistency level, the coordinator replies success.
  8. Once the view mutations are applied successfully, delete the view batchlog.
  9. Once a quorum of base mutations are applied successfully, delete the base batchlog.

The base replica performs a local read of the data in order to create the correct update for the view. If the primary key of the view has been updated in the base table, a tombstone needs to be generated so that the old value is no longer present in the view. The batchlog is used to provide eventual consistency equivalent to what is provided on the base table. Without the batchlog, if view updates are not applied but the base updates are, the view and the base will be inconsistent with each other. Using the batchlog, however, does add significant overhead, especially since the batchlog must be written to twice.

Creation

When a materialized view is created against a table which has data already, a building process will be kicked off to populate the materialized view. As such, materialized views can be created on existing tables, but there will be a period during which queries against the materialized view may not return all results. This is similar in behavior to how secondary indexes currently work. When the build is complete, the system.built_materializedviews table on each node will be updated with the view's name.
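As a minimal sketch of checking this on a node after the build completes (the post names the table but not its columns, so we simply select everything):

SELECT * FROM system.built_materializedviews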

Altering the Base Table

When the base table is altered, the materialized view is updated as well. If the materialized view has a SELECT * statement, any added columns will be included in the materialized view's columns. Any deleted columns which are part of the SELECT statement will be removed from the materialized view. If a column in the base table is altered, the same alteration will occur in the view table. If the base table is dropped, any associated views will also be dropped.
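As a sketch of this behavior (the added column name is hypothetical, not from the post):

-- Add a new column to the base table
ALTER TABLE scores ADD platform TEXT

-- A view defined with SELECT * would now include platform as well;
-- the views above select only user, so the new column does not appear in them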

When not to use Materialized Views

  • Materialized views do not have the same write performance characteristics that normal table writes have

    The materialized view requires an additional read-before-write, as well as data consistency checks on each replica, before creating the view updates. These add overhead and may increase the latency of writes.

  • If rows are to be combined before being placed in the view, materialized views will not work; a materialized view creates one CQL row in the view for each CQL row in the base table
  • Low cardinality data will create hotspots around the ring

    If all of the data shares the same partition key, the nodes that own that partition will become overloaded. In the alltimehigh materialized view above, if the only game that we stored high scores for was 'Coup', only the nodes which stored 'Coup' would have any view data stored on them.

  • Currently, only simple SELECT statements are supported, but a ticket has been filed to add support for more complex SELECT statements

    WHERE clauses, ORDER BY, and functions aren't available with materialized views

  • If there will be a large number of partition tombstones, the performance may suffer; the materialized view must query for all of the current values and generate a tombstone for each of them. The materialized view will have one tombstone per CQL row deleted in the base table
  • Materialized views are not supported through Thrift
