CraftCode Crew

A deep dive into Apache Cassandra – Part 1: Data Structure

Posted on October 1, 2018 (updated October 9, 2018)
Hey guys,

During my studies I had to analyze the NoSQL database Cassandra as a possible replacement for a regular relational database.
During that research I dove really deep into the architecture and the data model of Cassandra, and I figured that someone might profit from it, be it for their own evaluation of Cassandra or just out of personal curiosity.
I will split this huge topic into several posts and make a little series out of it. I don’t know yet how many parts the series will contain, but I will try to keep every post as cohesive and understandable as possible. Please forgive me if I have to introduce a couple of terms or concepts that I can’t describe thoroughly in this post. Don’t worry, I will cover them in an upcoming one.

What is Cassandra?

Cassandra is an open source, wide-column NoSQL database whose data model is based on Bigtable by Google and whose distributed architecture is based on Dynamo by Amazon. It was originally developed at Facebook; later it became an Apache project and is now one of Apache’s top-level projects. Cassandra is built around the idea of a decentralized, distributed system without a single point of failure and is designed for high data throughput and high availability.

Cassandra’s Data Structure

I decided to begin my series with Cassandra’s data structure because it is a good introduction to the general ideas behind Cassandra and a good foundation for future posts about the Cassandra Query Language and Cassandra’s distributed nature.

I will try to give you an overview of how data is stored in Cassandra and show you some similarities to and differences from a relational database, so let’s get right to it.

Columns, Rows and Tables

The basic component in Cassandra’s data structure is the column, which classically consists of a key/value pair: the column name and its value. Individual columns are combined into a row, which is uniquely identified by a primary key. A row therefore consists of one or more columns plus the primary key, which can itself consist of one or more columns. To group individual rows describing the same kind of entity into a logical unit, Cassandra defines tables, which are containers for similar data in row format, comparable to relations in relational databases.

Figure: the row data structure in Cassandra

However, there is a remarkable difference to tables in relational databases. If individual columns of a row are not set when writing to the database, Cassandra does not store a null placeholder for them; the column is simply not stored for that row. This saves storage space, and it gives the data model of a table similarities to a multidimensional array or a nested map.

Figure: table consisting of skinny rows
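
The nested-map picture can be made concrete with a few lines of Python. This is only a mental model and not how Cassandra lays out data on disk; the table `users`, its column names, and the helper `write_row` are made up for illustration.

```python
# Mental model only: a table as a map from primary key to a map of columns.
# The table name, column names and helper function are hypothetical.
users = {}


def write_row(table, primary_key, **columns):
    """Store only the columns that were actually supplied."""
    row = table.setdefault(primary_key, {})
    row.update(columns)


write_row(users, "alice", email="alice@example.com", city="Stuttgart")
write_row(users, "bob", email="bob@example.com")  # no city given

# "bob" has no "city" entry at all, rather than a stored placeholder.
print(users["bob"])            # {'email': 'bob@example.com'}
print("city" in users["bob"])  # False
```

Each row only carries the columns it was given, which is the storage optimization described above.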

Skinny and Wide Rows

Another special feature of tables in Cassandra is the distinction between skinny and wide rows. So far I have only described skinny rows: they do not have a composite primary key with clustering columns, and their partitions hold few entries, in most cases only one entry per partition.

You can think of a partition as an isolated storage unit within Cassandra: all rows that share the same partition key value are stored together. During a write or read operation the partition key, which for skinny rows is the whole primary key, gets hashed. The resulting hash value, called a token, determines where the partition is stored, because every node in the cluster is responsible for certain ranges of token values. I will dedicate a whole blog post to the underlying storage engine of Cassandra, so this little explanation has to suffice for now.
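
The hash-and-range lookup can be sketched in Python as follows. Everything here is a simplifying assumption: the four range owners (called nodes in the sketch), the tiny value range of 0 to 99, and MD5 as the hash function are stand-ins chosen so the example runs with the standard library, and they are not what Cassandra itself uses.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns the range of hash values ending at its token.
# Real clusters use a far larger value space and usually many ranges per node.
RING = [(25, "node-a"), (50, "node-b"), (75, "node-c"), (100, "node-d")]
TOKENS = [token for token, _ in RING]


def token_for(key: str) -> int:
    """Map a key to a value in 0..99 (MD5 is a stand-in hash for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def node_for(key: str) -> str:
    """Find the first node whose token is >= the key's hash value."""
    index = bisect.bisect_left(TOKENS, token_for(key))
    return RING[index % len(RING)][1]


for key in ("sensor-1", "sensor-2", "sensor-3"):
    print(key, token_for(key), node_for(key))
```

The same key always hashes to the same value, so reads find the data in the place where the write put it.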

Wide rows typically have a significant number of entries per partition. They are identified by a composite primary key, consisting of a partition key and one or more clustering columns.

Figure: table consisting of wide rows

When using wide rows you have to pay attention to the hard limit of two billion cells (rows multiplied by columns) in a partition. Once this limit is reached, no more values can be stored in the partition, and that can happen quite fast when storing the measured values of a sensor.
The partition key can consist of one or more columns, just like the primary key. To stay with the sensor example, it therefore makes sense to build the partition key from several criteria. If you partition by sensor_id alone, the partition will sooner or later exceed the limit of two billion cells, depending on the volume of incoming measurements. If you instead combine the sensor_id with the date of the measurement, the data is written to a new partition every day. Of course you can make this coarser or finer as you wish (hourly, daily, weekly, monthly).
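
As a rough illustration of this bucketing idea, the following Python sketch derives a composite partition key from a sensor id and a timestamp. The function names, the granularity labels, and the string format of the buckets are my own assumptions for this example and not part of Cassandra.

```python
from datetime import datetime


def time_bucket(measured_at: datetime, granularity: str = "daily") -> str:
    """Reduce a timestamp to the bucket it falls into (hypothetical formats)."""
    if granularity == "hourly":
        return measured_at.strftime("%Y-%m-%dT%H")
    if granularity == "daily":
        return measured_at.strftime("%Y-%m-%d")
    if granularity == "weekly":
        year, week, _ = measured_at.isocalendar()
        return f"{year}-W{week:02d}"
    if granularity == "monthly":
        return measured_at.strftime("%Y-%m")
    raise ValueError(f"unknown granularity: {granularity}")


def partition_key(sensor_id: str, measured_at: datetime, granularity: str = "daily"):
    """Composite partition key: the sensor id plus the time bucket."""
    return (sensor_id, time_bucket(measured_at, granularity))


late = datetime(2018, 10, 1, 23, 59)
early = datetime(2018, 10, 2, 0, 1)
print(partition_key("sensor-42", late))   # ('sensor-42', '2018-10-01')
print(partition_key("sensor-42", early))  # ('sensor-42', '2018-10-02')
```

Two measurements taken two minutes apart around midnight end up in different partitions, while a monthly granularity would keep them together. The finer the bucket, the fewer entries each partition has to hold.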

Clustering columns are needed to sort the data within a partition. A primary key without additional clustering columns is at the same time the partition key. Several tables are collected into a keyspace, which corresponds to a database in relational databases.
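
To show how a partition key and a clustering column work together, here is a small Python model of a wide-row table. It is a sketch under my own assumptions: the table `readings`, the helper `insert`, and the use of a time string as clustering key are hypothetical, and a sorted Python list merely stands in for the sorted storage inside a partition.

```python
import bisect
from collections import defaultdict

# Mental model: partition key -> list of (clustering key, columns), kept sorted.
# The table layout (sensor id, day, time of measurement, value) is hypothetical.
readings = defaultdict(list)


def insert(table, partition_key, clustering_key, **columns):
    """Place the row at its sorted position inside its partition."""
    rows = table[partition_key]
    keys = [key for key, _ in rows]
    index = bisect.bisect_left(keys, clustering_key)
    if index < len(rows) and rows[index][0] == clustering_key:
        rows[index][1].update(columns)  # same primary key: update the row
    else:
        rows.insert(index, (clustering_key, dict(columns)))


partition = ("sensor-42", "2018-10-01")
insert(readings, partition, "12:00:05", value=21.7)
insert(readings, partition, "12:00:01", value=21.4)
insert(readings, partition, "12:00:03", value=21.5)

print([key for key, _ in readings[partition]])
# ['12:00:01', '12:00:03', '12:00:05']
```

The rows arrive out of order but are kept sorted by the clustering key within their partition, and writing the same partition key and clustering key again changes the existing row instead of adding a new one.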

Summary

To summarize, the basic data structures are

  • the column, consisting of a key/value pair,
  • the row, which is a container for related columns, identified by a primary key,
  • the table, which is a container for rows and
  • the keyspace, which is a container for tables.

I hope I was able to give you a rough overview of the data structure Cassandra uses. The next post in this series will be about the Cassandra Query Language (CQL), in which I will give you some more concrete examples of how the data structure affects data manipulation. Cheers, Leon

