Setting The Stage For The Next Chapter Of The Cassandra Database

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means?

Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves.

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey. And today, I'm interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools. So, Ben, can you start by introducing yourself?

Thanks, Tobias. Yeah. My name is Ben Bromhead. I am the cofounder and CTO of Instaclustr. At Instaclustr, we essentially provide managed Apache Cassandra as a service across all the major cloud providers, plus some other really cool data technologies like Kafka and Redis and Elasticsearch. We've been in the Cassandra community for quite some time now, going on eight or nine years. So, yeah, super excited to talk about what's happening in Apache Cassandra today.

And do you remember how you first got involved in data management?

Yes. I do, actually. A bit of a convoluted path. My career in technology actually started off in IT security. So I was working at an IT security consulting company on the assurance side and the penetration testing side of things.

On the assurance side of things, we spent a lot of time working with different technologies used in large organizations, in particular government, doing things like protecting Internet gateways and websites and providing assurance around the way that they're built and configured. And quite often, they ended up having different backing data stores. So I got a lot of experience working with things like Oracle and MySQL to set up public key infrastructure products.

But I started to notice a bit of a turn when we started to see this next generation of gateway monitoring tools and capabilities coming out, different projects and different experimentation, and a lot of activity in the open source space where you'd be choosing these backing data stores. I worked with a company who started leveraging an intelligence product based on Apache Cassandra, and that's kind of the first time I heard about it.

But it wasn't until a little bit later on, when I kind of got sick of breaking things on the security side and wanted to go build things, that one of the things we started building was essentially a data marketplace. I think a lot of engineers have gone out and tried to build something like that, and we gave it a go too. One of the requirements we found that we had was the ability to store and replicate data across different regions and geographies. Apache Cassandra was an awesome fit for that. Right? So that was the first time I actually used it in anger and got a chance to play with it.

Unfortunately, at that time, there was a lot of work you had to do to get it set up. Right? It's a lot different from managing just a single MySQL database. So we spent a lot of time on the tooling and capability to go and deploy this, to go and run this. And for a startup, that's just an absolute time sink. Right? You wanna be working on whatever it is you're doing, not fighting your database.

But we fought the database, and we finally got there. And at one point, we decided, hey, maybe some other people would be interested in some of the stuff that we've done. So we threw together a bit of a web interface and credit card billing, and we turned it into a managed service. Three months later, we had, like, 10 people in production, and we thought, oh, we should do a good job of this. And that was kind of how Instaclustr got started. But that was really my first experience with Apache Cassandra: I was using it to solve this particular edge, and then realizing, hey, it's actually a really good fit for other people, seeing how other people had challenges in running it, and making it easier to access Apache Cassandra.

It's funny the number of times I've had a conversation with somebody of, oh, how did you get involved in this? And, oh, I was actually trying to solve a completely different problem, which I still didn't actually get to. But on the way, I found this other thing that's really fun.

It's so true. And, I mean, when I recount that story, it seems very clear. But I'll definitely say there was a good year where we were kinda working on both things at the same time and weren't too sure which one would take off or go in one direction. But with Instaclustr and providing that Apache Cassandra capability, people just kept on coming, and we just kept adding to it. We got a year in, and we were like, hey, this is it. We gotta go. And off the back of that, Instaclustr has now jumped to, I think, like, 200 employees or something. So it kinda feels like this little thing that just got out of hand by us trying to, you know, make it a bit easier to use a database. So here we are.

And so for anybody who isn't already familiar with the Cassandra project and some of the capabilities that it brings and some of the use cases that it's aimed at, can you give a bit of an overview about what the project is, some of the story behind how it came to be, and its overall role in the open source data ecosystem?

Yeah. Sure. So the high level description: Apache Cassandra is an open source NoSQL distributed database. Right? It's got some really cool architectural things that came out of the back of its origin that make it really, really useful.

It was actually originally built at Facebook. They built it for inbox message search, which I believe was a very specific use case they had in mind. It got used there for some period of time, not a whole lot, but eventually it got open sourced. I wanna say it was in 2008, roughly, that it got open sourced and released. And then about a year later, it eventually made it as a donation into the Apache project.

Interestingly, Facebook didn't use it for that long. They ended up, I believe, moving some of that functionality into HBase, actually. But through the process of open sourcing it and releasing it into the community, a huge ton of other companies were like, oh, hang on, this solves our particular use case. And interestingly enough, not so much around message searching, but a whole bunch of other use cases: really great for time series data at the time, really great for write-heavy workloads, and really great for pretty much most of those use cases where companies had reached the limits of how you could scale with a traditional database, or even a traditional clustered or manually sharded database. Right? And so that's where you started to see it getting picked up by companies like Netflix and Rackspace, Apple and Intuit, and a whole bunch of other really big companies grabbing this database and using it to replace some of those transactional use cases.

The reason why I think a lot of these companies gravitated towards Cassandra is a few reasons. First of all, it has awareness of different locales or different regions, or different fault domains, which is the term I like to use. It treats them as a first-class citizen and understands data placement in regards to those fault domains. So what you can do with Cassandra is say, hey, Cassandra, I want you to replicate this particular table, this particular keyspace, whatever it is, three times: I need at least three copies in this data center, I need at least five copies in that data center, or however you wanna do it. And there's a whole bunch of math that indicates, based on what consistency you want with your data, what replication factor you should have. I won't get into that at the moment. But suffice to say that control, and treating ideas like where your data lives and consistency as first-class citizens, where the developer has to think about it up front when they perform a query, is really, really powerful, because it means the developer is then building applications that are naturally resilient. And that brings us to the really, really cool point, which is that Cassandra lets you actually survive outages.
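The per-data-center replication and the consistency math mentioned above can be sketched in a few lines of Python. This is an illustration only, not something walked through in the conversation: the keyspace name `marketplace`, the data center names `dc_east` and `dc_west`, and the helper functions are all hypothetical. It assumes the commonly used majority-quorum rule and the rule of thumb that a read overlaps the latest write when the number of replicas written plus the number read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with per-data-center replica counts."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def read_sees_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > rf


# Hypothetical data center names and replica counts, for illustration only.
layout = {"dc_east": 3, "dc_west": 5}
statement = create_keyspace_cql("marketplace", layout)
print(statement)
for dc, rf in layout.items():
    q = quorum(rf)
    print(dc, "rf =", rf, "quorum =", q, "replicas that can be down =", rf - q)
```

With three copies, a quorum is two, so one replica in that data center can be down while quorum reads and writes still succeed, which is the kind of trade-off a developer picks up front per query.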

It lets you have a node go down during the middle of the night, and you're gonna keep on chugging along. Right? So that resilience is really, really important with Cassandra. That's one of the core reasons.

The second core reason is linear scalability. With Apache Cassandra, if you wanna scale up the performance side of things, it's just a matter of adding more nodes. And that's because it takes what's called a masterless architecture, which it got from the Dynamo white paper. One of the original engineers who worked on this was actually an Amazon engineer who worked on the original Dynamo system, so it took a lot of those concepts and applied them to Apache Cassandra. What you can do with that is, if you want to double your performance, you just double the number of nodes or servers that are participating within the Cassandra cluster. There are a few reasons why that makes it really simple. Again, that's a whole other hour of explaining. But those are, I think, the two really core, powerful reasons why people started to gravitate towards Apache Cassandra: the high availability and then the high scalability.
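One way to picture why a masterless, Dynamo-style design scales by adding nodes is a toy hash ring: every key hashes to a position, and the node owning the next position on the ring stores it. The Python sketch below is an assumption-laden illustration, not Cassandra's implementation. The `Ring` class, the node names, the use of SHA-256, and the 64 ring positions per node are all made up for the example, and it only measures how data is shared out, not real throughput.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Map a string to a ring position (SHA-256 is used only for illustration)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    """Toy hash ring with no coordinator: ownership is computed from hashes alone."""

    def __init__(self, nodes, positions_per_node=64):
        # Each node owns several positions so its share of keys evens out.
        self._entries = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(positions_per_node)
        )
        self._positions = [position for position, _ in self._entries]

    def owner(self, key: str) -> str:
        # First ring position at or after the key's token, wrapping around.
        index = bisect.bisect_left(self._positions, token(key)) % len(self._entries)
        return self._entries[index][1]


def busiest_share(node_count: int, key_count: int = 10000) -> float:
    """Fraction of all keys held by the most loaded node."""
    ring = Ring([f"node-{n}" for n in range(node_count)])
    load = Counter(ring.owner(f"key-{k}") for k in range(key_count))
    return max(load.values()) / key_count


for count in (4, 8):
    print(count, "nodes: busiest node holds", round(busiest_share(count), 3), "of keys")
```

Doubling the node count roughly halves the share of keys any single node is responsible for, which is the intuition behind "double the nodes, double the performance" in a design where no single master sits in the request path.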

And you mentioned that the way you got involved in the overall project and community was by trying to build out this data marketplace, and you ended up accidentally building a whole successful business around it. And so I'm wondering if you can just give a bit of an overview of your perspective on your role within the community and some of the ways that you're engaging with the project, both in terms of the business side and from your personal engagement with the project?

So I guess, certainly, on the business side, we spend a lot of time in the community. We have committers. We work on the code base. Even our engineers that are not committers spend a lot of time working on bugs and fixes and features and other bits and pieces, and we like to contribute back there. One of our philosophies, since we work in a number of different open source communities (it's not just Apache Cassandra, it's Postgres, it's Apache Kafka, and that kind of thing), is that we don't aim to dominate a particular community or own that particular space, but we're very aware of making sure we contribute back our fair share. And it's not just around code contributions. Right? We spend a lot of time as well talking at conferences, talking to wonderful podcasters such as yourself, running meetups, doing training, doing all those kinds of things that are really important parts of the ecosystem and of that community, getting the good word out there, and helping people be successful as well. So that's very much on the Instaclustr side of things.

Personally, I don't get as much of a chance to contribute back to Apache Cassandra as I used to or as I'd like to, one of the joys of running a business. But when I do get the chance, it's staying involved with how the community presents itself, how they talk about the project, contributing, and also making sure that Instaclustr's resources are deployed within the community in the way that the community works as well. So that's more of the role that I find myself in on a day to day basis. I have submitted a few patches, but I think I could count them on one or two hands, to be honest.

That's the joy of being a CTO: ostensibly, it's a technical role. It's really not.

And so that brings us to the most notable piece of news coming out of the Cassandra project recently, which is the major release of version 4.

And when I was looking through and preparing for the show, I noticed that the prior release of version 3 was all the way back in 2015, so it's gone six years between major releases. And I'm wondering if you can give a bit of an overview about what is notable about the version 4 release, both in terms of the technical aspects and some of the reasons why it went so long between releases, whether that's because there wasn't really a major need for it or just the overall engagement or complexity of being able to be confident in that. And I know that one of the aspects of it is that it's a database, so it's a very core piece of everyone's infrastructure, and you need to make sure that it is 100% reliable and is not going to break or lose data.

Yeah. No. It's so true. So the 3.x branch, as we call it in the community, was a bit of an interesting experiment, and it was called the tick-tock release cycle. Right? So even releases would be new features, and odd releases would be stabilizing, with a goal of doing a release, I think it was every month. It was five years ago. I can't remember.

What was interesting with that particular experiment is I think it did bring kind of the velocity that previously the Cassandra community had been a little bit concerned about.

The problem, of course, was that quality, I think, kinda suffered. And so it was some time before people got confident. I think 3.7 was when people were game enough to start to deploy some of this capability into production.

And then, of course, you start adding new features on top of that again, and it was about 3.11 that it really started to stabilize. And even then, we're still up to a significant minor patch number with that.

So that was the tick-tock, and it's named after the Intel release process, not the video-based social media site. That was the experiment. Right? We got to 3.11, and I think that just ran out of puff a little bit, and there wasn't a huge appetite to continue it. So that was pulled, and then we were like, hey, let's work on 4.0. And so the community had a bit of a discussion around what is important as part of 4.0, and the thing that came up was stability. The Cassandra community traditionally had a great reputation around releasing a dot-zero release that you can use in production, and so a lot of the big contributors to the project all came together and said, hey, look, we'll only release 4.0 when we can all say we've run this in production ourselves. So we just moved the bar from here to way up here all of a sudden. And some of those contributors were still on, like, the 2.0 branch as well, because they weren't convinced about the stability of the 3.x branch. So there was a lot of that going on.

In the meantime, for a number of the other commercial players in the community, some of their priorities changed a little bit. One of these contributors employed a lot of committers, so people that had permissions to commit patches and do all that kind of stuff, and who also held, I think, a lot of the know-how in the community around certain subsystems. They had a bit of a change of priorities, which meant that not as many of those committers had paid time to go work on the project. And so I think losing what I'd call that middle bar of experienced committers also set that release cadence back a little bit.

Having said that, that contributor over the last few years has had a bit of a change of heart and done a bit of a 180 on that previous perspective. We started to see more of those folks being able to spend more time on the project, and it's really, really awesome to see them coming back into the community. We're super excited to welcome them back.

So I think that confluence of factors caused what I'd call five years of trying to get this out the door. But certainly, I think the quality gate was a lot higher. The requirements on testing were a lot higher. We've gone from a project that I think had never had a green run on its CI system, due to flaky tests and other bits and pieces, to one where everything now passes beautifully and reliably.

So there's been a huge investment from the community in getting that, and I think the stability of 4.0 really shows it. I mean, look, it's software. There are still bugs. Right? But there hasn't been anything quite as debilitating or show-stopping as what we've seen in the past. So I'm super proud of the community for the work that they've done, how hard they've worked on it, and the perseverance as well. Five years is a long time to maintain that confidence. It is a long time, and you watch users going, hey, when's the next cool version coming out? When's this new feature coming out? So keeping up that confidence and keeping everyone patient with the process has been a bit of a task as well. But we got there, and I think we're all pretty excited about seeing how people use it. We're pretty excited about 4.1, and I think we're gonna get into a bit of a better cadence, and it's gonna be a little bit easier to release some of this stuff now that we've cleared some of those quality gates. So, yeah, super excited to see the progress as we go forward with this.

As the saying goes, the only software without bugs is the kind that nobody uses.

Well, that's it. Right?

To the point of the quality gates and the building of confidence in the release cycles and maintaining stability, I'm wondering if you can talk to some of the supporting tooling and some of the overall development approach and architectural aspects of being able to build in that stability, build in that level of confidence, and be able to say at that 4.0 point that, yes, this is ready to go. You can put this in production. I will stake my reputation on it.

In terms of some of the actual things that they've done, obviously, one of the ones I mentioned was just getting all those tests passing and the existing stuff all green. But what's been really interesting is we've brought another level of approaches or theories when it comes to testing. One of the ones that has paid off huge dividends has been property-based testing, which got introduced into the project. That's using libraries like QuickTheories, which I think is one of the popular ones out there.

Really, what it is is checking that certain properties hold true across a range of possible inputs. It's a little bit like fuzz testing or fuzzing an interface, but it's actually defining the ranges in which we operate, the bounds that we're going to explicitly check, and leveraging libraries that understand those edge cases: when we're in bounds, out of bounds, on the bounds, all that kind of thing.

It is all randomly generated as well, but there's some great support from the libraries to make that replayable. So when we catch a bug through a random run, we can grab that seed, recreate that data, jump in there, and go have a look at it. So property-based testing has been absolutely huge.
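The idea described above (state a property, generate random inputs within declared bounds, and keep the seed so a failure can be replayed) can be sketched with nothing but Python's standard library. This is a hand-rolled illustration, not QuickTheories and not code from the Cassandra test suite: the run-length encoder under test, the `generate` and `check_round_trip` helpers, and the bounds are all invented for the example.

```python
import random


def encode(values):
    """Run-length encode a list into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def decode(runs):
    """Invert encode()."""
    return [value for value, count in runs for _ in range(count)]


def generate(rng, max_len=50):
    """Random input within declared bounds; lengths 0 and max_len sit on the bounds."""
    length = rng.choice([0, max_len, rng.randint(0, max_len)])
    # A small value range makes long runs, the interesting edge case, likely.
    return [rng.randint(0, 3) for _ in range(length)]


def check_round_trip(seed, cases=200):
    """Property: decode(encode(x)) == x. Returns a failing input, or None."""
    rng = random.Random(seed)  # same seed, same sequence of inputs
    for _ in range(cases):
        sample = generate(rng)
        if decode(encode(sample)) != sample:
            return sample
    return None


seed = random.randrange(2**32)
failure = check_round_trip(seed)
print("seed:", seed, "failing input:", failure)
```

If a run ever reports a failing input, calling `check_round_trip` again with the printed seed regenerates exactly the same inputs, which is the replay workflow described in the conversation.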

I know that there are plans for things further down the line, like simulating certain things. We've had a number of people who previously worked on databases like Riak and FoundationDB come across, and those two databases had a really great approach to testing and simulation, that kind of thing. So we're starting to see all the great lessons that those communities have learned being applied to Apache Cassandra, and we're seeing a huge improvement in the number of bugs that we actually catch. It's really good when you take a new testing approach, and you're like, oh, look at all these bugs we've uncovered, which has been really, really cool.

I think we're also starting to see that as we make some of these components within Apache Cassandra more pluggable, it's gonna make those more testable as well, by exposing some of those interfaces. So that's some of the exciting work around the testing side of things. I do certainly apologize to the community members if I missed anyone's pet ticket. I owe you a mea culpa, and I'll plug it in the next podcast.

Another interesting aspect of the ecosystem around Cassandra is that there are projects such as ScyllaDB that are leveraging the popularity of the interface of Cassandra and its promise of being scalable and fault tolerant, but completely reimplementing it from the ground up to cut down on the overall hardware requirements and, ostensibly, provide higher performance guarantees. And I'm wondering if you can talk to some of the ongoing utility or benefits of projects such as ScyllaDB or anything else that's operating in the space, particularly in light of Cassandra 4 coming out, and just what you see as the overall potential in the ecosystem for supporting multiple implementations of the Cassandra interface versus specifically the Cassandra project?

Yeah. So I'm always in, like, a dual mind about this. I think it's really, really exciting to see projects take some of those Cassandra concepts and adopt CQL and the Cassandra native protocol. I think it speaks volumes about how excited people are about Cassandra and what a useful spec it is, a well-defined spec as well, because people can go and implement it easily. And it's really great to see a broader ecosystem evolve around that. It's been really fascinating to follow some of the ideas that people at Scylla have taken and applied to that, but it's also been really interesting to see how other people have done that as well. I know there are other things, like, I think, Yugabyte. You look at some of those cloud native services, like Amazon Keyspaces. I think Azure Cosmos DB has a Cassandra-compatible API. There are all these things out there, and they all do it to some varying degree of success. A lot of them work really well, some don't work quite as well, but it's a great melting pot of ideas that we get to see. I really enjoy that aspect of it.

I think where I kind of flip back on that is when it comes to the Apache Cassandra community. One of the really great things about it being an Apache project is that the Apache Software Foundation owns the IP and owns the trademark. It is a nonprofit IP holding vehicle. So it creates this wonderful space where different competing companies can come together and work on some of this stuff. It's not like people have to go work on a competitor's product or something. This is a shared common good. This is true open source software. It lives at a foundation. It's really, really wonderful. And I wanna caveat that: you can have really great open source software even if it doesn't live in a foundation, but a foundation makes it easier for commercial players to contribute. So I always get a little bit sad that maybe some of these ideas didn't get explored within the Apache Cassandra community.

Having said that, the Apache Cassandra community is the biggest community of things that implement CQL. It is the original. It is the reference spec. It is the thing that does it. And what you'll start to see is that the successful ideas from these other projects will start to filter into the Apache Cassandra project. A great example of this: one of the things that ScyllaDB does really nicely is it takes a different internal architecture. It leverages thread per core, which is way better for more modern machines, where you potentially have more hardware threads than you would have, say, 10 years ago, and can take better advantage of that.

There are a number of tickets out there in the Apache Cassandra Jira about exploring how we implement thread per core, or move towards an architecture that is more like that. Cassandra 4.0 doesn't get to thread per core. It's still very much the SEDA architecture that Cassandra uses. But on the networking side of things, we've seen a great reduction in, say, the number of thread pools that we use to do network communications, the implementation of the Netty library, and other bits and pieces, which is actually seeing some really great performance improvements. So the great thing about open source software is that even though they're not necessarily doing these things in the community, we can look around and say, oh, well, hey, there are a lot of different ideas and concepts. So it is really exciting to see that taken up within Apache Cassandra.

So that's my very convoluted answer to that particular question. I'm excited to see what the future holds and where we can learn from things. At an intellectual level, I'm very excited to see how those other communities go and some of the cool ideas that they implement. I think for most of them that are doing it out in the open, that's a great chance for us to learn from each other.

And then switching gears a bit, talking about Cassandra specifically and some of the ways that it's being used: it is primarily used as a system of record because of the durability guarantees and because it is able to be fault tolerant at both the cluster scale and the geographic scale. And I'm wondering if you can talk to some of the additional systems that people will often pair with Cassandra to take advantage of its durability guarantees, some of the overall system architectures that you commonly see Cassandra involved in, and some of the interesting elements of what the project has enabled from a data platform and data systems perspective?

Yeah. Definitely. So the things that we commonly see deployed side by side with Cassandra tend to be other, what I'd call, data layer or data infrastructure technologies that also tick some of the same boxes that Cassandra does: high durability, great scalability. So we see Apache Kafka get deployed side by side with Cassandra because it is a highly scalable, durable message bus. You see this move towards microservices, CQRS, and having to do data hydration, and Kafka is a great choice for that. We also see things like OpenSearch or Elasticsearch being deployed side by side with Cassandra to enrich some of that indexing capability. And we see things like Redis or Memcached or other bits and pieces like that as well, to potentially help out with some of the read workload side of things.

So that's what we see deployed side by side with Apache Cassandra. There is a huge ecosystem of other things that do get deployed side by side. I'm just painting a very small picture. There's a much broader landscape out there. And I think it goes towards the fact that we live in this land of polyglot persistence, where people essentially choose the right tool for the right job. You rarely come to a company now where they say, oh, we're an Oracle shop, or we're an all-MySQL shop. Usually you come into a company and there'll be, like, two or three main databases that are the source of record.

In terms of what are the challenges, I think that kind

of

In terms of what are the challenges, I think, that kind of crop up: first of all, you know, there's this shift to microservices, and you see Cassandra get deployed there. I don't think Cassandra is, like, you know, what I would call a core component of a microservices approach. It's just part of, you know, some of the tools that a service will choose if they've got a certain set of requirements around availability. Right? But when it comes to microservices, when it comes to, you know, some of that scale up stuff, and polyglot persistence, you end up in this situation where, you know, in order for businesses to understand what's going on, you know, from a business level set of metrics, they need to get data from all these different systems, you know, into one place, right, so the, you know, business analysts, leadership, product managers can kinda see what's going on. Right? And so we see a lot of people thinking about how do we get this data from our transactional data stores into, you know, our data warehouse, into our analytics capability. Right? So

I think that's one of the challenges that we see a lot of people grappling with. There's some really great projects out there that help with that, and we work very closely with some of these. The project Debezium is amazing. I think it does far more heavy lifting, and it should be a far more popular project, than I think a lot of people, you know, kind of understand. Debezium essentially allows you to, you know, grab that wonderful CDC data from a lot of databases and dump it into a Kafka stream. Right? Sounds very simple, but it's so important, right, to being able to get your transactional data, kind of core usage, subtle, all that fun stuff, into, you know, your data warehouse or wherever it needs to go.
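
To make the change data capture idea concrete, here is a minimal Python sketch that replays change events into a downstream copy of a table. It assumes events shaped like Debezium's envelope, with an `op` code (`c` for create, `u` for update, `d` for delete, `r` for a snapshot read) and `before`/`after` row images. The `apply_change` helper, the `id` key column, and the sample rows are hypothetical, and a real pipeline would read these events from a Kafka topic rather than from a list.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one change event to an in-memory copy of a table.

    Assumes a Debezium-style envelope: "op" plus "before"/"after" row images.
    """
    op = event["op"]
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row image.
        table[after[key]] = dict(after)
    elif op == "d":
        # A delete carries only the old row image.
        table.pop(before[key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical events, in the order they would arrive on the stream.
events: List[Dict[str, Any]] = [
    {"op": "c", "before": None, "after": {"id": 1, "balance": 100}},
    {"op": "c", "before": None, "after": {"id": 2, "balance": 50}},
    {"op": "u", "before": {"id": 1, "balance": 100}, "after": {"id": 1, "balance": 70}},
    {"op": "d", "before": {"id": 2, "balance": 50}, "after": None},
]

warehouse_copy: Dict[Any, Row] = {}
for event in events:
    apply_change(warehouse_copy, event)

print(warehouse_copy)
```

Replaying the events in order is what keeps the downstream copy in step with the transactional store, which is why an ordered stream is a natural transport for this kind of data.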

So, you know, doing those integrations, making sure that you kinda know what's going on and that things aren't getting too compartmentalized or siloed is super important, and I think that's kinda the next set of challenges that a lot of companies are gonna face given their current data infrastructure.

Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator, like Apache Airflow, and centralizing your end to end metadata, Databand lets you identify data quality issues and their root causes from a single dashboard. With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30 day trial and to take control of your data quality.

As a brief aside, I'm wondering if you can talk a little bit about some of the data modeling characteristics of the Cassandra database and some of the ways that you need to think about structuring the data and working with the data that it stores, and the juxtaposition of how somebody who's familiar with a relational database might need to transform their thinking to be able to take full advantage of how Cassandra is able to store and partition and distribute data.

Under the hood, Cassandra leverages what's called the Google Bigtable data model and storage engine model. And to be fair, it has changed a little bit from, like, the original papers and the way that it kinda works.

But in terms of what a developer will see when they come and start working with Cassandra, the query language and the data definition language is very, very similar to SQL. Right? And in fact, deliberately so, right, to make it very easy for developers to come from one environment to the other. Unfortunately, once you become an experienced person, it becomes a little bit maddening because you're like, oh, I can do this with SQL, but I can't do this with CQL even though the syntax is very, very similar. But developers will find themselves in a situation where, you know, they're working with very familiar objects, you know, tables, insert statements, update statements, select statements.

But in terms of the actual data model itself, Cassandra kinda has a number of things that you need to think about. Right? So first of all, it has a concept of a primary key, very similar to SQL land, and that is like the unique identifier to look up a given set of columns.

Within the primary key for Cassandra, you've got 2 components. One is called a partition key, and this is just a set of columns that determine which node your data lives on. Right? So it's the thing that gets hashed and gets assigned, you know, within the cluster space.
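
As a rough illustration of how hashing a partition key can decide which node owns a row, here is a minimal Python sketch of a token ring. It is a simplification under stated assumptions: Cassandra's default partitioner uses Murmur3 rather than the MD5 stand-in used here, real clusters usually give each node many tokens rather than one, and the `token_for`, `build_ring`, and `owner` helpers and the node names are hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple

Ring = List[Tuple[int, str]]  # (token, node name), sorted by token


def token_for(value: str) -> int:
    """Hash a string to a 64-bit token. MD5 is only a stand-in hash here."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes: List[str]) -> Ring:
    """Assign each node a single token derived from its name."""
    return sorted((token_for(name), name) for name in nodes)


def owner(ring: Ring, partition_key: str) -> str:
    """Return the first node whose token is >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]


ring = build_ring(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3", "user:4"]:
    print(key, "->", owner(ring, key))
```

Because the owner depends only on the hash of the partition key, any client or node can work out where a row lives without consulting a central directory.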

Cassandra provides a certain number of guarantees around when you do certain operations within a partition. The next part of the primary key is called the clustering key. The clustering key is essentially the way that Cassandra will lay out that data on disk. Right? So let's say for your clustering key, you set it as a timestamp. Right? So all those timestamps will get laid out on disk in order. This is really great because then Cassandra allows you to do what's called a range slice over that particular column. Right? So if you know your partition key and then you know a range of your clustering columns, you can do a query across those. And then you've just got regular columns.

Right? So that's kind of a very high level view of the Cassandra data model. The best way that I like to think about it is Cassandra is essentially a map of sorted maps. Right? And it's kind of had that projected onto more of a table-like representation.
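
The "map of sorted maps" picture can be sketched in a few lines of Python. This is only an in-memory analogy, not how Cassandra stores data: the outer dictionary is keyed by partition key, each partition keeps its rows sorted by clustering key, and a range slice is a binary search over that sorted list. The `SortedMapOfMaps` class and the sensor data are hypothetical.

```python
import bisect
from typing import Any, Dict, List, Tuple

RowEntry = Tuple[Any, Dict[str, Any]]  # (clustering key, regular columns)


class SortedMapOfMaps:
    """Partition key -> rows kept sorted by clustering key."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[RowEntry]] = {}

    def insert(self, partition_key: str, clustering_key: Any, columns: Dict[str, Any]) -> None:
        rows = self._partitions.setdefault(partition_key, [])
        keys = [key for key, _ in rows]
        index = bisect.bisect_left(keys, clustering_key)
        if index < len(rows) and rows[index][0] == clustering_key:
            # Same primary key: merge columns, like an upsert.
            rows[index] = (clustering_key, {**rows[index][1], **columns})
        else:
            rows.insert(index, (clustering_key, dict(columns)))

    def range_slice(self, partition_key: str, start: Any, end: Any) -> List[RowEntry]:
        """Return rows in one partition with start <= clustering key <= end."""
        rows = self._partitions.get(partition_key, [])
        keys = [key for key, _ in rows]
        low = bisect.bisect_left(keys, start)
        high = bisect.bisect_right(keys, end)
        return rows[low:high]


store = SortedMapOfMaps()
# Timestamps arrive out of order but are kept sorted within the partition.
store.insert("sensor-1", 30, {"temp": 21.5})
store.insert("sensor-1", 10, {"temp": 20.0})
store.insert("sensor-1", 20, {"temp": 20.7})
store.insert("sensor-2", 15, {"temp": 18.0})

print(store.range_slice("sensor-1", 10, 20))
```

The slice only ever touches one partition, which mirrors the rule that you need to know the partition key before you can range over the clustering columns.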

Right? So that's the Cassandra data model at kind of its core. A lot of things build on top of that. Cassandra brings great features as well to the table. It brings things like counters, right, to do distributed counting of things. It's not, like, exactly accurate, but it's really good for counting things like the number of likes on a tweet, for example. It brings the concept of lightweight transactions, and so using Paxos under the hood, you can provide some safety around when you do certain updates. Right? So you can say, hey, update this row if it doesn't exist or if it has a prior value, and it will do that all nice and safely.
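
The semantics of those conditional writes can be sketched as compare-and-set operations. In the Python sketch below a single in-process lock stands in for the Paxos agreement that Cassandra runs across replicas, so this shows the behavior a caller observes rather than the distributed protocol. The `ConditionalStore` class and its method names are hypothetical; in CQL the corresponding statements are `INSERT ... IF NOT EXISTS` and `UPDATE ... IF column = value`.

```python
import threading
from typing import Any, Dict, Optional


class ConditionalStore:
    """Single-process stand-in for conditional (compare-and-set) writes."""

    def __init__(self) -> None:
        self._rows: Dict[str, Any] = {}
        self._lock = threading.Lock()  # stand-in for agreement between replicas

    def insert_if_not_exists(self, key: str, value: Any) -> bool:
        """Write only if the key is absent; report whether the write applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if_equals(self, key: str, expected: Any, new_value: Any) -> bool:
        """Write only if the current value matches the expected prior value."""
        with self._lock:
            if key not in self._rows or self._rows[key] != expected:
                return False
            self._rows[key] = new_value
            return True

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            return self._rows.get(key)


store = ConditionalStore()
print(store.insert_if_not_exists("username:ben", "account-1"))  # first claim applies
print(store.insert_if_not_exists("username:ben", "account-2"))  # second claim is rejected
print(store.update_if_equals("username:ben", "account-1", "account-3"))  # prior value matches
print(store.update_if_equals("username:ben", "account-1", "account-4"))  # stale prior value
```

The caller always learns whether its write applied, which is what makes this pattern safe for things like claiming a unique username.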

So there are some nice things on top of that. You know, it's also got great support for different rich data types, like, you know, JSON and sets and maps and other bits and pieces there. So usually, when you come to it as a developer, all this is gonna be very familiar to you. It's gonna be easy to start working with it.

It is definitely a little bit more of a restricted environment than, say, an SQL environment. You know, you don't have joins or foreign keys or anything like that, and we do that to ensure that we can scale out really nicely. But for anyone that's kind of run an SQL system at scale and has had to, you know, maintain a really, really big table, you often end up having to drop those foreign key constraints anyway for performance reasons. Right? So by the time you get to Cassandra, you've already kind of given up a lot of those anyway.

So, yeah, so that's kind of the Apache Cassandra data model at a high level. You know, if you're kind of interested, there's heaps of great books and resources out there. There's 2 day training courses and all sorts of things that you can kind of jump in on that will make you really understand this stuff. But usually, if you're coming from an SQL world, you'll be able to find some way of doing things. The one thing I think I kinda missed in that description is you certainly need to denormalize your data when you are coming from that SQL world. Right? So you will find yourself storing the same data in multiple places, in multiple tables. Cassandra is definitely one of those things where you need to architect the way that the data is laid out based on how you're gonna read it back from the database. Right? So that might mean a bit of a double up of data. But, you know, storage is cheap nowadays. Right? So it kinda makes sense to do that.
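
A small Python sketch of that query-first, denormalized layout: the same order is written twice, once into a structure keyed for "orders by customer" reads and once into one keyed for "orders by product" reads. The table names, the `record_order` helper, and the sample orders are all hypothetical, and plain dictionaries stand in for Cassandra tables.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List

Row = Dict[str, Any]

# One "table" per read pattern, each keyed by the column that query filters on.
orders_by_customer: DefaultDict[str, List[Row]] = defaultdict(list)
orders_by_product: DefaultDict[str, List[Row]] = defaultdict(list)


def record_order(order_id: int, customer: str, product: str, quantity: int) -> None:
    """Write the same order to every table that needs to serve it."""
    row = {"order_id": order_id, "customer": customer, "product": product, "quantity": quantity}
    orders_by_customer[customer].append(dict(row))
    orders_by_product[product].append(dict(row))


record_order(1, "alice", "widget", 2)
record_order(2, "bob", "widget", 1)
record_order(3, "alice", "gadget", 5)

# Each read is a single lookup by key, with no join required.
print([row["order_id"] for row in orders_by_customer["alice"]])
print([row["order_id"] for row in orders_by_product["widget"]])
```

The cost is the doubled storage and the extra write; the benefit is that each read goes straight to one partition.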

And then switching to the operations aspect of running Cassandra, one of the things that is notable about the way it was architected and designed, you know, when it was first created, what, 15 years ago almost maybe, is that it is kind of prescient in terms of the ways that it's able to manage executing in cloud environments, and particularly with the so called cloud native design patterns that are growing and being established right now, particularly with things like Kubernetes. And it seems that Cassandra is very well suited to be able to slot into a lot of those ways of thinking about just deploying and managing infrastructure. And I'm wondering if you can just talk to some of the opportunities that you see for the Cassandra project, particularly as it's, you know, coming in with this new version and as these cloud native patterns are being established and gaining more prominence.

It's kind of interesting, you know, Cassandra was built and developed in a world where, you know, cloud wasn't around. Right? Or was very, very nascent. I'm not gonna look too closely at those dates. Right? But I think what lends Cassandra to being one of the original cloud native databases is the fact that it came out of the box with a great understanding of where it's deployed. Right? So it understands what fault domain it's deployed in. It understands what data center, what region, what rack it's deployed in. Right? And all those concepts map really, really well to the cloud. Right? So you can think about your rack as your availability zone. Right?
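
Here is a minimal Python sketch of why knowing the rack (or availability zone) matters when placing replicas. It walks a ring of nodes and prefers a rack that has not been used yet, so that losing one zone does not take out every copy. This is a simplified illustration in the spirit of rack-aware placement, not Cassandra's actual replication strategy code; the `place_replicas` helper, node names, and zone names are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[str, str]  # (node name, rack or availability zone)


def place_replicas(ring: List[Node], start: int, replication_factor: int) -> List[str]:
    """Walk the ring from `start`, preferring nodes on racks not used yet."""
    if replication_factor > len(ring):
        raise ValueError("replication factor exceeds number of nodes")
    ordered = [ring[(start + offset) % len(ring)] for offset in range(len(ring))]
    chosen: List[Node] = []
    used_racks: Set[str] = set()
    # First pass: take at most one node per rack, in ring order.
    for node in ordered:
        if len(chosen) == replication_factor:
            break
        if node[1] not in used_racks:
            chosen.append(node)
            used_racks.add(node[1])
    # Second pass: if racks ran out, fill up with the nodes skipped earlier.
    for node in ordered:
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)
    return [name for name, _ in chosen]


ring = [
    ("n1", "az-a"), ("n2", "az-a"),
    ("n3", "az-b"), ("n4", "az-b"),
    ("n5", "az-c"), ("n6", "az-c"),
]
print(place_replicas(ring, start=0, replication_factor=3))
```

With three zones and a replication factor of 3, each copy lands in a different zone, so a single zone outage still leaves two replicas reachable.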

Or you could think about, you know, how, say, Kubernetes with its annotations and labels also understands where a given rack is, what a given availability zone is. You know, all those concepts just map really, really nicely together. So the fact that Cassandra has this understanding of network locality and failure domains as a first class citizen, you know, means it very easily steps into that role as a cloud native database. Right?

There's also some really interesting things happening within the project as well. For example, with making the storage engine more pluggable, we're gonna see that really take off with another great concept that kinda came out of the cloud native world, which is the ability to separate storage from compute and memory. Right? And we're gonna start to see more and more of Cassandra being able to support different patterns, different storage classes, but also, you know, scale up pods or instances or VMs or whatever you wanna call them independent of, you know, different storage mediums, right, which is gonna be really, really cool. In the community as well, there's been a ton of effort gone into

making it easier to run Cassandra on, you know, some of those more cloud native platforms. You've obviously had, you know, service providers like Instaclustr doing it for years in the cloud, but you're also starting to see other projects around, say, Kubernetes. Right? So there's a number of different operators for Apache Cassandra. We've written one. I know DataStax have written one. I know Orange, the big telco in France, have written one. There's been a whole bunch of really, really great work around this.

And, you know, internally within the community itself, we're also seeing the community starting to standardize on a community Kubernetes operator as well, which is gonna be really, really exciting to see. You know, we kind of found ourselves all in this situation where there were, like, 3 or 4 quite viable production ready operators, which is not a great user story. Right? You know, we all kind of came to it and, you know, went and developed it by ourselves independently. But we all kind of got together and went, like, hey, look, we need a standardized community one. Right? That's what makes sense. So there's a great community

effort going on. There's a working group or a steering committee that meets, I think, every second week or something at the moment and progresses on that as well, around building out a community operator for Kubernetes. So, yeah, super exciting stuff kinda happening in that cloud native land for sure.

As somebody who has been aware of and sort of passively tracking Cassandra for a number of years now, it's been notable that there was a lot of popularity and attention paid to it maybe 5 or 6 years ago, particularly with Netflix being a major user of it. And I think that maybe it was Heartbleed that came out when they announced that they were able to rebuild and redeploy their entire Cassandra infrastructure over a short period of time without having any outages because of the fact that they were running Cassandra. And then maybe in the past 2 or 3 years, I haven't heard a lot about it. It's been sort of fading. And then now with Cassandra 4, I anticipate that we'll be hearing more about it. But in the past, especially, 2 to 5 years, at the same time, there have been a lot of new database projects and products that have been coming to market. And I'm wondering if you can talk a little bit about your perspective on Cassandra's role and opportunity within the overall space of data management systems, given the fact that there have been so many new entrants to the market that are targeting high scale, high availability, global scale, and sort of global fault tolerance capabilities?

Yeah. Definitely. It's super exciting to see what's happening in some of those communities. You know, you look at things like CockroachDB, Vitess, you know, all those, I guess, what are they calling themselves? NewSQL or, I don't know the moniker for them. But it's super exciting to see what's happening in that space, how they're tackling some of those problems. We kind of go back to, I think, Brewer's theorem around, you know, CAP. Right? You know, consistency, availability, and partition tolerance, and each one of those takes a slightly different approach to, you know, how they do their databases. Right?

So Apache Cassandra prioritizes availability by default, whereas I think some of these other ones are much more around consistency. I think we're gonna see a lot of them, you know, probably sitting side by side, to be honest, where there'll be some use cases where, you know, something like Vitess is the right fit, 100%. Right? And it might sit side by side with Cassandra, where the availability is a higher priority than, say, you know, being able to do some other things that Vitess can do. I think there'll be users of both that maybe, like, leave one and go to the other, you know, just as people who have adopted one because they had one use case, but it wasn't a perfect fit, but the other thing might be more of a perfect fit. I think you're gonna see some of that cross pollination or that transition happening for sure. But, you know, I think at the moment, in terms of what Cassandra brings to the table, it's probably, you know, still for a number of use cases, one of the only real open source choices out there. Right?

But I don't know. I always take a very long term community view around this kind of stuff, which is, you know, it's really great to see other projects and other databases and other data stores explore different ideas and how they approach those, how they tackle it. And then also just seeing what people's experiences are with them. Right? Because, you know, code changes. Right? Projects change. We can add new features. We can remove old features. Right? You know, I think if we see something that's a great fit for the Cassandra community and people, like, talking about it and that kind of thing, it's like, hey, we can always look at, is this something that we wanna adopt? Right? We see someone who's had a great experience with this. You know, maybe this could really also help Cassandra out. You know, we've seen that go the other way. We're seeing lots of these projects learn those lessons from Cassandra, right, as one of the earlier big scale out databases. So I don't know. Like, you know, I'm a big Cassandra fan, but I always just love seeing what other projects do and how they approach it. And, you know, it's really fascinating to see the problems they tackle.

Do you have any sort of predictions or perspectives on the near to medium term future of Cassandra and some of the opportunities that you predict given the new release and the focus on stability and reliability that the past few years have invested in?

So I think we're gonna see performance only get better. Right? You know, I think when it comes to, you know, support for some of the new JVMs, we've seen some great testing and stuff around that. We're gonna see more of that pluggability kind of come in. I think people are gonna pay more attention to some of the stories that are being told about Cassandra. It's funny you mentioned previously that, like, there's been a bit of a lull and, you know, Netflix talking about how easily they replaced a whole bunch of stuff. You know, all those companies that were telling great stories about Apache Cassandra, they still use Apache Cassandra. Right? And

I've done just a very ad hoc exercise where I've talked to different community members who, you know, work at various big companies, about their deployments and that kind of thing. And, you know, I would happily or easily, you know, put a bet down that, in terms of the vast majority of the online population of the world, right, on a daily or weekly basis, they would interact with a service that's backed by Apache Cassandra. Right? You know, you think about it: it's used at Netflix. It's used at Apple. It's used at Spotify. It's used at some of the biggest banking and financial institutions, powering their retail and transactional services. Right? So, you know, even just holding, like, maybe 10 companies in your head and the use cases that Cassandra powers there, right, that's touching and interacting with already more than a billion or a couple of billion people globally. Right? So, you know, I think the death of Cassandra has been reported quite often, but, you know, it's always a little bit premature. Just to butcher that quote for a second there. But I think with 4.0, people are gonna, you know, see that, hey, we're still powering along here. And I think it'll just be a nice chance to tell some of those stories again and tell some new ones. Right? There's been some great use cases kind of popping up over that period of time. Anyhow, I'm just excited to see, you know, people maybe coming to the community after hearing about the 4.0 release and starting to participate or think about deploying Cassandra.

It's already crested the hype cycle and has gone to being boring technology in the best sense of the term.

That's it. That's it. And I mean, like, look, don't get me wrong. Cassandra rode that NoSQL hype wave, what was it, 6 or 7 years ago. Right? And, you know, we're now in just, hey, it's boring technology, man. Right? Like, this is the reliable tool you pull out of your tool bag and, you know, off you go. It's a good place to be for a database. Let's be honest.

Absolutely.

And so in terms of some of the ways that you've seen Cassandra used both in the community and in systems that are powered by your managed service at Instaclustr, what are some of the most interesting or innovative or unexpected applications of Cassandra that you've seen?

That's a really good one. So I know for a fact, I can't name who it is, but I do know that Apache Cassandra powers the source of record for a major banking financial institution when it comes to transactions. And people are like, oh, that's not, like, a weird use case. Hear me out for a second. A lot of people always go, you can't use weakly consistent databases for transactions, for banking transactions, for financial records. Right? You know, this is the stereotypical example of why, like, you know, SQL transactions are great, or ACID-compliant transactions are great. It's because, what happens if you get these 2 different transactions? You have a negative balance and blah blah blah. Well, you know, Cassandra is an AP database. Right? It prioritizes availability and partition tolerance over this stuff. It can power that transaction workload, and the bank's having a really, really great time with it. Right? The other thing I love to educate people on is that banks, by default, were actually eventually consistent. Right? That's why we have things like overdraft fees. It's because at one point, the bank didn't know how much money was coming out of your account. Right? And so they couldn't stop certain things coming out of your account.
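
The overdraft point can be worked through with a tiny Python sketch. Two branches each hold a copy of the same opening balance, each approves a withdrawal based only on what it knows locally, and the bank reconciles later and charges a fee if the account went negative. This is an illustration of eventual consistency in general, not of how any bank or Cassandra itself implements it; the `Replica` class, the `reconcile` helper, and all amounts are hypothetical.

```python
from typing import List


class Replica:
    """One branch's local view of an account."""

    def __init__(self, opening_balance: int) -> None:
        self.known_balance = opening_balance
        self.local_withdrawals: List[int] = []

    def withdraw(self, amount: int) -> bool:
        """Approve the withdrawal if the locally known balance covers it."""
        if amount > self.known_balance:
            return False
        self.known_balance -= amount
        self.local_withdrawals.append(amount)
        return True


def reconcile(opening_balance: int, replicas: List[Replica], overdraft_fee: int) -> int:
    """Merge every branch's withdrawals and apply a fee if the result is negative."""
    total_withdrawn = sum(sum(replica.local_withdrawals) for replica in replicas)
    balance = opening_balance - total_withdrawn
    if balance < 0:
        balance -= overdraft_fee
    return balance


branch_a = Replica(opening_balance=100)
branch_b = Replica(opening_balance=100)

# Both branches stay available and approve, because neither sees the other's write yet.
approved_a = branch_a.withdraw(80)
approved_b = branch_b.withdraw(80)

final_balance = reconcile(100, [branch_a, branch_b], overdraft_fee=25)
print(approved_a, approved_b, final_balance)
```

Both withdrawals succeed because each branch stayed available, and the conflict is resolved after the fact with a business rule rather than prevented up front with a lock.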

So that's kind of my favorite example of, like, here's a really, really boring use case that people just don't expect Cassandra to be used in. Right? There's also some really cool stuff. Like, you know, I know it's being used for storing feature sets when it comes to machine learning. I know it's being used for, you know, powering some great recommendation engines. You know, I know it's being used to power, you know, emergency assistance services and other bits and pieces. Like, again, because Cassandra has done that hype cycle, it's kind of in that stable place. Every use case you can imagine, I'm sure Cassandra has done it. Right? So, you know, that is kind of the exciting thing about working with Apache Cassandra. It has touched most use cases. But, yeah, they're kind of just my main ones, but I do love the fact that it, you know, powers financial transactions. Right? How cool is that?

And then in terms of your experience of building Instaclustr and being engaged with the Cassandra community and working on the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I think this might be true for all open source projects. Right? Which is, you know, it's very much around bringing the community on a journey. Right? So it's one thing to be like, hey, I've got this awesome feature, or we should patch it this way, or we should do something like that. Right? But you gotta bring people on the journey. You gotta build that consensus. So that's really, I think, from my perspective, where I'm at at the moment. I know you're probably looking for an interesting, weird technical answer to this particular question, but, you know, I think the challenging lesson I've had to learn is, you know, it's really about bringing the community on a journey, right, and communicating with other people in the community and interacting with them, and just working together, or having your disagreements but being able to move on and kind of pull together in the same direction. That is really, really huge. I'm really glad it's a lesson that I learned. Right? It's a wonderful lesson that anyone should learn, but I think the Cassandra community was a great place to do that as well. Like, it's a really nice, welcoming community. So yeah. Definitely.

For people who are interested in Cassandra and excited about the reliability and durability guarantees that they're able to get out of it, what are some of the cases where it might be the wrong choice and they're better served by using some other scale out system or just going with Postgres or MySQL as the reliable workhorse?

So I think there's kinda 2 sides to that. Right? If you're building a new service, new project, new startup, whatever that might be, pick the data store that you know, that allows you to get started the quickest and the easiest, and start learning about how people use your service or product or whatever it is. Right? Because sometimes you can fall into this trap of being like, oh, it's gonna be so big, so we gotta choose Cassandra because, you know, we gotta scale out easily, blah blah blah. The answer is actually no. Choose the thing that gets you up and going the quickest and lets you learn lessons about the problem domain that you're working in the quickest, because you're gonna rewrite your data layer 3 or 4 times before you even get to the point where Cassandra starts to be a viable option from a scale perspective. Right? So learn those lessons along the way.

Also, it can be quite, I think, useful to learn about the challenges of scaling out certain other systems before you get to Cassandra, because when you get to Cassandra, you'll be like, why are there these restrictions? Or why can't I do things like joins? Or why can't I do this? But if you've faced those challenges in other systems, you'll be like, oh, I understand now why. Because they're also slow on these systems when I try to do it at scale. Right? So, yeah, pick the thing that you know. Don't pick Cassandra if you don't know it and you wanna move quickly.

Of course, you know, there's definitely greenfield applications out there, particularly, you know, if it's a new service from a really large company where it's like, hey, we know we're gonna have, you know, 5,000,000 people from day one. Right? Well, that choice gets made for you, and you might need to kind of pick up Cassandra with that. So that's the one side of things where I wouldn't use Apache Cassandra. The other side of it is if I'm just storing data from a pure analytics perspective. Right? I would not choose Apache Cassandra as my backing data store for a data warehouse. Definitely not. It's not to say that you can't do analytics in Apache Cassandra. There's some great Spark plugins and connectors and other bits and pieces, but you're just not gonna get the same performance as you would with, you know, say, more of a column oriented data store, right, or a columnar storage format. Right? Like, something like Parquet and, you know, throw all those files on S3 or whatever it might be. I'd definitely steer clear of Apache Cassandra from a pure data warehouse or data analytics perspective. We do see some, like, custom built data warehouse systems where they use Cassandra as either, like, you know, an index or a kind of record keeping piece, but the data itself does not live in Apache Cassandra. That's for sure.

We've already talked a little bit about this, but what are some of the things that you see as being in store for the near to medium term future of the Cassandra project?

Yeah. So I think we're gonna see some exciting features around, you know, the pluggable storage engine. We're also gonna see better use of things like virtual tables. Right? That was a new feature that got introduced in 4.0. Virtual tables, you're probably familiar with them from, like, MySQL and other bits and pieces, where you can query statistics about the database itself in those virtual tables. We're gonna see much better use of that in Apache Cassandra. We're gonna see some improvements to the way that we do Paxos. We're gonna see some improvements to the way that we do schema and gossip management, which is gonna make operations around large clusters, and changing schemas on large clusters, way, way, way easier. So we're gonna see some great work around that.

We'll see some stabilization of what's called transient replication, and that's a really cool technology. It's in there at the moment, but it's currently experimental, and we're gonna see some improvements around that. Transient replication is a really, really cool way of increasing the durability without really increasing the cost associated with it. Say, you know, you wanna go from a replication factor of 3 to 5: it allows you to get that replication factor of 5 without needing to run quite as many nodes. Right? So it's a great kind of durability booster.

Yeah. There's gonna be heaps coming down the line. Right? I know that there's a ton of developers that, during the 4.0 feature freeze, were like, I just wanna get my ticket in and get this kinda committed, so 4.1's gonna look really, really cool from that perspective.

Are there any other aspects of the Cassandra project or the ways that it's being used or the recent release that we didn't discuss yet that you'd like to cover before we close out the show?

I think this has been a really wonderful tour. You know, we've touched on everything from, you know, Cassandra being a great fit for cloud native applications, to some of the journey and the history and where it's a great choice. I think, really, I'd just invite all your listeners who have previously used Cassandra or have always kind of been interested or are just hearing about it for the first time to actually go to cassandra.apache.org and download 4.0 and give it a go. Right? I think you'll be pleasantly surprised. You know, there's a whole bunch of resources out there to help you get started on the site now. We've also undergone a bit of a site refresh for the Cassandra project as well. So definitely go download 4.0, give it a try. And, you know, if you're already running Cassandra, think about doing that upgrade because there's some really good things in store for 4.0.

And I guess briefly talking to the upgrade aspect, are there any sort of notable upgrade gotchas that people need to be aware of before they click go and go from 3.x to 4?

Yeah. Definitely. So make sure you are on the most recent minor patch version of whichever branch you're on. So Cassandra 4.0 supports upgrades from 3.11 and 3.0, and make sure you're on the most recent minor versions of those.
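
As a small worked example of that rule, here is a Python sketch of a pre-flight check that a cluster's current version is on one of the two branches (3.0 or 3.11) from which a direct upgrade to 4.0 is supported. The `can_upgrade_directly_to_4_0` helper is hypothetical, it assumes purely numeric dotted version strings, and it deliberately does not check for the latest patch release because those numbers are not given here.

```python
from typing import Tuple

# Branches named as supported starting points for a direct 4.0 upgrade.
SUPPORTED_SOURCE_BRANCHES = {(3, 0), (3, 11)}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '3.11.10' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def can_upgrade_directly_to_4_0(current_version: str) -> bool:
    """Check that the major.minor branch is one of the supported sources."""
    branch = parse_version(current_version)[:2]
    return branch in SUPPORTED_SOURCE_BRANCHES


for version in ["3.11.10", "3.0.24", "2.2.19"]:
    print(version, can_upgrade_directly_to_4_0(version))
```

A cluster on an older branch would need to move to one of the supported branches first, and then to the latest patch release of that branch, before attempting 4.0.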

If you are running large clusters, and by large clusters, I mean, for a 105100

nodes, wait until 4.0.1.

There is a patch that makes those upgrades for those very large clusters a little nicer,

a little easier, particularly when it comes to, you know, if a node goes down during that upgrade process.

Honestly, that's it. Right? Like, just, you know, get your current cluster up to the right version. Double check the CHANGES.txt file, which will have a whole bunch of information about, if you do things certain ways, what to pay attention to, you know, how to change your config file to make that new one. But, yeah, give it a go. It should be a pretty painless journey.

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Yeah. Definitely. So I think this is one of my pet topics to talk about. You know, we, as software engineers, spend a lot of time, you know, doing things with data and where data lives that aren't necessarily always related to the core application functionality. Right? So what I mean by that is, you know, sometimes we've got to do things like cache records because it's slow getting them from one database to the other. Right? Or we've gotta do something where it's like, hey, you know, we might be on a MySQL database, and we've got one big customer. We're gonna shard that out to another MySQL database, right? Or we might suddenly have a requirement to start encrypting records, right, for, you know, either PCI or privacy or whatever it might be. Right? These are all things that we have to do as developers where it doesn't change the behavior of our application. Our application doesn't need to know about it, you know, as long as we can run that same query and it comes back with the right answer. We really don't need to, you know, have that logic built into our application. Right? So I think one of the challenges that we have when it comes to data management is to actually, you know, stop making developers do some of that schlep or that grunt work and start working on tools and capabilities that make it easier, where we can do that at either the database layer or at the driver layer or whatever that might be, but make some of those things a lot easier, where the developer really shouldn't actually care as long as they're getting that query back in the right way. Right? So we've got some things that we're working on around that space, but it'd be great to see other people kinda working on that space around that interaction. Right? So, yeah, that would be my view on what's the biggest gap when it comes to tooling or technology for data management.

Well, thank you very much for taking the time today to join me and share the work that you've been doing with Cassandra and helping to support the community and some of the news that's been coming out of that project lately. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day.

Thanks so much, Tobias.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
