Summary
Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have remained high due to the level of technical understanding and operational capacity required to run at scale. DataStax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the Astra platform is and the story behind it?
- How does streaming fit into your overall product vision and the needs of your customers?
- What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
- What are the core use cases that you are aiming to support with Astra Streaming?
- Can you describe the architecture and automation of your hosted platform for Pulsar?
  - What are the integration points that you have built to make it work well with Cassandra?
- What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
- What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
- What is the process for someone to adopt and integrate with your Astra Streaming service?
  - How do you handle migrating existing projects, particularly if they are using Kafka currently?
- One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
  - What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
- What are the ways that you are engaging with and supporting the Pulsar community?
  - What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
- What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
- When is Astra the wrong choice?
- What do you have planned for the future of Astra?
Contact Info
- Prabhat
- @prabhatja on Twitter
- prabhatja on GitHub
- Jonathan
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pulsar
- DataStax Astra Streaming
- DataStax Astra DB
- Luna Streaming Distribution
- DataStax
- Cassandra
- Kesque (formerly Kafkaesque)
- Kafka
- RabbitMQ
- Prometheus
- Grafana
- Pulsar Heartbeat
- Pulsar Summit
- Pulsar Summit Presentation on Kafka Connectors
- Replicated
- Chaos Engineering
- Fallout chaos engineering tools
- Jepsen
- Jack VanLightly
- Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey. And today, I'm interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud native streaming platform built on Apache Pulsar. So, Prabhat, can you start by introducing yourself?
[00:01:47] Unknown:
Sure. Hi. This is Prabhat Jha. I'm head of engineering for Astra Streaming here at DataStax.
[00:01:54] Unknown:
I'm Jonathan Ellis. I'm CTO at DataStax, and I work on streaming with Prabhat.
[00:02:00] Unknown:
And, Prabhat, do you remember how you first got introduced to data management?
[00:02:03] Unknown:
I got into data management by accident, actually. After working for Devos for a while, I decided to do a startup, and that startup was related to mobile application performance monitoring. So things like New Relic or Datadog, but for mobile apps, way back in 2011. And when we launched it, we realized that we were getting, like, these metrics and logs from hundreds and thousands of devices every minute. And that's where, like, it really hit me hard that I had to store that kind of data volume and analyze it. And back then in 2011, it was, like, sending that data from mobile SDKs to Amazon SQS and then basically processing that data and storing it, searching, and all this stuff. Right? That's how I got into it way back in 2011.
[00:02:47] Unknown:
And, Jonathan, how about you? We were talking before the show started about, you know, jobs in college and how colleges don't have the budget to hire someone who knows what they're doing, so they hire students. And startups are kind of the same way, and so in 2005, a startup hired me to build an object storage engine, basically like S3, but specialized for backup data. And I was not qualified at all to do this, but they didn't have the budget to hire someone who was qualified, and so I got the job. That's my sales pitch for anyone who's thinking about joining a startup is that, you know, you will get to, you know, tackle challenges that are outside of your comfort zone and outside of what you've, you know, proved on paper that you can do. And that was my entry into kind of the big data space.
[00:03:38] Unknown:
Yeah. I can definitely agree that working at startups is a good way to kind of stretch your skill set because, as you said, you're going to be tasked with things that you never even knew were a problem until you have to try to solve them. And in larger companies, that's, you know, the domain of some senior engineer who has all of the credentials and has all the experience so that they can just kind of bang something out real quick and not have to really do a lot of research and or exploration. And, you know, early in my career as a sysadmin, there are a lot of things that got thrown my way that I had no idea how to tackle, but you just kind of figure out a way through it, and it's definitely a good experience. And so that brings us now to where you are today with Astra platform.
And I know that that's a new product that you're offering at DataStax, where DataStax has historically been a company built around Cassandra, and now you're adding Pulsar to the overall offerings. I wonder if you can just start by giving a bit of an overview about what you're building with the Astra platform and specifically Astra Streaming and some of the story behind how you got to where you are now. The story there is that
[00:04:46] Unknown:
people have been using Cassandra with streaming use cases for almost as long as Cassandra's been around, you know, since it hit 1.0. And in particular, people use it a lot with Apache Kafka. But that's kind of a situation where, you know, they use Kafka because there isn't a better option. In particular, there's a lot of friction with Kafka and Cassandra around the point that Cassandra is well known for being the best database in the industry to use if you need to replicate across multiple data centers, and Kafka doesn't do that. Kafka is a single data center architecture. That, you know, causes a lot of tension because you've got your system of record that is able to do this key architectural facility, whereas you've got your message bus that isn't able to do that. And so about a year ago, I started looking at kind of the streaming space to see, you know, is there another technology that DataStax could invest in that's going to be more synergistic with Cassandra.
And that's when I found that, you know, Pulsar has been around for about 5 years. It was open sourced by Yahoo and donated to the Apache Software Foundation. But around 18 or so months ago, Pulsar passed a maturity inflection point to where you can use Pulsar in production now and expect it to work without having to have, you know, a committer on the team or someone who's digging into the code to figure out what that stack trace means. And Pulsar has a much more modern architecture than Kafka. So it's designed around separate compute and storage. It's designed around multi tenancy. It's designed around handling both PubSub and queuing workloads, which, you know, we can talk about that later. That's interesting.
So here's this thing that does geo replication. It has all these other benefits. And so we acquired a Canadian startup called Kesque that had created a Pulsar-as-a-service offering, last October, and we've turned that into this new product called Astra Streaming that exists alongside our Cassandra as a service
[00:07:16] Unknown:
in the Astra platform as well. And in terms of the streaming use case, what are some of the primary drivers for incorporating streaming with a Cassandra database and some of the overall product vision that you have for incorporating streaming into the DataStax offerings?
[00:07:37] Unknown:
This actually comes back to what I mentioned earlier, that Pulsar does both PubSub and queuing workloads. And so PubSub means that you have topics that you publish messages to, and then you have subscribers that subscribe to those topics. And each subscriber gets or can read a copy of all the messages in that topic. And so it's kind of a fan out of those messages to anyone who's interested in them. Whereas the queuing side is more of, I want to send this message out to a bunch of potential consumers and load balance it across them. I only want it to be processed once. And so both of these are used in microservice based applications where, you know, you don't want to have your microservices calling each other's APIs directly because that takes the micro out of the microservice design. You're basically creating a larger coupled unit out of those services when they're directly coupled. So you decouple those by putting a message bus in between, and so that's where, you know, Astra Streaming comes in is, you know, we have this Astra database that people are using, and it gives them, you know, significant advantages over using a system that only works on one cloud.
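As a toy illustration of the two delivery models being described (plain Python, no Pulsar client; every name here is ours, not Pulsar's): pub/sub fans each message out to every subscription, while queuing load-balances each message to exactly one consumer.

```python
from itertools import cycle

def fan_out(messages, subscribers):
    """Pub/sub semantics: every subscriber receives a copy of every message."""
    inboxes = {name: [] for name in subscribers}
    for msg in messages:
        for name in subscribers:
            inboxes[name].append(msg)
    return inboxes

def shared_queue(messages, consumers):
    """Queuing semantics: each message is delivered to exactly one consumer
    (round-robin here; a real broker balances by consumer availability)."""
    inboxes = {name: [] for name in consumers}
    for msg, name in zip(messages, cycle(consumers)):
        inboxes[name].append(msg)
    return inboxes

events = ["e1", "e2", "e3", "e4"]
print(fan_out(events, ["analytics", "audit"]))     # every subscription sees all events
print(shared_queue(events, ["worker-a", "worker-b"]))  # events split across workers
```

In Pulsar terms, the first function corresponds roughly to multiple exclusive subscriptions on a topic, and the second to a single shared subscription with multiple consumers.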
So Astra is open source. It does hybrid cloud replication. If you want to run it on your own data center and as well as in the public cloud, you can do that. And we have a native cloud service for when you don't want to manage that yourself. So having both of the pieces of the stateful infrastructure that you need to run a microservice application on, the database and the message bus, that's significantly more powerful than just having 1 of those.
[00:09:30] Unknown:
And we were talking to a bunch of customers, and we realized that they had multiple systems installed. Like, for streaming use cases, they could have a Kafka kind of system. For messaging use cases, they would have, you know, an ActiveMQ kind of system or a JMS-based implementation. Right? And, ultimately, even though they were using different messaging systems, the data was being stored in Cassandra. So we were like, what is a platform that can solve these problems in one messaging system so that our customers don't have to install and manage and operate all of these things? Right? It's a lot of complexity to be able to manage these different architectures at scale. Right? We were looking for that kind of system, and that's where the advantage of Pulsar also is, that, you know, it's easy to sort of migrate from your legacy JMS-based implementation and all this stuff into streaming based on Apache Pulsar. You can migrate your RabbitMQ workloads to Pulsar. And, of course, you can do the same thing for Kafka as well. So just as an organization
[00:10:29] Unknown:
where you're already flooded with, like, so many tools and systems and services that you need to build for data warehousing, analytics, and machine learning pipelines, it just helps you consolidate all of that into one. And you mentioned that you had been doing some work with Kafka and that Kafka was sort of the only option for a while, but that there were some pain points. And you mentioned too that you had taken the time to revisit the overall streaming ecosystem. And I'm wondering what your initial criteria were as you were starting that search again to determine what technology do I want to bring in to complement my existing product and some of the potential other options that you were considering before you ultimately decided on Pulsar?
[00:11:14] Unknown:
So we were looking for something that, you know, would fit well with Cassandra, which was kind of our starting point there. And so we were looking for something that's high throughput, that's low latency, that does replication across multiple data centers. I think those are kind of the table stakes. And then when I started looking a little bit closer at Pulsar, I realized that, you know, they'd already designed in, you know, separate compute and storage and multi tenancy, which are super critical when you're going to build a cloud service on top of something. That became part of my criteria as well.
And I'm trying to think, but I don't remember anyone else that does all 5 of those. Yeah. And so, yeah, Pulsar is kind of in a class by itself in that respect.
[00:12:02] Unknown:
Yeah. And also, DataStax has made a big bet on a data-on-Kubernetes kind of architecture. There's lots of work on running Cassandra efficiently on Kubernetes, so we also wanted to make sure that the platform we chose was cloud native. And if you look at the Apache Pulsar architecture, because it was only created in, like, the 2015-16 time frame, Kubernetes was already there. Right? And when we looked into it, it had a turnkey installation process for Kubernetes. So you don't have to, like, you know, reinvent this whole thing from scratch to efficiently run in a cloud-native environment. And on top of what Jonathan says, I think one of the important advantages is read and write path separation.
In different messaging platforms, if there are lots of writes, then the reads are affected. If there are lots of reads, then writes are affected. You know, the volume of data keeps increasing. You want a system that can handle the read and write paths separately. And Jonathan already mentioned that Pulsar has a separate way to do the storage and a separate way to do the compute, and that also basically manifests in the read and write path separation as well. So I could have a subscriber which needs to pull messages that have been in the system for, like, the last one month, right from day one. When this subscriber wakes up and starts pulling the data, it will have almost no effect on the write path of the system. Right? So those were the important platform criteria we had, because our customers who use Cassandra to store millions and millions of records, you know, they need this kind of scale, and we just didn't find anything else that could handle this.
[00:13:40] Unknown:
And so in terms of the actual architecture that you're using for the hosting and management
[00:13:46] Unknown:
of this Astra service. I'm wondering if you can dig into some of the technical capacity that's necessary to be able to operate Pulsar at scale and integrate it nicely with the Cassandra hosting that you've already got available? Pulsar is cloud native. It works out of the box with Kubernetes. It comes with a bunch of Helm charts, so you can get going very quickly. So it has that built into it. But other than at startups, you know, there are not many known use cases of running Pulsar as a service in a multi-cloud environment where, like, multiple people can use the same platform. Right? So it's one thing to have multitenancy built into the platform. But if you're running that platform in your own data center for your own teams, that's a different thing. If you're exposing it outside to the world, all of a sudden the security concerns come to mind. The guardrails that you have to enforce to basically make sure that a noisy neighbor problem is not there. All the stuff that you have to think about when you're launching it as a service in the cloud. That's where, basically, we said, okay, let's look at the components of Pulsar, which of the brokers and bookies and ZooKeepers can be isolated.
But can we run it in a way that if a customer has lots and lots of data, we can still handle them without affecting the other tenants that are using the same platform? Because at cloud scale, we didn't want to be in the business of installing a separate Pulsar cluster for each customer. Then it becomes very, very expensive. So you have to think about resource sharing, security isolation, and all this stuff. At the same time, if one or two of these customers have really, really high volume usage, you want to be able to assign a sort of dedicated set of resources to them. And thanks to Apache Pulsar's cloud-native architecture on Kubernetes, you could do this kind of assignment, where some of these brokers and bookkeepers will be dedicated to, like, this customer kind of thing. Even at that point, you still don't have to launch a separate cluster for them. Right? So that's, like, one aspect of it. The other one was, like, guardrails. So, you know, if you're a developer, sometimes you can accidentally flood the system. So imagine somebody writes a script, and that ends up creating, like, a million topics for the tenant. Like, that would be a very bad thing. Right? So we basically went through the broker configuration, BookKeeper configuration, and ZooKeeper configuration. There are, like, thousands of combinations of things you can do, and you don't know which one's gonna hurt you in the long run. Right? We actually went through the exercise of looking at each configuration one by one: which ones do we need to enforce guardrails on? Right? So, you know, when you sign up, obviously, we don't want you to create, like, a hundred tenants. We don't want you to create a million topics. So how many of these can you create?
Another aspect: as your clients start sending data to the platform, you know, it's possible that you are sending lots and lots of data to our platform, but the subscribers and consumers are not there. So there's a huge backlog. We didn't want it to get to the point where we are, like, storing gigabytes of data for customers just sitting on the platform without getting consumed. Right? So the guardrails around, like, how much backlog storage you should have. All these kinds of things came into building this control plane. And, also, because of Pulsar's multitenancy, if you were running it in your own data center, you would be able to, like, admin the brokers and bookkeepers yourself. But in the hosted service, we are not gonna give you access to the underlying infrastructure because, you know, it's a multi-tenant system. You are not the only one running on it. Right? So here's the interesting exercise we ended up doing. What we have is basically the Pulsar binary plus a control plane, and this control plane is the one that is basically acting as a proxy for all the admin operations and things like that, so that we know what is happening. And because of this, we can monitor usage better and all that stuff.
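As an illustrative sketch of what such guardrails look like at the configuration level, here are a few broker-side settings of the kind being described. The key names follow Pulsar's `broker.conf` conventions, but exact names and availability vary by Pulsar version, and the values are invented for illustration, not Astra's actual limits:

```properties
# Cap how much structure a single tenant can create (illustrative values)
maxNamespacesPerTenant=20
maxTopicsPerNamespace=1000

# Bound per-topic fan-in/fan-out to contain noisy neighbors
maxProducersPerTopic=100
maxConsumersPerTopic=500

# Limit unconsumed backlog, and choose what happens at the limit:
# hold producers rather than accumulating unbounded storage
backlogQuotaDefaultLimitGB=10
backlogQuotaDefaultRetentionPolicy=producer_request_hold
```

In a multi-tenant hosted service these defaults would typically be enforced centrally, with the control plane able to override them per tenant, as described above.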
[00:17:37] Unknown:
In terms of the proxying and being able to filter the admin operations but still be able to expose some of the necessary capabilities to the end user, I know that the open source project has an admin interface available to it, and I'm wondering what your approach was for either adapting the existing tooling or just using that as inspiration for building your own interface to the underlying Pulsar clusters that you're managing for your customers?
[00:18:05] Unknown:
Pulsar has a great ecosystem. I know we talked a lot about the architecture and everything, but Pulsar has a very good ecosystem. You know, there are lots of tools and SDKs built by the community, which have been contributed back to the project. And as part of that, they have also built these admin tools. But the difference was that DataStax already has this Astra DB as a service, which is Cassandra as a service. Right? So when we launched Astra Streaming, and because it's for the same customer, we wanted them to have the same experience of, like, running a database in the cloud. That same experience should translate into running Pulsar in the cloud. So we had to do the authentication model, the authorization model, so that when you are logged into one, you don't have to sign in again. Like, if you are a DB admin on Cassandra, maybe it makes sense to have an admin role on Pulsar as well. Right? So those are the things we had to add on top of it. So it just didn't make sense for us to take what already existed with respect to admin tools and add that, because we had to redo it anyway. So a lot of the work that we have done in Astra Streaming to expose the service has been on the control plane, where we do this authentication, authorization, rate limiting, security enforcement, things like that. So that's why we ended up building ours from scratch.
The other advantage of building our own configuration panel is that if you look at the people who are gonna use the Astra Streaming service, they're gonna have different kinds of skill sets. You could be a very strong Java developer who knows a lot about Pulsar, but it could be somebody who has heard about Pulsar, who has heard about streaming, and is trying to see how it works for them. Those users are gonna require lots of hand holding. So we wanted to build an experience that basically caters to both users. If somebody's coming in, we wanted to give you the fastest way to get started with Pulsar. As a small tenant, for example, you literally sign up, and in less than 10 seconds, you've created a tenant with everything that is needed to get started with Pulsar. Right? I think those experiences we could not deliver without building from scratch. In terms of the underlying operations of the system,
[00:20:10] Unknown:
you know, obviously, at the admin layer, you want to be able to restrict the overall functionality or capabilities of end users to be able to dig into the guts of the system. But were there any other modifications that you had to make to the Pulsar brokers or the bookies to be able to propagate some of those restrictions down through the different layers, because of the way that it was designed and implemented as an open source project that is primarily intended to be executed within the confines of a single organization?
[00:20:40] Unknown:
One was definitely related to configuration. Right? So even though Pulsar has the brokers and bookies, as you mentioned, and ZooKeeper, we didn't want to expose those directly outside. Right? So there are a lot of security rules around that. So for example, Astra Streaming is available on multiple clouds. It runs on AWS, GCP, Azure, but the control plane is only on one cloud, because you need to have a central control plane. So now I have a control plane running on one cloud, and this Pulsar cluster is running on multiple clouds. So we wanted, for example, the control plane to be able to talk to the data plane on different clouds securely, but without the end user being able to do the same thing. So all these networking securities and firewall rules that you place between the Kubernetes clusters running on different clouds and all this stuff. Like, those were not easy to solve. Like, we had to go through the guts of it and everything. Right? With respect to the underlying broker and bookkeeper and all this stuff, I think a lot of it worked out of the box for us. The issue was more around, like, fixing the problems when it's under high load. Right? And when we launched, we didn't know how many customers we were gonna get, what kind of workloads we were gonna get. Right? We were, like, testing for tens, hundreds, and thousands of customers using this platform. As we were, like, flood testing the system with millions and millions of requests happening, like, per minute and those kinds of things, we found some edge cases, like memory management, garbage collection, and all this stuff that you just normally would not find if you're not a heavy user of the system. So we ended up fixing those problems and contributing back to the community. And, Jonathan, you wanna add some more to this? The other example I can think of is around the Cassandra
[00:22:21] Unknown:
Pulsar integration that we wanted to do. So Cassandra, you know, it's a NoSQL database. It's categorized in that space, but it is a NoSQL database with the concept of a schema. So in Cassandra, your data is typed. And in Pulsar, you also have the concept of a schema, and you can evolve the schema, and you can have messages that use, you know, an earlier version of the schema. You can have, you know, forwards compatibility with those. And basically, there were improvements we needed to make in how you could update a Pulsar schema dynamically and inspect it dynamically, to make that going back and forth between Cassandra and Pulsar seamless. Because one of the most common things you want to do is you want to take a Pulsar topic of events and, you know, dump those into a Cassandra table with, you know, some light filtering or maybe some light transformation.
But fundamentally, for every event in the topic, you want to generate, you know, zero or one rows in the table, or possibly multiple tables. We support that as well. And going the other way around as well, you wanna take a changelog of changes to a Cassandra table and put that onto a Pulsar topic. So we made some improvements around Pulsar schema handling. We contributed those to the Apache project, and they landed in the 2.8 release, which came out last month.
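A minimal sketch of the topic-to-table direction described here (plain Python; the event fields, table name, and helper names are all hypothetical, and a real sink would use a Cassandra driver with prepared statements rather than assembling CQL text):

```python
def event_to_row(event):
    """Map one topic event to zero-or-one rows: drop events without a
    user_id (light filtering), then project/rename fields to match the
    hypothetical table schema (light transformation)."""
    if not event.get("user_id"):
        return None  # zero rows for this event
    return {
        "user_id": event["user_id"],
        "event_time": event["ts"],
        "action": event.get("action", "unknown"),
    }

def to_cql(row, table="app.user_events"):
    """Render a row dict as a parameterized INSERT plus its bind values."""
    cols = ", ".join(row)
    placeholders = ", ".join("%s" for _ in row)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", tuple(row.values())

event = {"user_id": "u42", "ts": "2021-07-01T12:00:00Z", "action": "click"}
row = event_to_row(event)
print(to_cql(row))
```

The schema-handling improvements mentioned above matter precisely here: if the topic's schema evolves, the mapping layer has to be able to inspect the current schema at runtime rather than hard-coding field names as this sketch does.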
[00:23:49] Unknown:
In terms of some of the other tooling and sort of monitoring and logging aspects of running Pulsar as a service in a multi tenant high scale environment, what were some of the extra bits and pieces that you had to engineer to sort of scaffold the overall Pulsar environment that you're running and be able to ensure that your mitigations for, you know, noisy neighbors are effective and be able to identify users who are coming up against their quota limits and be able to offer them additional scale and capacity when they need it and to some of the other operational
[00:24:24] Unknown:
and tooling aspects of being able to provide a robust end user service. When you launch these services, as I said, Pulsar comes with a cloud-native setup itself. So when you install Pulsar out of the box from the community, it comes with a bunch of Helm charts, which include charts for, like, Prometheus, Grafana, Alertmanager, everything built in. So you're like, hey, I have everything. Why do we need anything else? Well, those only work with one cluster and not at this scale. So that's number one. Number two was that we needed to do true monitoring of the system. So what we have built is this open source project called Pulsar Heartbeat.
It basically sits outside the Pulsar cluster, and it's sending a synthetic workload, like, every second just to make sure that the latency across the brokers, the bookkeepers, and everything is working fine. So it actually sends a message on a topic. It creates topics on the fly, it creates consumers on the fly, and measures the end to end latency. Right? So that's how we know which cluster on which cloud is working well or not, for which tenant and all this stuff. As new tenants get added to the system, as new topics get created, as I mentioned earlier, we have enforced a bunch of guardrails around it. So if somebody is getting closer to the limits that we have enforced in the system, obviously, we wanted them to know. But then we also had to build a mechanism so that we can override those configurations for selected customers. Right? So those are the additional things we ended up adding into our control plane. We didn't have to modify the underlying Pulsar for these kinds of things because we have a control plane in front of it. Right? And that's also the beauty of the Apache Pulsar open source project. It's a very, very stable project. What we realized is that there were some enhancements needed on the security aspect of access control for the guardrails. But the underlying BookKeeper, the deployment architecture, the way everything works in a cloud native environment, is fairly good with Pulsar. So we didn't have to invest a lot in that. Of course, there are some significant differences when you're running Kubernetes on Amazon versus Google Cloud versus Azure, and we had to learn some of those. 1 of the features in Pulsar, for example, is that it comes with tiered storage by default. What tiered storage does is that you will have hot and live data in memory, but then you can offload your historical data to AWS S3 or GCP storage or things like that.
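As a rough sketch of the Pulsar Heartbeat idea described above, the probe below publishes a timestamped synthetic message and measures the end-to-end latency on the consume side. A `queue.Queue` stands in for a real Pulsar topic here; the actual tool drives real producers and consumers across brokers and bookies, per cluster and per tenant.

```python
# Minimal sketch of what a synthetic-workload monitor like Pulsar Heartbeat
# does conceptually: publish a timestamped probe message, consume it, and
# measure end-to-end latency. An in-process queue.Queue stands in for a
# real Pulsar topic; this is an illustration, not the real tool's code.
import queue
import time

def probe_latency(topic: "queue.Queue", timeout: float = 1.0) -> float:
    """Send 1 synthetic message through the topic and return round-trip seconds."""
    sent_at = time.monotonic()
    topic.put(("heartbeat", sent_at))        # producer side
    _, stamped = topic.get(timeout=timeout)  # consumer side
    return time.monotonic() - stamped

topic = queue.Queue()
latency = probe_latency(topic)
```

In the real service, a latency spike or a timeout on any cluster flags that cluster for investigation before customers notice.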
So that was super important to us because now we don't have to buy lots of expensive storage for our customers. What we have enforced, for example, is a policy that anytime the backlog reaches a certain threshold with respect to time and size, it automatically goes to the S3 bucket. Right? As a vendor, our cost to support a customer is minimal because we are not storing all of this data on expensive disks. This allows us to scale very efficiently because we all know that Amazon S3 and GCP storage scale very efficiently. So you can have a huge amount of data coming in, but we are not paying that much extra for it. We turn it on by default for you, and that is another advantage of doing that. As a customer, you know that you're not gonna lose data. We have tiered storage on, so even if your subscriber wakes up 1 week later, or you decide to add a new application that needs historical data, there's no data deletion. Everything is stored over there by default. Since Prabhat mentioned Pulsar Heartbeat, this is a good time to point out that as part of developing Astra Streaming,
[00:27:58] Unknown:
we've open sourced as much as possible of the tooling we've created around Pulsar to run this service. So Pulsar Heartbeat is open source. Our admin console is open source, and our Helm charts are open source. And not only did we say, hey, it's open source, good luck. We've actually bundled them together into a distribution of Pulsar that you can run on your own infrastructure, if that's what you want to do, with the included Heartbeat, with the included admin console, and so forth. Yeah. And there's hardly anything in our stack
[00:28:34] Unknown:
that is not open source, that we have not given back to the community. That's the strategic approach we have taken. We are not just thinking of open source as a dumping ground for things that we have done in secret. All of the things that Jonathan mentioned actually have been open source from day 1. And if you want to find that distribution of Pulsar, it's called Luna Streaming. So Astra Streaming is the service. Luna Streaming is run it yourself.
[00:28:58] Unknown:
And then in terms of some of the other sharp edges, you've mentioned a few different edge cases that you've run up against as far as being able to handle multitenancy and being able to scale it, you know, some of the bugs that you found. But what were some of the sharp edges that you've had to kind of sand down as you started to scale up and add in more customers and start to have to support such a wide variety of different workloads that, obviously, you weren't gonna be able to come up with on your own because customers always do strange and wonderful things.
[00:29:30] Unknown:
So when I started looking into Pulsar, I was talking to 1 of my friends who knows about both Kafka and Pulsar. And he described Pulsar as a beautifully architected system. It's designed by great system architects, but, in his view, the people who actually programmed it were not as great as programmers. And what he meant by that is that if you look at the APIs and all this stuff, it's not that intuitive. It's like, okay, this API works this way. This other API doesn't work that way. Right? And I'm a huge freak when it comes to the developer experience of a platform. So we kept running into those gotchas when trying to administer Pulsar using the API, and we wanted that to be consistent. We ended up making lots of fixes around that. 1 example is that in Astra Streaming, as well as in Apache Pulsar, you are able to connect to different destinations. So you can send data to a Pulsar topic, and it will go to Elasticsearch. It can go to Cassandra. Right? It can go to the various systems that exist out there. So 1 trivial example is that when you upload the Elasticsearch configuration, if your configuration is not good, you don't know until the data comes to the topic and it tries to connect to Elasticsearch. Right? Which is not a great user experience. If the configuration is not correct, you wanna know right away. Right? So those are the things we ended up fixing in our streaming service, and upstream, I think, as well, where this user experience of coming to the platform and being able to get started quickly with the infrastructure you already have, whether it's Elasticsearch or different kinds of systems, that's where we also ran into some issues that we ended up fixing. At a higher level,
[00:31:09] Unknown:
we went to market, we launched the Astra Streaming beta, and we said, hey, come build applications with Pulsar. And in several respects, the Pulsar API is a step forward from things like the Kafka API, where Pulsar builds in asynchronous publish and subscribe events. But nobody wants to take an application that's working against Kafka, or working against the JMS API, and rewrite it to use Pulsar. So it's fine for, hey, I'm building something new and I'll start with Pulsar. But it's a tougher sell for, hey, you should rewrite your code so that you can use our new service. So 1 of the things that we've been investing in is improving Pulsar's Kafka compatibility.
So there's a Java client for Pulsar that implements the Kafka API, and we've contributed some improvements to that. But we have also spent a lot of effort making it so that you can just take a Kafka connector, say a connector to Elasticsearch or a connector to MongoDB, to send data back and forth between those systems without writing any code. You can take any of those 120 plus connectors that exist for Kafka and drop them into Pulsar, and it's able to use them to talk to any of those systems. That was also released in Pulsar 2.8, so it's something that we're fairly proud of, driven by our experience with customers, that we're giving back to Apache.
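The compatibility approach described above is roughly a shim: keep the Kafka-shaped surface that application code already calls, and delegate to Pulsar underneath. The sketch below is purely illustrative (the class and method names are invented, not the real adapter's API), but it shows the shape of the translation.

```python
# Hypothetical sketch of the shim idea behind Kafka API compatibility:
# expose a Kafka-style send(topic, value) while delegating to a
# Pulsar-style per-topic producer, so existing application code does not
# have to be rewritten. Names here are illustrative only.

class PulsarStyleProducer:
    """Stand-in for a Pulsar producer: send() takes bytes, returns a message id."""
    def __init__(self):
        self.sent = []

    def send(self, payload: bytes) -> int:
        self.sent.append(payload)
        return len(self.sent) - 1

class KafkaStyleShim:
    """Kafka-flavored interface (topic + value) delegating to Pulsar-style producers."""
    def __init__(self):
        self._producers = {}

    def send(self, topic: str, value: bytes) -> int:
        # Lazily create 1 underlying producer per topic, as a client would.
        producer = self._producers.setdefault(topic, PulsarStyleProducer())
        return producer.send(value)

shim = KafkaStyleShim()
msg_id = shim.send("orders", b'{"sku": 1}')
```

The hard part in practice is the impedance mismatches (offsets versus message ids, partitioning semantics, and so on), which is what the contributed work addresses.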
[00:32:53] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Yeah. That was gonna be 1 of my next questions: the migration story for people who have an existing Kafka workload, because I know that there are a few different compatibility layers. There's the Java API that you mentioned. I also know that the folks at StreamNative have added a protocol handler to Pulsar for being able to speak the Kafka protocol for clients that aren't using Java as their implementation language. And then there's also the Pulsar IO set of libraries to kind of mimic what the Kafka Connect project was doing, but native to Pulsar. So it's definitely interesting to hear that you can run existing Kafka Connect libraries natively with Pulsar as well. I'm wondering if you can maybe spend a bit more time talking about some of the technical challenges and implementation details that you've had to work through to manage those migration paths for people who were already using Kafka and want to migrate to Pulsar?
[00:34:23] Unknown:
That's a little deeper than I can go. We can definitely introduce you to the engineers who wrote that code to talk about it. But actually, Andrey Yegorov, who worked on the Kafka Connect piece of things, gave a talk at Pulsar Summit about that. And he does go into, like, here are the impedance mismatches that I ran into and had to find a way around.
[00:34:46] Unknown:
As you're building out the Luna distribution specifically, maybe some of the considerations that you have for what to include, and the feature comparisons for people who decide to choose the Luna distribution versus the Apache open source version or the StreamNative distribution, or if there are any other distributions out in the ecosystem that you're tracking for being able to help people with that selection process?
[00:35:15] Unknown:
Yeah. Our goal with Luna Streaming is to give people a batteries included distribution, to borrow a phrase from my friends in the Python community. That's 1 part of Python's philosophy: when you download Python, you don't need to download a whole bunch of other stuff. Python has batteries included, and you can be productive with what you just downloaded. And that's our goal with Luna Streaming as well. That's why we've included the improved monitoring tools, the improved admin tools, the improved Helm charts. 1 thing that's unique about Luna Streaming is it smooths the on ramp for people who haven't already bought into the Kubernetes world. Kubernetes is the foundation of both our streaming and our database technology, and I think everyone recognizes that Kubernetes is the future of operations, but it isn't the present everywhere.
And so if we are saying, hey, to use Pulsar, you need to spend a week installing Kubernetes first, that's a big obstacle to some people. So we partnered with a company called Replicated to give you basically Kubernetes in a box as part of Luna Streaming. If you point it at a cluster of 3 machines or 9 machines, it will lay down a minimal Kubernetes environment on those machines or on those VMs
[00:36:40] Unknown:
and then install Pulsar on those for you. Yeah. And I will add that it's something similar for Astra Streaming. I know I mentioned a lot about guardrails and enforcing controls and everything. What we didn't want to compromise is that, underneath, what you are getting is Apache Pulsar, so that all the other tooling and frameworks and SDKs that exist in open source, built by the community or by different vendors, should work out of the box. So when you sign up for Astra Streaming and create a tenant for yourself, that is no different than creating a tenant for yourself on your locally running Pulsar. So the existing tools, like the Pulsar CLI, the performance tool, and the ecosystem of connectors and everything, those things will work out of the box for you. Obviously, with Astra Streaming, we wanna make sure that the connectors we support are tested heavily and work out of the box, so we can give you peace of mind with respect to how they work. But if you have a connector that works with open source, it will work with Astra Streaming out of the box. You don't have to do any modification.
I'd say with respect to licensing as well, there's no restriction. Everything is Apache licensed. There's no, like, open core and this and that. All this stuff that you see here, the enterprise peace of mind that you're getting, is with pure, truly open source software.
[00:37:58] Unknown:
The other thing that we do with Luna Streaming, let me characterize it this way: if you want the newest, hottest features in Pulsar, then you should get the official Apache Pulsar distribution. 1 of the things that we're trying to do with Luna Streaming is give people something that moves a little bit slower. And in exchange for that slowness, we do more testing, and we back port fixes as necessary. So when we have Luna Streaming 2.6.2, we pulled in some fixes that would appear later in Apache Pulsar 2.6.3, and there wasn't a Luna Streaming 2.6.0.
[00:38:38] Unknown:
Like, we waited until it was as stable as possible and then released that. In the enterprise market that DataStax serves, that's what they're optimizing for: more the peace of mind than the latest and greatest. And especially on security and stuff. Right? As you could tell, our customers are big enterprises, Fortune 500 companies and so on, and they have their own security processes. So lots of those kinds of findings, whatever they end up finding, we end up fixing early in Luna Streaming, as well as contributing back to the community. So there's a sort of ecosystem thing going on between Luna and open source Apache Pulsar.
[00:39:20] Unknown:
And another element of the Pulsar ecosystem and the Pulsar project that we haven't touched on yet is the Pulsar functions capabilities for being able to do ad hoc event driven execution of arbitrary code on the different messages within the topics. And I'm wondering if you can just talk through some of the tooling and user experience enhancements that you've been able to build in around how you manage the packaging and deployment of code to be run-in these functions environments and just some of the operational challenges of executing arbitrary user code in your managed service.
[00:39:58] Unknown:
Pulsar Functions, for people who don't know, let you write your own custom code in supported languages, which are Python, Java, and Golang as well. I'm not sure how good the Golang support is, but definitely Python and Java, and we test those quite a bit. So the idea is that you can push an arbitrary piece of code, and that will act on each message on a topic. That piece of code is usually for, like, message transformation or validation, or you wanna enrich the record in the topic with something else. You can do all this stuff. So Pulsar Functions is the underlying platform powering the Pulsar connector ecosystem. When we run a Cassandra connector, when we run an Elasticsearch connector, the underlying mechanism is actually Pulsar Functions. Right? So it is a first class citizen. Right? The question is, when you have a hosted platform, a managed platform, and you don't know the developers who are pushing code to it, how do you enforce security and guardrails around it? Right? That's a very, very difficult thing to do at scale, obviously, as you could tell. So in Astra Streaming right now, we don't allow you to upload custom code. We are working on that path where it will be available, and it's not that difficult either. Because we run Pulsar on a Kubernetes cluster, we can tie Pulsar functions to a pod in Kubernetes. Right? So what we are working on is that you upload your code, and it runs in 1 of the pods that is assigned to your namespace.
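For reference, a Pulsar Function is just per-message code. The real Python Functions SDK has you subclass `pulsar.Function` and implement `process(self, input, context)`; the plain class below mirrors that shape so the transformation logic is runnable here without a cluster, and the transformation itself is a made-up example.

```python
# Sketch of a Pulsar Function doing a light per-message transformation.
# The actual Python Functions SDK interface is pulsar.Function with
# process(self, input, context); this plain class mirrors that shape so
# the logic can run without a Pulsar installation.

class UppercaseFunction:
    def process(self, input, context=None):
        # Act on each message on the topic: here, normalize the payload
        # to uppercase. Real functions might validate or enrich instead.
        return input.upper()

fn = UppercaseFunction()
out = fn.process("order created")
```

Whatever `process` returns is published to the function's output topic, which is what makes chaining functions together natural.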
So you can do whatever you wanna do. If you wanna hurt yourself, you're only gonna hurt that pod. Right? The cloud native aspect of the architecture makes it relatively easier for us to support this. But having said that, we still have not done it because we wanna make sure of the networking aspect of it. We don't want anybody to write code that can sniff around the records of somebody else's topic and all this stuff. So we're still working on that, but I don't think we are very far from it, because of the underlying cloud native architecture and the way it runs in Kubernetes. Yeah. We will never run, for example, these Pulsar functions in the same JVM process itself. Apache Pulsar allows you to configure where this runs, whether it's in the same process, or on the same node, or in a different pod. We are not gonna do the same process or same node. It's gonna be the pod approach, where we can enforce some security around it and make sure that it works.
[00:42:37] Unknown:
And then 1 of the other pieces that you call out in the marketing material on the Luna homepage is the ability to perform machine learning on streaming data. And I'm wondering if you can talk through some of the use cases for that style of application, and some of the supporting systems that are necessary to be able to power that execution, and the capabilities that Pulsar has natively that make it a tractable problem for you.
[00:42:49] Unknown:
Yeah. When we talk to lots of our customers, and from my own personal experience at my last startup, 1 of the use cases of a queuing and messaging system was to build this machine learning pipeline. Right? Where you have a bunch of data coming in, you wanna update your model, you wanna serve the model, and all this stuff. In Pulsar itself, through Pulsar functions, you can run Python code and Java code, as we talked about. Right? Some of our customers, when they use Apache Pulsar for machine learning, are using Pulsar functions with TensorFlow if they're a Python shop. There are a couple of examples of customers using Deeplearning4j in Pulsar functions if they are, like, a Java shop. Right? So when I think about machine learning with respect to a messaging system, you have a bunch of data coming in, and you need to create a model. That means that each record in that topic likely needs to be cleaned.
Then somehow it will be enriched with some other data. So imagine a record comes in. You have some information about a customer or event in that payload that needs to be enriched with something in Cassandra. And with Pulsar, you're not talking to 3 different systems. You can chain different Pulsar functions, 1 step after another. So imagine a chain where 1 function does the transformation, the second does the enrichment, and the third 1 actually sends the data to your system, which is updating the machine learning model on the fly. Right? So we don't have a fixed way of saying this is how you should do machine learning. But if you look at the least common denominator of a machine learning pipeline, it is cleaning the data in different steps, and Pulsar functions are appropriate for chaining those steps. You can create this ML pipeline easily, and you don't have to maintain yet another piece of infrastructure.
And because you are running the Pulsar function in Pulsar itself, it's always up to date with the latest data. You can enforce schemas on it. Right? So if there's schema drift on the topic, your Pulsar function will not work, and you would know about that issue right away. Right? Those are the things in the Pulsar architecture that help you do this. Obviously, when it comes to serious machine learning use cases, where heavy model serving is involved and everything, you probably would not use Pulsar functions for that, because it's a thing in itself. In that case, your topic and Pulsar function would connect to that deployment somehow and do the message flow and everything. But for a lightweight machine learning pipeline and stuff, I think you can easily chain functions together.
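The clean, enrich, and emit chaining described above can be sketched in-process like this. In Pulsar, each stage would be its own function consuming the previous stage's output topic; here the Cassandra lookup is faked with a dict, and all names and values are invented for illustration.

```python
# Sketch of chaining per-record stages, 1 step after another, the way
# Pulsar functions would be wired topic-to-topic. Composed in-process
# here purely to show the data flow.

def clean(record: dict) -> dict:
    """Stage 1: drop empty fields from the incoming record."""
    return {k: v for k, v in record.items() if v is not None}

def enrich(record: dict) -> dict:
    """Stage 2: enrich with reference data; a dict stands in for Cassandra."""
    fake_cassandra = {7: {"tier": "gold"}}
    return {**record, **fake_cassandra.get(record.get("customer_id"), {})}

def to_feature(record: dict) -> dict:
    """Stage 3: hand the record to whatever updates the model downstream."""
    return {"features": record}

pipeline = [clean, enrich, to_feature]
event = {"customer_id": 7, "amount": 12.5, "junk": None}
for stage in pipeline:
    event = stage(event)
```

Because each stage is an independent function on its own topic, a failure or schema mismatch surfaces at the offending step rather than deep inside a monolithic job.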
[00:45:10] Unknown:
And then going back to the overall ecosystem question, 1 of the other interesting pieces of having companies like DataStax and StreamNative and some of the other ones out there using Pulsar in production at scale and investing in it is that it acts as an accelerator for the Apache project itself. And so I'm wondering, what are some of the elements of the road map that you are most excited for, that you're working towards, and some of the overall potential that you see for the Pulsar project now that there is much more momentum behind it over the past year or 2?
[00:45:55] Unknown:
This is another thing we have a Pulsar Summit talk about that people can refer to. It's an excellent half hour talk on applying chaos engineering to Pulsar. 1 of the things that we're focusing on is reducing the time between, hey, there's a dot 0 release, and, it's stable, we can build a Luna Streaming release on top of this. We're leveraging a set of tools that we built for testing Cassandra called Fallout. We open sourced those a couple years ago, and now we're applying them to Pulsar as well. Basically, the idea with Fallout is that you can declaratively specify scenarios that you want Fallout to apply, either sequentially or concurrently, to a distributed system.
So in the Pulsar space, we can be adding BookKeeper storage nodes at the same time a ZooKeeper node fails, at the same time as we're adding a new tenant and a bunch of topics, at the same time as an existing tenant is sending a 100,000 messages per second through the system. What happens when you start composing these scenarios together, and can you find any misbehaviors that you need to fix?
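The declarative idea can be sketched as scenarios-as-data plus a small runner. The step vocabulary below is invented for illustration, not Fallout's actual DSL, and the "system" is just a dict recording what was applied rather than a real cluster.

```python
# Hypothetical sketch of declaratively composed chaos scenarios in the
# spirit of Fallout: each scenario is plain data, and a tiny runner applies
# the steps to a system under test. The real tool drives actual clusters
# and can interleave steps concurrently; this only shows the shape.

scenario = [
    {"action": "add_node", "role": "bookkeeper"},
    {"action": "kill_node", "role": "zookeeper"},
    {"action": "add_tenant", "topics": 50},
    {"action": "load", "messages_per_second": 100_000},
]

def run(scenario: list, system: dict) -> dict:
    """Apply each declared step to the system under test, in order."""
    for step in scenario:
        system.setdefault("applied", []).append(step["action"])
    return system

system = run(scenario, {})
```

Keeping scenarios as data is what lets non-Clojure users compose new fault combinations without touching the underlying Jepsen machinery.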
[00:47:07] Unknown:
And that puts me in mind of the Jepsen project with Kyle Kingsbury. I'm wondering if you or anyone else in the community has engaged with him to do some of that type of stress testing and sort of ferreting out the distributed systems design flaws that exist in Pulsar to be able to address them?
[00:47:25] Unknown:
2 answers to that. So Fallout is built on top of Jepsen. Jepsen is a Clojure tool, and to create new scenarios, you have to write Clojure. And so we said, you know, we love Clojure, but it's not a super common skill set. So we created a domain specific language that allows you to build these scenarios without having to go all the way down to Clojure. The other thing that Fallout does is it gives you basically a CI service for running these, so you can set it up and it will generate reports and performance graphs and all those things from a test run that you'll want to investigate later.
The other answer that I think is super interesting here is there is an engineer named Jack Vanlightly who was hired by Splunk relatively recently. And he's applying formal modeling techniques to Pulsar and to BookKeeper. So Fallout and Jepsen will allow you to run a whole bunch of different scenarios and see what happens, but you can't prove that there's not a scenario that you didn't happen to run. What formal modeling does is it lets you build a theoretical representation of, for instance, the BookKeeper storage protocol.
You can test it at a much lower level. It's also not a proof technique. It's not gonna prove that the BookKeeper protocol is correct. But instead of testing thousands of scenarios in a week, I can test millions of scenarios in an hour. And so it lets me be much more exhaustive about my confidence that the BookKeeper protocol is correct. As part of the work that he did, he actually did find a bug in the protocol several months ago, and we were able to fix that as a community. So I think the combination of formal tools that are more commonly coming out of academia, and chaos engineering coming out of the very practical side of companies like Netflix, is a very powerful combination for improving the correctness of distributed systems.
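The difference from scenario testing is exhaustiveness: a model checker enumerates every reachable state of a specification and checks an invariant in all of them, rather than sampling runs. The toy below does that for an invented 2-step ack protocol; it illustrates the idea only, and is not BookKeeper's actual replication protocol or a real TLA+ model.

```python
# Toy illustration of exhaustive state exploration, the idea behind model
# checking a protocol spec: enumerate every reachable state and assert an
# invariant in all of them. The protocol here is invented for illustration.

from itertools import product

def reachable_states():
    """Enumerate states (message_sent, ack_received) where acks follow sends."""
    states = []
    for sent, acked in product([False, True], repeat=2):
        if acked and not sent:
            continue  # unreachable: you cannot ack an unsent message
        states.append((sent, acked))
    return states

# Invariant: in every reachable state, an ack implies the message was sent.
violations = [s for s in reachable_states() if s[1] and not s[0]]
```

A real checker like TLC does the same thing over millions of states of a genuine protocol spec, which is how subtle bugs that no sampled scenario happened to hit can still be found.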
[00:49:34] Unknown:
In terms of the ways that your customers are using the Astra platform, what are some of the most interesting or innovative or unexpected use cases that they've employed?
[00:49:45] Unknown:
I think the generic answer to most of those is they're doing crazy stuff with Pulsar functions. You can do arbitrarily complex things without having to stand up and manage a separate cluster, because Pulsar is managing that for you, or it's incorporated into the same Kubernetes platform as Pulsar. 1 company, a financial services company, is actually using Pulsar functions for all the business logic for the transactions flowing through their system. And I guess 1 of the reasons that's so cool to me is that when you're a fintech company, you're super, super risk averse. And so these guys did all kinds of homework to make sure that all the failure scenarios were spoken for, and then they went ahead and put it in production. So that would be my favorite example.
[00:50:38] Unknown:
Oh, 1 funny example, actually, and this is not a real event streaming example. When we first announced that we were gonna do Astra Streaming, and that it is an event streaming platform, a few people actually understood it as a literal video streaming platform, where, like, you have a video camera streaming events and other stuff. So when we launched the alpha version, we asked, hey, what's your use case? And somebody was like, I'm excited to try this platform to see how I can do live video streaming. We were like, uh-oh, this is not the streaming we are talking about here. That was a funny thing that we ran into as well.
[00:51:13] Unknown:
The joys of the challenge of naming things and the collisions of meaning that occur in the English language.
[00:51:19] Unknown:
Yeah. By the way, if anybody has any recommendation on what to call it, please let us know. We have tried data streaming. That gets confusing with something else. We have tried event streaming. Obviously, as I just said, people thought that was something different. So we still don't know what is the term that describes what we do here.
[00:51:37] Unknown:
I don't think that there's a succinct way to put it, but there are definitely more verbose ways that you could add clarity.
[00:51:43] Unknown:
Exactly.
[00:51:44] Unknown:
Alright. And then in terms of your own experience of building the Astra platform and onboarding customers and being able to support it at scale, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:58] Unknown:
I think just going through the guardrail process to figure out what we would allow a regular user to do so that they are not hurting themselves and they're not hurting us. It is a pay as you go system. So when you sign up for Astra Streaming, you are not thinking about how many brokers and bookkeepers you need. We just charge based on data in and data out. At the same time, we wanted to ensure that at the end of the month, you're not surprised by a huge bill because a rogue process was sending a bunch of data and all this stuff. Right? So just thinking through what is the right guardrail, because everybody's use case is different, and how can we make this platform so that we can turn on these things easily for different customers. I think that was the most difficult part, because different configurations interact with each other differently. So you have to make sure that the combinations and the defaults that we give are correct. That was definitely a challenge. And with respect to user experience, actually, we are still learning, because it's only been a few months since we launched streaming as a platform as a service here. The people who are coming to our platform have different kinds of experience. Some have been using Kafka for a while, and they have heard that this Pulsar is a new thing that really works well, so they're coming from that background. Sometimes people have used Pulsar for a year, and they want to see if they can have a managed offering because they don't wanna do it themselves.
And then you have a completely new set of app developers who are building mobile apps and web applications using GraphQL and whatnot, and their understanding of a streaming system is very different. Right? So how do you build a user experience where obvious things are obvious, but the complicated things are also there, in a way that caters to both of them? That has been a difficult balance for us to get right. You know, I don't think we have cracked the code yet on what is the fine line between hand holding a brand new user to the system versus an expert 1. We're still working on that, and our design team does a bunch of interviews and everything to make sure that we get it right.
[00:53:56] Unknown:
If I'm looking forward to, like, what the next challenges are, we touched on 1 of those around allowing arbitrary user code to run in Pulsar functions and how we plan to tackle that. Looking ahead longer term, I think it's gonna be interesting to tackle streaming analytics as well, on top of the raw foundation that Pulsar gives you. And you can do that kind of by hand with Pulsar functions, but it's much more involved doing it that way versus using something like Flink and Flink SQL.
[00:54:25] Unknown:
So do we provide Flink as a service as well? That's something that we're gonna be looking at. Yeah. And along the same lines, you know, I've heard a few episodes about change data capture, and this is a beast in itself. If you have a Postgres or MySQL kind of database, where everything is happening on 1 node and you have all the logs that you need to replicate to another system, it gets easier. But when you have a system like Cassandra, a NoSQL database running on multiple nodes with no master concept, that means you need to reconcile this changelog into 1 place before you send it to Elasticsearch or a different kind of system. Getting that turnkey out of the box, well, we are a Cassandra company, so we should know how to do this better than anyone. Right? So we're working on that, fine tuning that experience so that if you need to do CDC on Cassandra, Astra is the best platform. That's another thing where we already have an offering, but we're constantly fine tuning it as well. That has been a super interesting project for us here. For people who are interested in exploring Pulsar and want to be able to get it up and running fairly quickly, what are the cases where Astra might be the wrong choice?
[00:55:33] Unknown:
Man, for people who wanna get up and running with Pulsar quickly, you know, Astra is the right choice, like, for everyone. Well, I guess if you need Pulsar functions, as of July 2021 we don't have that yet, and so that would not be a good choice.
[00:55:47] Unknown:
Yeah. That'll be one reason. As I said, you know, the Pulsar ecosystem has 100-plus source connectors, sink connectors, and all this stuff. Right? We're not sure that the quality of all of those connectors is really up to the standard. So we only support a limited set of connectors based on what our customers have asked for. So if you need to connect to an Oracle database for whatever reason, you won't be able to do that in our system. We don't have that. Those are the cases, I would say: if we don't have the features that you need, then of course you're not gonna use Astra Streaming for that. But if you know that you need Pulsar, I think that Astra Streaming is the fastest way to get started, for sure.
[00:56:27] Unknown:
You've mentioned a few things already about what you have planned for some of the upcoming road map, but what are some of the other elements that you're excited to work towards in the near to medium term, both for the Astra project and also more broadly for the Pulsar ecosystem?
[00:56:42] Unknown:
So as a service provider, we care a lot about both the day 1 and the day 2 experience. Of course, I talked a lot about this being the fastest way to get started, but we also wanna make sure that we give you the best service for day 2 operations with respect to monitoring, alerting, failover capabilities, making sure that everything just works out of the box. I think we're just starting on that journey. We give you Apache Pulsar as a service because we wanted to consolidate your Kafka, messaging, and queuing solutions all into one. But then when you start using it, we don't want to have you do 10 more different infrastructure things to make sure this is working properly. So those are the day 2 things. Right? You will see us working a lot on those so that when you come to the platform, it is truly a good service. That's one from the usability perspective.
[00:57:30] Unknown:
The CDC work, as you mentioned, making it turnkey with the cloud database that we have, is another one. And then also having a laundry list of supported connectors that work out of the box. Alright. Well, for anybody who wants to get in touch and follow along with you and try out Astra Streaming, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:58] Unknown:
I've been thinking about that since last night, when I saw the questions. Building the data pipeline, the machine learning pipeline, is still very complex. Right? You need to know the data source, the enrichment and transformation, how to connect to the different systems, building the machine learning model on top of that. It still requires a lot of hand-holding and gluing things together. I know that there are vendors trying to consolidate it. But when it comes to the open source ecosystem, where you can connect these streaming systems and the databases and machine learning models and all this stuff, I think that is still a big challenge. I would say broadly across the industry,
[00:58:37] Unknown:
a big gap is around infrastructure that supports seamless deployment across multiple regions and multiple data centers. We're trying to close that gap on the Cassandra side and on the Pulsar side, but there's still everything else, and all the other stateful infrastructure
[00:58:57] Unknown:
that, by and large, is just starting to tackle that. Well, thank you both very much for taking the time today to join me and share the work that you're doing on Astra and contributing to the Pulsar ecosystem. It's definitely a very exciting project and one that I'll be keeping a close eye on and probably play around with a bit myself. So thank you for all of the time and effort you've put into that, and I hope you each enjoy the rest of your day. Thanks so much. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introductions and Backgrounds
Overview of Astra Streaming and DataStax Offerings
Streaming Use Cases and Integration with Cassandra
Technical Architecture and Challenges of Pulsar
Admin Interface and Security Considerations
Monitoring, Logging, and Operational Tools
Kafka Compatibility and Migration Paths
Luna Streaming Distribution and Features
Pulsar Functions and Machine Learning Integration
Future Roadmap and Ecosystem Development
Customer Use Cases and Experiences
Day 2 Operations and Future Plans
Biggest Gaps in Data Management Tools