Summary
Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have remained high due to the level of technical understanding and operational capacity required to run at scale. DataStax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the Astra platform is and the story behind it?
- How does streaming fit into your overall product vision and the needs of your customers?
- What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
- What are the core use cases that you are aiming to support with Astra Streaming?
- Can you describe the architecture and automation of your hosted platform for Pulsar?
  - What are the integration points that you have built to make it work well with Cassandra?
- What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
- What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
- What is the process for someone to adopt and integrate with your Astra Streaming service?
  - How do you handle migrating existing projects, particularly if they are using Kafka currently?
- One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
  - What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
- What are the ways that you are engaging with and supporting the Pulsar community?
  - What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
- What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
- When is Astra the wrong choice?
- What do you have planned for the future of Astra?
Contact Info
- Prabhat
- @prabhatja on Twitter
- prabhatja on GitHub
- Jonathan
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pulsar
- DataStax Astra Streaming
- DataStax Astra DB
- Luna Streaming Distribution
- DataStax
- Cassandra
- Kesque (formerly Kafkaesque)
- Kafka
- RabbitMQ
- Prometheus
- Grafana
- Pulsar Heartbeat
- Pulsar Summit
- Pulsar Summit Presentation on Kafka Connectors
- Replicated
- Chaos Engineering
- Fallout chaos engineering tools
- Jepsen
- Jack VanLightly
- Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey. And today, I'm interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud native streaming platform built on Apache Pulsar. So, Prabhat, can you start by introducing yourself?
[00:01:47] Unknown:
Sure. Hi. This is Prabhat Jha. I'm head of engineering for Astra Streaming here at DataStax.
[00:01:54] Unknown:
I'm Jonathan Ellis. I'm CTO at DataStax, and I work on streaming with Prabhat.
[00:02:00] Unknown:
And, Prabhat, do you remember how you first got introduced to data management?
[00:02:03] Unknown:
I got into data management by accident, actually. After working for Devos for a while, I decided to do a startup, and that startup was related to mobile application performance monitoring. So things like New Relic or Datadog, but for mobile apps, way back in 2011. And when we launched it, we realized that we were getting, like, these metrics and logs from hundreds and thousands of devices every minute. And that's where, like, it really hit me hard that I had to store that kind of data volume and analyze it. And back then in 2011, it was, like, sending that data from mobile SDKs to Amazon SQS and then basically processing that data and storing it, searching, and all this stuff. Right? That's how I got into it way back in 2011.
[00:02:47] Unknown:
And, Jonathan, how about you? We were talking before the show started about, you know, jobs in college and how colleges don't have the budget to hire someone who knows what they're doing, so they hire students. And startups are kind of the same way, and so in 2005, a startup hired me to build an object storage engine, basically like S3, but specialized for backup data. And I was not qualified at all to do this, but they didn't have the budget to hire someone who was qualified, and so I got the job. That's my sales pitch for anyone who's thinking about joining a startup is that, you know, you will get to, you know, tackle challenges that are outside of your comfort zone and outside of what you've, you know, proved on paper that you can do. And that was my entry into kind of the big data space.
[00:03:38] Unknown:
Yeah. I can definitely agree that working at startups is a good way to kind of stretch your skill set because, as you said, you're going to be tasked with things that you never even knew were a problem until you have to try to solve them. And in larger companies, that's, you know, the domain of some senior engineer who has all of the credentials and has all the experience so that they can just kind of bang something out real quick and not have to really do a lot of research and or exploration. And, you know, early in my career as a sysadmin, there are a lot of things that got thrown my way that I had no idea how to tackle, but you just kind of figure out a way through it, and it's definitely a good experience. And so that brings us now to where you are today with Astra platform.
And I know that that's a new product that you're offering at DataStax, where DataStax has historically been a company built around Cassandra, and now you're adding Pulsar to the overall offerings. I wonder if you can just start by giving a bit of an overview about what you're building with the Astra platform and specifically Astra Streaming and some of the story behind how you got to where you are now. The story there is that
[00:04:46] Unknown:
people have been using Cassandra with streaming use cases for almost as long as Cassandra's been around, you know, since it hit 1.0. And in particular, people use it a lot with Apache Kafka. But that's kind of a situation where, you know, they use Kafka because there isn't a better option. In particular, there's a lot of friction with Kafka and Cassandra around the point that Cassandra is well known for being the best database in the industry to use if you need to replicate across multiple data centers, and Kafka doesn't do that. Kafka is a single data center architecture. That, you know, causes a lot of tension because you've got your system of record that is able to do this key architectural facility, whereas you've got your message bus that isn't able to do that. And so about a year ago, I started looking at kind of the streaming space to see, you know, is there another technology that DataStax could invest in that's going to be more synergistic with Cassandra.
And that's when I found that, you know, Pulsar has been around for about 5 years. It was open sourced by Yahoo and donated to the Apache Software Foundation. But around 18 or so months ago, Pulsar passed a maturity inflection point to where you can use Pulsar in production now and expect it to work without having to have, you know, a committer on the team or someone who's digging into the code to figure out what that stack trace means. And Pulsar has a much more modern architecture than Kafka. So it's designed around separate compute and storage. It's designed around multi tenancy. It's designed around handling both PubSub and queuing workloads, which, you know, we can talk about that later. That's interesting.
So here's this thing that does geo replication. It has all these other benefits. And so we acquired a Canadian startup called Kesque that had created a Pulsar-as-a-service offering, last October, and we've turned that into this new product called Astra Streaming that exists alongside our Cassandra as a service
[00:07:16] Unknown:
in the Astra platform as well. And in terms of the streaming use case, what are some of the primary drivers for incorporating streaming with a Cassandra database and some of the overall product vision that you have for incorporating streaming into the DataStax offerings?
[00:07:37] Unknown:
This actually comes back to what I mentioned earlier, that Pulsar does both PubSub and queuing workloads. And so PubSub means that you have topics that you publish messages to, and then you have subscribers that subscribe to those topics. And each subscriber gets or can read a copy of all the messages in that topic. And so it's kind of a fan out of those messages to anyone who's interested in them. Whereas the queuing side is more of, I want to send this message out to a bunch of potential consumers and load balance it across them. I only want it to be processed once. And so both of these are used in microservice based applications where, you know, you don't want to have your microservices calling each other's APIs directly because that takes the micro out of the microservice design. You're basically creating a larger coupled unit out of those services when they're directly coupled. So you decouple those by putting a message bus in between, and so that's where, you know, Astra Streaming comes in is, you know, we have this Astra database that people are using, and it gives them, you know, significant advantages over using a system that only works on one cloud.
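As a toy illustration of the two delivery models being described (plain Python, no Pulsar client; every name here is ours, not Pulsar's): pub/sub fans each message out to every subscription, while queuing load-balances each message to exactly one consumer.

```python
from itertools import cycle

def fan_out(messages, subscribers):
    """Pub/sub semantics: every subscriber receives a copy of every message."""
    inboxes = {name: [] for name in subscribers}
    for msg in messages:
        for name in subscribers:
            inboxes[name].append(msg)
    return inboxes

def shared_queue(messages, consumers):
    """Queuing semantics: each message is delivered to exactly one consumer
    (round-robin here; a real broker balances by consumer availability)."""
    inboxes = {name: [] for name in consumers}
    for msg, name in zip(messages, cycle(consumers)):
        inboxes[name].append(msg)
    return inboxes

events = ["e1", "e2", "e3", "e4"]
print(fan_out(events, ["analytics", "audit"]))     # every subscription sees all events
print(shared_queue(events, ["worker-a", "worker-b"]))  # events split across workers
```

In Pulsar terms, the first function corresponds roughly to multiple exclusive subscriptions on a topic, and the second to a single shared subscription with multiple consumers.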
So Astra is open source. It does hybrid cloud replication. If you want to run it on your own data center and as well as in the public cloud, you can do that. And we have a native cloud service for when you don't want to manage that yourself. So having both of the pieces of the stateful infrastructure that you need to run a microservice application on, the database and the message bus, that's significantly more powerful than just having 1 of those.
[00:09:30] Unknown:
And we were talking to a bunch of customers, and we realized that they had multiple systems installed. Like, for streaming use cases, they could have a Kafka kind of system. For messaging use cases, they would have, you know, an ActiveMQ kind of system or a JMS-based implementation. Right? And, ultimately, even though they were using different messaging systems, the data was being stored in Cassandra. So we were like, what is a platform that can solve these problems in one messaging system so that our customers don't have to install and manage and operate all of these things? Right? It's a lot of complexity to be able to manage these different architectures at scale. Right? We were looking for that kind of system, and that's where the advantage of Pulsar also is, that, you know, it's easy to sort of migrate from your legacy JMS-based implementation and all this stuff into streaming based on Apache Pulsar. You can migrate your RabbitMQ workloads to Pulsar. And, of course, you can do the same thing for Kafka as well. So just as an organization
[00:10:29] Unknown:
where you're already flooded with, like, so many tools and systems and services that you need to build for data warehousing, analytics, and machine learning pipelines, it just helps you consolidate all of that into one. And you mentioned that you had been doing some work with Kafka and that Kafka was sort of the only option for a while, but that there were some pain points. And you mentioned too that you had taken the time to revisit the overall streaming ecosystem. And I'm wondering what your initial criteria were as you were starting that search again to determine what technology do I want to bring in to complement my existing product and some of the potential other options that you were considering before you ultimately decided on Pulsar?
[00:11:14] Unknown:
So we were looking for something that, you know, would fit well with Cassandra, which was kind of our starting point there. And so we were looking for something that's high throughput, that's low latency, that does replication across multiple data centers. I think those are kind of the table stakes. And then when I started looking a little bit closer at Pulsar, I realized that, you know, they'd already designed in, you know, separate compute and storage and multi tenancy, which are super critical when you're going to build a cloud service on top of something. That became part of my criteria as well.
And I'm trying to think, but I don't remember anyone else that does all 5 of those. Yeah. And so, yeah, Pulsar is kind of in a class by itself in that respect.
[00:12:02] Unknown:
Yeah. And also, DataStax has made a big bet on a data-on-Kubernetes kind of architecture. There's lots of work on running Cassandra efficiently on Kubernetes, so we also wanted to make sure that the platform we chose was cloud native. And if you look at the Apache Pulsar architecture, because it was only created in, like, the 2015-16 time frame, Kubernetes was already there. Right? And when we looked into it, it had a turnkey installation process for Kubernetes. So you don't have to, like, you know, reinvent this whole thing from scratch to efficiently run in a cloud-native environment. And on top of what Jonathan says, I think one of the important advantages is read and write path separation.
In different messaging platforms, if there are lots of writes, then the reads are affected. If there are lots of reads, then writes are affected. You know, the volume of data keeps increasing. You want a system that can handle the read and write paths separately. And Jonathan already mentioned that Pulsar has a separate way to do the storage and a separate way to do the compute, and that also basically manifests in the read and write path separation as well. So I could have a subscriber which needs to pull messages that have been in the system for, like, the last one month, right from day one. When this subscriber wakes up and starts pulling the data, it will have almost no effect on the write path of the system. Right? So those were the important platform criteria we had, because our customers who use Cassandra to store millions and millions of records, you know, they need this kind of scale, and we just didn't find anything else that could handle this.
[00:13:40] Unknown:
And so in terms of the actual architecture that you're using for the hosting and management
[00:13:46] Unknown:
of this Astra service. I'm wondering if you can dig into some of the technical capacity that's necessary to be able to operate Pulsar at scale and integrate it nicely with the Cassandra hosting that you've already got available? Pulsar is cloud native. It works out of the box with Kubernetes. It comes with a bunch of Helm charts, so you can get going very quickly. So it has that built into it. But other than at startups, you know, there are not many known use cases of running Pulsar as a service in a multi-cloud environment where, like, multiple people can use the same platform. Right? So it's one thing to have multitenancy built into the platform. But if you're running that platform in your own data center for your own teams, that's a different thing. If you're exposing it outside to the world, all of a sudden the security concerns come to mind. The guardrails that you have to enforce to basically make sure that a noisy neighbor problem is not there. All the stuff that you have to think about when you're launching it as a service in the cloud. That's where, basically, we said, okay, let's look at the components of Pulsar, which of the brokers and bookies and ZooKeepers can be isolated.
But can we run it in a way that if a customer has lots and lots of data, we can still handle them without affecting the other tenants that are using the same platform? Because at cloud scale, we didn't want to be in the business of installing a separate Pulsar cluster for each customer. Then it becomes very, very expensive. So you have to think about resource sharing, security isolation, and all this stuff. At the same time, if one or two of these customers have really, really high volume usage, you want to be able to assign a sort of dedicated set of resources to them. And thanks to Apache Pulsar's cloud-native architecture on Kubernetes, you could do this kind of assignment, where some of these brokers and bookkeepers will be dedicated to, like, this customer kind of thing. Even at that point, you still don't have to launch a separate cluster for them. Right? So that's, like, one aspect of it. The other one was, like, guardrails. So, you know, if you're a developer, sometimes you can accidentally flood the system. So imagine somebody writes a script, and that ends up creating, like, a million topics for the tenant. Like, that would be a very bad thing. Right? So we basically went through the broker configuration, BookKeeper configuration, and ZooKeeper configuration. There are, like, thousands of combinations of things you can do, and you don't know which one's gonna hurt you in the long run. Right? We actually went through the exercise of looking at each configuration one by one: which ones do we need to enforce guardrails on? Right? So, you know, when you sign up, obviously, we don't want you to create, like, a hundred tenants. We don't want you to create a million topics. So how many of these can you create?
Another aspect: as your clients start sending data to the platform, you know, it's possible that you are sending lots and lots of data to our platform, but the subscribers and consumers are not there. So there's a huge backlog. We didn't want it to get to the point where we are, like, storing gigabytes of data for customers just sitting on the platform without getting consumed. Right? So the guardrails around, like, how much backlog storage you should have. All these kinds of things came into building this control plane. And, also, because of Pulsar's multitenancy, if you were running it in your own data center, you would be able to, like, admin the brokers and bookkeepers yourself. But in the hosted service, we are not gonna give you access to the underlying infrastructure because, you know, it's a multi-tenant system. You are not the only one running on it. Right? So here's the interesting exercise we ended up doing. What we have is basically the Pulsar binary plus a control plane, and this control plane is the one that is basically acting as a proxy for all the admin operations and things like that, so that we know what is happening. And because of this, we can monitor usage better and all that stuff.
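As an illustrative sketch of what such guardrails look like at the configuration level, here are a few broker-side settings of the kind being described. The key names follow Pulsar's `broker.conf` conventions, but exact names and availability vary by Pulsar version, and the values are invented for illustration, not Astra's actual limits:

```properties
# Cap how much structure a single tenant can create (illustrative values)
maxNamespacesPerTenant=20
maxTopicsPerNamespace=1000

# Bound per-topic fan-in/fan-out to contain noisy neighbors
maxProducersPerTopic=100
maxConsumersPerTopic=500

# Limit unconsumed backlog, and choose what happens at the limit:
# hold producers rather than accumulating unbounded storage
backlogQuotaDefaultLimitGB=10
backlogQuotaDefaultRetentionPolicy=producer_request_hold
```

In a multi-tenant hosted service these defaults would typically be enforced centrally, with the control plane able to override them per tenant, as described above.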
[00:17:37] Unknown:
In terms of the proxying and being able to filter the admin operations but still be able to expose some of the necessary capabilities to the end user, I know that the open source project has an admin interface available to it, and I'm wondering what your approach was for either adapting the existing tooling or just using that as inspiration for building your own interface to the underlying Pulsar clusters that you're managing for your customers?
[00:18:05] Unknown:
Pulsar has a great ecosystem. I know we talked a lot about the architecture and everything, but Pulsar has a very good ecosystem. You know, there are lots of tools and SDKs built by the community, which have been contributed back to the project. And as part of that, they have also built these admin tools. But the difference was that DataStax already has this Astra DB as a service, which is Cassandra as a service. Right? So when we launched Astra Streaming, and because it's for the same customer, we wanted them to have the same experience of, like, running a database in the cloud. That same experience should translate into running Pulsar in the cloud. So we had to do the authentication model, the authorization model, so that when you are logged into one, you don't have to sign in again. Like, if you are a DB admin on Cassandra, maybe it makes sense to have an admin role on Pulsar as well. Right? So those are the things we had to add on top of it. So it just didn't make sense for us to take what already existed with respect to admin tools and add that, because we had to redo it anyway. So a lot of the work that we have done in Astra Streaming to expose the service has been on the control plane, where we do this authentication, authorization, rate limiting, security enforcement, things like that. So that's why we ended up building ours from scratch.
The other advantage of building our own configuration panel is that if you look at the people who are gonna use the Astra Streaming service, they're gonna have different kinds of skill sets. You could be a very strong Java developer who knows a lot about Pulsar, but it could be somebody who has heard about Pulsar, who has heard about streaming, and is trying to see how it works for them. Those users are gonna require lots of hand holding. So we wanted to build an experience that basically caters to both users. If somebody's coming in, we wanted to give you the fastest way to get started with Pulsar. As a small tenant, for example, you literally sign up, and in less than 10 seconds, you've created a tenant with everything that is needed to get started with Pulsar. Right? I think those experiences we could not deliver without building from scratch. In terms of the underlying operations of the system,
[00:20:10] Unknown:
you know, obviously, at the admin layer, you want to be able to restrict the overall functionality or capabilities of end users to be able to dig into the guts of the system. But were there any other modifications that you had to make to the Pulsar brokers or the bookies to be able to propagate some of those restrictions down through the different layers, because of the way that it was designed and implemented as an open source project that is primarily intended to be executed within the confines of a single organization?
[00:20:40] Unknown:
One was definitely related to configuration. Right? So even though Pulsar has the brokers and bookies, as you mentioned, and ZooKeeper, we didn't want to expose those directly outside. Right? So there are a lot of security rules around that. So for example, Astra Streaming is available on multiple clouds. It runs on AWS, GCP, Azure, but the control plane is only on one cloud, because you need to have a central control plane. So now I have a control plane running on one cloud, and this Pulsar cluster is running on multiple clouds. So we wanted, for example, the control plane to be able to talk to the data plane on different clouds securely, but without the end user being able to do the same thing. So all these networking securities and firewall rules that you place between the Kubernetes clusters running on different clouds and all this stuff. Like, those were not easy to solve. Like, we had to go through the guts of it and everything. Right? With respect to the underlying broker and bookkeeper and all this stuff, I think a lot of it worked out of the box for us. The issue was more around, like, fixing the problems when it's under high load. Right? And when we launched, we didn't know how many customers we were gonna get, what kind of workloads we were gonna get. Right? We were, like, testing for tens, hundreds, and thousands of customers using this platform. As we were, like, flood testing the system with millions and millions of requests happening, like, per minute and those kinds of things, we found some edge cases, like memory management, garbage collection, and all this stuff that you just normally would not find if you're not a heavy user of the system. So we ended up fixing those problems and contributing back to the community. And, Jonathan, you wanna add some more to this? The other example I can think of is around the Cassandra
[00:22:21] Unknown:
Pulsar integration that we wanted to do. So Cassandra, you know, it's a NoSQL database. It's categorized in that space, but it is a NoSQL database with the concept of a schema. So in Cassandra, your data is typed. And in Pulsar, you also have the concept of a schema, and you can evolve the schema, and you can have messages that use, you know, an earlier version of the schema. You can have, you know, forwards compatibility with those. And basically, there were improvements we needed to make in how you could update a Pulsar schema dynamically and inspect it dynamically, to make that going back and forth between Cassandra and Pulsar seamless. Because one of the most common things you want to do is you want to take a Pulsar topic of events and, you know, dump those into a Cassandra table with, you know, some light filtering or maybe some light transformation.
But fundamentally, for every event in the topic, you want to generate, you know, zero or one rows in the table, or possibly multiple tables. We support that as well. And going the other way around as well, you wanna take a changelog of changes to a Cassandra table and put that onto a Pulsar topic. So we made some improvements around Pulsar schema handling. We contributed those to the Apache project, and they landed in the 2.8 release, which came out last month.
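A minimal sketch of the topic-to-table direction described here (plain Python; the event fields, table name, and helper names are all hypothetical, and a real sink would use a Cassandra driver with prepared statements rather than assembling CQL text):

```python
def event_to_row(event):
    """Map one topic event to zero-or-one rows: drop events without a
    user_id (light filtering), then project/rename fields to match the
    hypothetical table schema (light transformation)."""
    if not event.get("user_id"):
        return None  # zero rows for this event
    return {
        "user_id": event["user_id"],
        "event_time": event["ts"],
        "action": event.get("action", "unknown"),
    }

def to_cql(row, table="app.user_events"):
    """Render a row dict as a parameterized INSERT plus its bind values."""
    cols = ", ".join(row)
    placeholders = ", ".join("%s" for _ in row)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", tuple(row.values())

event = {"user_id": "u42", "ts": "2021-07-01T12:00:00Z", "action": "click"}
row = event_to_row(event)
print(to_cql(row))
```

The schema-handling improvements mentioned above matter precisely here: if the topic's schema evolves, the mapping layer has to be able to inspect the current schema at runtime rather than hard-coding field names as this sketch does.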
[00:23:49] Unknown:
In terms of some of the other tooling and sort of monitoring and logging aspects of running Pulsar as a service in a multi tenant high scale environment, what were some of the extra bits and pieces that you had to engineer to sort of scaffold the overall Pulsar environment that you're running and be able to ensure that your mitigations for, you know, noisy neighbors are effective and be able to identify users who are coming up against their quota limits and be able to offer them additional scale and capacity when they need it and to some of the other operational
[00:24:24] Unknown:
and tooling aspects of being able to provide a robust end user service. When you launch these services, as I said, Pulsar comes with a cloud-native setup itself. So when you install Pulsar out of the box from the community, it comes with a bunch of Helm charts, which include charts for, like, Prometheus, Grafana, Alertmanager, everything built in. So you're like, hey, I have everything. Why do we need anything else? Well, those only work with one cluster and not at this scale. So that's number one. Number two was that we needed to do true monitoring of the system. So what we have built is this open source project called Pulsar Heartbeat.
It basically sits outside the Pulsar cluster, and it's sending a synthetic workload, like, every second just to make sure that the latency across the brokers, the bookkeepers, and everything is working fine. So it actually sends a message on a topic. It creates topics on the fly, it creates consumers on the fly, and measures the end to end latency. Right? So that's how we know which cluster on which cloud is working well or not, for which tenant and all this stuff. As new tenants get added to the system, as new topics get created, as I mentioned earlier, we have enforced a bunch of guardrails around it. So if somebody is getting closer to the limits that we have enforced in the system, obviously, we wanted them to know. But then we also had to build a mechanism so that we can override those configurations for selected customers. Right? So those are the additional things we ended up adding into our control plane. We didn't have to modify the underlying Pulsar for these kinds of things because we have a control plane in front of it. Right? And that's also the beauty of the Apache Pulsar open source project. It's a very, very stable project. What we realized is that there were some enhancements needed on the security aspect of access control for the guardrails. But the underlying BookKeeper, the deployment architecture, the way everything works in a cloud native environment, is fairly good with Pulsar. So we didn't have to invest a lot in that. Of course, there are some significant differences when you're running Kubernetes on Amazon versus Google Cloud versus Azure, and we had to learn some of those. 1 of the features in Pulsar, for example, is that it comes with tiered storage by default. What tiered storage does is that you will have hot and live data in memory, but then you can offload your historical data to AWS S3 or GCP storage or things like that.
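As a rough sketch of the Pulsar Heartbeat idea described above, the probe below publishes a timestamped synthetic message and measures the end-to-end latency on the consume side. A `queue.Queue` stands in for a real Pulsar topic here; the actual tool drives real producers and consumers across brokers and bookies, per cluster and per tenant.

```python
# Minimal sketch of what a synthetic-workload monitor like Pulsar Heartbeat
# does conceptually: publish a timestamped probe message, consume it, and
# measure end-to-end latency. An in-process queue.Queue stands in for a
# real Pulsar topic; this is an illustration, not the real tool's code.
import queue
import time

def probe_latency(topic: "queue.Queue", timeout: float = 1.0) -> float:
    """Send 1 synthetic message through the topic and return round-trip seconds."""
    sent_at = time.monotonic()
    topic.put(("heartbeat", sent_at))        # producer side
    _, stamped = topic.get(timeout=timeout)  # consumer side
    return time.monotonic() - stamped

topic = queue.Queue()
latency = probe_latency(topic)
```

In the real service, a latency spike or a timeout on any cluster flags that cluster for investigation before customers notice.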
So that was super important to us because now we don't have to buy lots of expensive storage for our customers. What we have enforced, for example, is a policy that anytime the backlog reaches a certain threshold with respect to time and size, it automatically goes to the S3 bucket. Right? As a vendor, our cost to support a customer is minimal because we are not storing all of this data on expensive disks. This allows us to scale very efficiently because we all know that Amazon S3 and GCP storage scale very efficiently. So you can have a huge amount of data coming in, but we are not paying that much extra for it. We turn it on by default for you, and that is another advantage of doing that. As a customer, you know that you're not gonna lose data. We have tiered storage on, so even if your subscriber wakes up 1 week later, or you decide to add a new application that needs historical data, there's no data deletion. Everything is stored over there by default. Since Prabhat mentioned Pulsar Heartbeat, this is a good time to point out that as part of developing Astra Streaming,
[00:27:58] Unknown:
we've open sourced as much as possible of the tooling we've created around Pulsar to run this service. So Pulsar Heartbeat is open source. Our admin console is open source, and our Helm charts are open source. And not only did we say, hey, it's open source, good luck. We've actually bundled them together into a distribution of Pulsar that you can run on your own infrastructure, if that's what you want to do, with the included Heartbeat, with the included admin console, and so forth. Yeah. And there's hardly anything in our stack
[00:28:34] Unknown:
that is not open source, that we have not given back to the community. That's the strategic approach we have taken. We are not just thinking of open source as a dumping ground for things that we have done in secret. All of the things that Jonathan mentioned actually have been open source from day 1. And if you want to find that distribution of Pulsar, it's called Luna Streaming. So Astra Streaming is the service. Luna Streaming is run it yourself.
[00:28:58] Unknown:
And then in terms of some of the other sharp edges, you've mentioned a few different edge cases that you've run up against as far as being able to handle multitenancy and being able to scale it, you know, some of the bugs that you found. But what were some of the sharp edges that you've had to kind of sand down as you started to scale up and add in more customers and start to have to support such a wide variety of different workloads that, obviously, you weren't gonna be able to come up with on your own because customers always do strange and wonderful things.
[00:29:30] Unknown:
So when I started looking into Pulsar, I was talking to 1 of my friends who knows about both Kafka and Pulsar. And he described Pulsar as a beautifully architected system. It's designed by great system architects, but, in his view, the people who actually programmed it were not as great as programmers. And what he meant by that is that if you look at the APIs and all this stuff, it's not that intuitive. It's like, okay, this API works this way. This other API doesn't work that way. Right? And I'm a huge freak when it comes to the developer experience of a platform. So we kept running into those gotchas when trying to administer Pulsar using the API, and we wanted that to be consistent. We ended up making lots of fixes around that. 1 example is that in Astra Streaming, as well as in Apache Pulsar, you are able to connect to different destinations. So you can send data to a Pulsar topic, and it will go to Elasticsearch. It can go to Cassandra. Right? It can go to the various systems that exist out there. So 1 trivial example is that when you upload the Elasticsearch configuration, if your configuration is not good, you don't know until the data comes to the topic and it tries to connect to Elasticsearch. Right? Which is not a great user experience. If the configuration is not correct, you wanna know right away. Right? So those are the things we ended up fixing in our streaming service, and upstream, I think, as well, where this user experience of coming to the platform and being able to get started quickly with the infrastructure you already have, whether it's Elasticsearch or different kinds of systems, that's where we also ran into some issues that we ended up fixing. At a higher level,
[00:31:09] Unknown:
we went to market, we launched the Astra Streaming beta, and we said, hey, come build applications with Pulsar. And in several respects, the Pulsar API is a step forward from things like the Kafka API, where Pulsar builds in asynchronous publish and subscribe events. But nobody wants to take an application that's working against Kafka, or working against the JMS API, and rewrite it to use Pulsar. So it's fine for, hey, I'm building something new and I'll start with Pulsar. But it's a tougher sell for, hey, you should rewrite your code so that you can use our new service. So 1 of the things that we've been investing in is improving Pulsar's Kafka compatibility.
So there's a Java client for Pulsar that implements the Kafka API, and we've contributed some improvements to that. But we have also spent a lot of effort making it so that you can just take a Kafka connector, say a connector to Elasticsearch or a connector to MongoDB, to send data back and forth between those systems without writing any code. You can take any of those 120 plus connectors that exist for Kafka and drop them into Pulsar, and it's able to use them to talk to any of those systems. That was also released in Pulsar 2.8, so it's something that we're fairly proud of, driven by our experience with customers, that we're giving back to Apache.
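The compatibility approach described above is roughly a shim: keep the Kafka-shaped surface that application code already calls, and delegate to Pulsar underneath. The sketch below is purely illustrative (the class and method names are invented, not the real adapter's API), but it shows the shape of the translation.

```python
# Hypothetical sketch of the shim idea behind Kafka API compatibility:
# expose a Kafka-style send(topic, value) while delegating to a
# Pulsar-style per-topic producer, so existing application code does not
# have to be rewritten. Names here are illustrative only.

class PulsarStyleProducer:
    """Stand-in for a Pulsar producer: send() takes bytes, returns a message id."""
    def __init__(self):
        self.sent = []

    def send(self, payload: bytes) -> int:
        self.sent.append(payload)
        return len(self.sent) - 1

class KafkaStyleShim:
    """Kafka-flavored interface (topic + value) delegating to Pulsar-style producers."""
    def __init__(self):
        self._producers = {}

    def send(self, topic: str, value: bytes) -> int:
        # Lazily create 1 underlying producer per topic, as a client would.
        producer = self._producers.setdefault(topic, PulsarStyleProducer())
        return producer.send(value)

shim = KafkaStyleShim()
msg_id = shim.send("orders", b'{"sku": 1}')
```

The hard part in practice is the impedance mismatches (offsets versus message ids, partitioning semantics, and so on), which is what the contributed work addresses.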
[00:32:53] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Yeah. That was gonna be 1 of my next questions: the migration story for people who have an existing Kafka workload, because I know that there are a few different compatibility layers. There's the Java API that you mentioned. I also know that the folks at StreamNative have added a protocol handler to Pulsar for being able to speak the Kafka protocol for clients that aren't using Java as their implementation language. And then there's also the Pulsar IO set of libraries to kind of mimic what the Kafka Connect project was doing, but native to Pulsar. So it's definitely interesting to hear that you can run existing Kafka Connect libraries natively with Pulsar as well. I'm wondering if you can maybe spend a bit more time talking about some of the technical challenges and implementation details that you've had to work through to manage those migration paths for people who were already using Kafka and want to migrate to Pulsar?
[00:34:23] Unknown:
That's a little deeper than I can go. We can definitely introduce you to the engineers who wrote that code to talk about it. But actually, Andrey Yegorov, who worked on the Kafka Connect piece of things, gave a talk at Pulsar Summit about that. And he does go into, like, here are the impedance mismatches that I ran into and had to find a way around.
[00:34:46] Unknown:
As you're building out the Luna distribution specifically, maybe some of the considerations that you have for what to include, and the feature comparisons for people who decide to choose the Luna distribution versus the Apache open source version or the StreamNative distribution, or if there are any other distributions out in the ecosystem that you're tracking for being able to help people with that selection process?
[00:35:15] Unknown:
Yeah. Our goal with Luna Streaming is to give people a batteries included distribution, to borrow a phrase from my friends in the Python community. That's 1 part of Python's philosophy: when you download Python, you don't need to download a whole bunch of other stuff. Python has batteries included, and you can be productive with what you just downloaded. And that's our goal with Luna Streaming as well. That's why we've included the improved monitoring tools, the improved admin tools, the improved Helm charts. 1 thing that's unique about Luna Streaming is it smooths the on ramp for people who haven't already bought into the Kubernetes world. Kubernetes is the foundation of both our streaming and our database technology, and I think everyone recognizes that Kubernetes is the future of operations, but it isn't the present everywhere.
And so if we are saying, hey, to use Pulsar, you need to spend a week installing Kubernetes first, that's a big obstacle to some people. So we partnered with a company called Replicated to give you basically Kubernetes in a box as part of Luna Streaming. If you point it at a cluster of 3 machines or 9 machines, it will lay down a minimal Kubernetes environment on those machines or on those VMs
[00:36:40] Unknown:
and then install Pulsar on those for you. Yeah. And I will add that it's something similar for Astra Streaming. I know I mentioned a lot about guardrails and enforcing controls and everything. What we didn't want to compromise is that, underneath, what you are getting is Apache Pulsar, so that all the other tooling and frameworks and SDKs that exist in open source, built by the community or by different vendors, should work out of the box. So when you sign up for Astra Streaming and create a tenant for yourself, that is no different than creating a tenant for yourself on your locally running Pulsar. So the existing tools, like the Pulsar CLI, the performance tool, and the ecosystem of connectors and everything, those things will work out of the box for you. Obviously, with Astra Streaming, we wanna make sure that the connectors we support are tested heavily and work out of the box, so we can give you peace of mind with respect to how they work. But if you have a connector that works with open source, it will work with Astra Streaming out of the box. You don't have to do any modification.
I'd say with respect to licensing as well, there's no restriction. Everything is Apache licensed. There's no, like, open core and this and that. All this stuff that you see here, the enterprise peace of mind that you're getting, is with pure, truly open source software.
[00:37:58] Unknown:
The other thing that we do with Luna Streaming, let me characterize it this way: if you want the newest, hottest features in Pulsar, then you should get the official Apache Pulsar distribution. 1 of the things that we're trying to do with Luna Streaming is give people something that moves a little bit slower. And in exchange for that slowness, we do more testing, and we back port fixes as necessary. So when we have Luna Streaming 2.6.2, we pulled in some fixes that would appear later in Apache Pulsar 2.6.3, and there wasn't a Luna Streaming 2.6.0.
[00:38:38] Unknown:
Like, we waited until it was as stable as possible and then released that. In the enterprise market that DataStax serves, that's what they're optimizing for: more the peace of mind than the latest and greatest. And especially on security and stuff. Right? As you could tell, our customers are big enterprises, Fortune 500 companies and so on, and they have their own security processes. So lots of those kinds of findings, whatever they end up finding, we end up fixing early in Luna Streaming, as well as contributing back to the community. So there's a sort of ecosystem thing going on between Luna and open source Apache Pulsar.
[00:39:20] Unknown:
And another element of the Pulsar ecosystem and the Pulsar project that we haven't touched on yet is the Pulsar functions capabilities for being able to do ad hoc event driven execution of arbitrary code on the different messages within the topics. And I'm wondering if you can just talk through some of the tooling and user experience enhancements that you've been able to build in around how you manage the packaging and deployment of code to be run-in these functions environments and just some of the operational challenges of executing arbitrary user code in your managed service.
[00:39:58] Unknown:
Pulsar Functions, for people who don't know, let you write your own custom code in supported languages, which are Python, Java, and Golang as well. I'm not sure how good the Golang support is, but definitely Python and Java, and we test those quite a bit. So the idea is that you can push an arbitrary piece of code, and that will act on each message on a topic. That piece of code is usually for, like, message transformation or validation, or you wanna enrich the record in the topic with something else. You can do all this stuff. So Pulsar Functions is the underlying platform powering the Pulsar connector ecosystem. When we run a Cassandra connector, when we run an Elasticsearch connector, the underlying mechanism is actually Pulsar Functions. Right? So it is a first class citizen. Right? The question is, when you have a hosted platform, a managed platform, and you don't know the developers who are pushing code to it, how do you enforce security and guardrails around it? Right? That's a very, very difficult thing to do at scale, obviously, as you could tell. So in Astra Streaming right now, we don't allow you to upload custom code. We are working on that path where it will be available, and it's not that difficult either. Because we run Pulsar on a Kubernetes cluster, we can tie Pulsar functions to a pod in Kubernetes. Right? So what we are working on is that you upload your code, and it runs in 1 of the pods that is assigned to your namespace.
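For reference, a Pulsar Function is just per-message code. The real Python Functions SDK has you subclass `pulsar.Function` and implement `process(self, input, context)`; the plain class below mirrors that shape so the transformation logic is runnable here without a cluster, and the transformation itself is a made-up example.

```python
# Sketch of a Pulsar Function doing a light per-message transformation.
# The actual Python Functions SDK interface is pulsar.Function with
# process(self, input, context); this plain class mirrors that shape so
# the logic can run without a Pulsar installation.

class UppercaseFunction:
    def process(self, input, context=None):
        # Act on each message on the topic: here, normalize the payload
        # to uppercase. Real functions might validate or enrich instead.
        return input.upper()

fn = UppercaseFunction()
out = fn.process("order created")
```

Whatever `process` returns is published to the function's output topic, which is what makes chaining functions together natural.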
So you can do whatever you wanna do. If you wanna hurt yourself, you're only gonna hurt that pod. Right? The cloud native aspect of the architecture makes it relatively easier for us to support this. But having said that, we still have not done it because we wanna make sure of the networking aspect of it. We don't want anybody to write code that can sniff around the records of somebody else's topic and all this stuff. So we're still working on that, but I don't think we are very far from it, because of the underlying cloud native architecture and the way it runs in Kubernetes. Yeah. We will never run, for example, these Pulsar functions in the same JVM process itself. Apache Pulsar allows you to configure where this runs, whether it's in the same process, or on the same node, or in a different pod. We are not gonna do the same process or same node. It's gonna be the pod approach, where we can enforce some security around it and make sure that it works.
[00:42:37] Unknown:
And then 1 of the other pieces that you call out in the marketing material on the Luna homepage is the ability to perform machine learning on streaming data. And I'm wondering if you can talk through some of the use cases for that style of application, and some of the supporting systems that are necessary to be able to power that execution, and the capabilities that Pulsar has natively that make it a tractable problem for you.
[00:42:49] Unknown:
Yeah. When we talk to lots of our customers, and from my own personal experience at my last startup, 1 of the use cases of a queuing and messaging system was to build this machine learning pipeline. Right? Where you have a bunch of data coming in, you wanna update your model, you wanna serve the model, and all this stuff. In Pulsar itself, through Pulsar functions, you can run Python code and Java code, as we talked about. Right? Some of our customers, when they use Apache Pulsar for machine learning, are using Pulsar functions with TensorFlow if they're a Python shop. There are a couple of examples of customers using Deeplearning4j in Pulsar functions if they are, like, a Java shop. Right? So when I think about machine learning with respect to a messaging system, you have a bunch of data coming in, and you need to create a model. That means that each record in that topic likely needs to be cleaned.
Then somehow it will be enriched with some other data. So imagine a record comes in. You have some information about a customer or event in that payload that needs to be enriched with something in Cassandra. And with Pulsar, you're not talking to 3 different systems. You can chain different Pulsar functions, 1 step after another. So imagine a chain where 1 function does the transformation, the second does the enrichment, and the third 1 actually sends the data to your system, which is updating the machine learning model on the fly. Right? So we don't have a fixed way of saying this is how you should do machine learning. But if you look at the least common denominator of a machine learning pipeline, it is cleaning the data in different steps, and Pulsar functions are appropriate for chaining those steps. You can create this ML pipeline easily, and you don't have to maintain yet another piece of infrastructure.
And because you are running the Pulsar function in Pulsar itself, it's always up to date with the latest data. You can enforce schemas on it. Right? So if there's schema drift on the topic, your Pulsar function will not work, and you would know about that issue right away. Right? Those are the things in the Pulsar architecture that help you do this. Obviously, when it comes to serious machine learning use cases, where heavy model serving is involved and everything, you probably would not use Pulsar functions for that, because it's a thing in itself. In that case, your topic and Pulsar function would connect to that deployment somehow and do the message flow and everything. But for a lightweight machine learning pipeline and stuff, I think you can easily chain functions together.
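The clean, enrich, and emit chaining described above can be sketched in-process like this. In Pulsar, each stage would be its own function consuming the previous stage's output topic; here the Cassandra lookup is faked with a dict, and all names and values are invented for illustration.

```python
# Sketch of chaining per-record stages, 1 step after another, the way
# Pulsar functions would be wired topic-to-topic. Composed in-process
# here purely to show the data flow.

def clean(record: dict) -> dict:
    """Stage 1: drop empty fields from the incoming record."""
    return {k: v for k, v in record.items() if v is not None}

def enrich(record: dict) -> dict:
    """Stage 2: enrich with reference data; a dict stands in for Cassandra."""
    fake_cassandra = {7: {"tier": "gold"}}
    return {**record, **fake_cassandra.get(record.get("customer_id"), {})}

def to_feature(record: dict) -> dict:
    """Stage 3: hand the record to whatever updates the model downstream."""
    return {"features": record}

pipeline = [clean, enrich, to_feature]
event = {"customer_id": 7, "amount": 12.5, "junk": None}
for stage in pipeline:
    event = stage(event)
```

Because each stage is an independent function on its own topic, a failure or schema mismatch surfaces at the offending step rather than deep inside a monolithic job.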
[00:45:10] Unknown:
And then going back to the overall ecosystem question, 1 of the other interesting pieces of having companies like DataStax and StreamNative and some of the other ones out there using Pulsar in production at scale and investing in it is that it acts as an accelerator for the Apache project itself. And so I'm wondering, what are some of the elements of the road map that you are most excited for, that you're working towards, and some of the overall potential that you see for the Pulsar project now that there is much more momentum behind it over the past year or 2?
[00:45:55] Unknown:
This is another thing we have a Pulsar Summit talk about that people can refer to. It's an excellent half hour talk on applying chaos engineering to Pulsar. 1 of the things that we're focusing on is reducing the time between, hey, there's a dot 0 release, and, it's stable, we can build a Luna Streaming release on top of this. We're leveraging a set of tools that we built for testing Cassandra called Fallout. We open sourced those a couple years ago, and now we're applying them to Pulsar as well. Basically, the idea with Fallout is that you can declaratively specify scenarios that you want Fallout to apply, either sequentially or concurrently, to a distributed system.
So in the Pulsar space, we can be adding BookKeeper storage nodes at the same time a ZooKeeper node fails, at the same time as we're adding a new tenant and a bunch of topics, at the same time as an existing tenant is sending a 100,000 messages per second through the system. What happens when you start composing these scenarios together, and can you find any misbehaviors that you need to fix?
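The declarative idea can be sketched as scenarios-as-data plus a small runner. The step vocabulary below is invented for illustration, not Fallout's actual DSL, and the "system" is just a dict recording what was applied rather than a real cluster.

```python
# Hypothetical sketch of declaratively composed chaos scenarios in the
# spirit of Fallout: each scenario is plain data, and a tiny runner applies
# the steps to a system under test. The real tool drives actual clusters
# and can interleave steps concurrently; this only shows the shape.

scenario = [
    {"action": "add_node", "role": "bookkeeper"},
    {"action": "kill_node", "role": "zookeeper"},
    {"action": "add_tenant", "topics": 50},
    {"action": "load", "messages_per_second": 100_000},
]

def run(scenario: list, system: dict) -> dict:
    """Apply each declared step to the system under test, in order."""
    for step in scenario:
        system.setdefault("applied", []).append(step["action"])
    return system

system = run(scenario, {})
```

Keeping scenarios as data is what lets non-Clojure users compose new fault combinations without touching the underlying Jepsen machinery.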
[00:47:07] Unknown:
And that puts me in mind of the Jepsen project with Kyle Kingsbury. I'm wondering if you or anyone else in the community has engaged with him to do some of that type of stress testing and sort of ferreting out the distributed systems design flaws that exist in Pulsar to be able to address them?
[00:47:25] Unknown:
2 answers to that. So Fallout is built on top of Jepsen. Jepsen is a Clojure tool, and to create new scenarios, you have to write Clojure. And so we said, you know, we love Clojure, but it's not a super common skill set. So we created a domain specific language that allows you to build these scenarios without having to go all the way down to Clojure. The other thing that Fallout does is it gives you basically a CI service for running these, so you can set it up and it will generate reports and performance graphs and all those things from a test run that you'll want to investigate later.
The other answer that I think is super interesting here is there is an engineer named Jack Vanlightly who was hired by Splunk relatively recently. And he's applying formal modeling techniques to Pulsar and to BookKeeper. So Fallout and Jepsen will allow you to run a whole bunch of different scenarios and see what happens, but you can't prove that there's not a scenario that you didn't happen to run. What formal modeling does is it lets you build a theoretical representation of, for instance, the BookKeeper storage protocol.
You can test it at a much lower level. It's also not a proof technique. It's not gonna prove that the BookKeeper protocol is correct. But instead of testing thousands of scenarios in a week, I can test millions of scenarios in an hour. And so it lets me be much more exhaustive about my confidence that the BookKeeper protocol is correct. As part of the work that he did, he actually did find a bug in the protocol several months ago, and we were able to fix that as a community. So I think the combination of formal tools that are more commonly coming out of academia, and chaos engineering coming out of the very practical side of companies like Netflix, is a very powerful combination for improving the correctness of distributed systems.
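The difference from scenario testing is exhaustiveness: a model checker enumerates every reachable state of a specification and checks an invariant in all of them, rather than sampling runs. The toy below does that for an invented 2-step ack protocol; it illustrates the idea only, and is not BookKeeper's actual replication protocol or a real TLA+ model.

```python
# Toy illustration of exhaustive state exploration, the idea behind model
# checking a protocol spec: enumerate every reachable state and assert an
# invariant in all of them. The protocol here is invented for illustration.

from itertools import product

def reachable_states():
    """Enumerate states (message_sent, ack_received) where acks follow sends."""
    states = []
    for sent, acked in product([False, True], repeat=2):
        if acked and not sent:
            continue  # unreachable: you cannot ack an unsent message
        states.append((sent, acked))
    return states

# Invariant: in every reachable state, an ack implies the message was sent.
violations = [s for s in reachable_states() if s[1] and not s[0]]
```

A real checker like TLC does the same thing over millions of states of a genuine protocol spec, which is how subtle bugs that no sampled scenario happened to hit can still be found.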
[00:49:34] Unknown:
In terms of the ways that your customers are using the Astra platform, what are some of the most interesting or innovative or unexpected use cases that they've employed?
[00:49:45] Unknown:
I think the generic answer to most of those is they're doing crazy stuff with Pulsar functions. You can do arbitrarily complex things without having to stand up and manage a separate cluster, because Pulsar is managing that for you, or it's incorporated into the same Kubernetes platform as Pulsar. 1 company, a financial services company, is actually using Pulsar functions for all the business logic for the transactions flowing through their system. And I guess 1 of the reasons that's so cool to me is that when you're a fintech company, you're super, super risk averse. And so these guys did all kinds of homework to make sure that all the failure scenarios were spoken for, and then they went ahead and put it in production. So that would be my favorite example.
[00:50:38] Unknown:
Oh, 1 funny example, actually, and this is not a real event streaming example. When we first announced that we were gonna do Astra Streaming, and that it is an event streaming platform, a few people actually understood it as a literal video streaming platform, where, like, you have a video camera streaming events and other stuff. So when we launched the alpha version, we asked, hey, what's your use case? And somebody was like, I'm excited to try this platform to see how I can do live video streaming. We were like, uh-oh, this is not the streaming we are talking about here. That was a funny thing that we ran into as well.
[00:51:13] Unknown:
The joys of the challenge of naming things and the collisions of meaning that occur in the English language.
[00:51:19] Unknown:
Yeah. By the way, if anybody has any recommendation on what to call it, please let us know. We have tried data streaming. That gets confusing with something else. We have tried event streaming. Obviously, as I just said, people thought that was something different. So we still don't know what is the term that describes what we do here.
[00:51:37] Unknown:
I don't think that there's a succinct way to put it, but there are definitely more verbose ways that you could add clarity.
[00:51:43] Unknown:
Exactly.
[00:51:44] Unknown:
Alright. And then in terms of your own experience of building the Astra platform and onboarding customers and being able to support it at scale, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:58] Unknown:
I think just going through the guardrail process to figure out what we would allow a regular user to do so that they are not hurting themselves and they're not hurting us. It is a pay as you go system. So when you sign up for Astra Streaming, you are not thinking about how many brokers and bookkeepers you need. We just charge based on data in and data out. At the same time, we wanted to ensure that at the end of the month, you're not surprised by a huge bill because a rogue process was sending a bunch of data and all this stuff. Right? So just thinking through what is the right guardrail, because everybody's use case is different, and how can we make this platform so that we can turn on these things easily for different customers. I think that was the most difficult part, because different configurations interact with each other differently. So you have to make sure that the combinations and the defaults that we give are correct. That was definitely a challenge. And with respect to user experience, actually, we are still learning, because it's only been a few months since we launched streaming as a platform as a service here. The people who are coming to our platform have different kinds of experience. Some have been using Kafka for a while, and they have heard that this Pulsar is a new thing that really works well, so they're coming from that background. Sometimes people have used Pulsar for a year, and they want to see if they can have a managed offering because they don't wanna do it themselves.
And then you have a completely new set of app developers who are building mobile apps and web applications using GraphQL and whatnot, and their understanding of a streaming system is very different. Right? So how do you build a user experience where obvious things are obvious, but the complicated things are also there, in a way that caters to both of them? That has been a difficult balance for us to get right. You know, I don't think we have cracked the code yet on what is the fine line between hand holding a brand new user to the system versus an expert 1. We're still working on that, and our design team does a bunch of interviews and everything to make sure that we get it right.
[00:53:56] Unknown:
If I'm looking forward to, like, what the next challenges are, we touched on 1 of those around allowing arbitrary user code to run in Pulsar functions and how we plan to tackle that. Looking ahead longer term, I think it's gonna be interesting to tackle streaming analytics as well, on top of the raw foundation that Pulsar gives you. And you can do that kind of by hand with Pulsar functions, but it's much more involved doing it that way versus using something like Flink and Flink SQL.
[00:54:25] Unknown:
So do we provide Flink as a service as well? That's something that we're gonna be looking at. Yeah. And along the same lines, you know, I've heard a few episodes about change data capture, and this is a beast in itself. If you have a Postgres or MySQL kind of database, where everything is happening on 1 node and you have all the logs that you need to replicate to another system, it gets easier. But when you have a system like Cassandra, a NoSQL database running on multiple nodes with no master concept, that means you need to reconcile this changelog into 1 place before you send it to Elasticsearch or a different kind of system. Getting that turnkey out of the box, well, we are a Cassandra company, so we should know how to do this better than anyone. Right? So we're working on that, fine tuning that experience so that if you need to do CDC on Cassandra, Astra is the best platform. That's another thing where we already have an offering, but we're constantly fine tuning it as well. That has been a super interesting project for us here. For people who are interested in exploring Pulsar and want to be able to get it up and running fairly quickly, what are the cases where Astra might be the wrong choice?
[00:55:33] Unknown:
Man, for people who wanna get up and running with Pulsar quickly, you know, Astra is the right choice, like, for everyone. Well, I guess if you need Pulsar functions, as of July 2021 we don't have that yet, and so that would not be a good choice.
[00:55:47] Unknown:
Yeah. That'll be one reason. As I said, you know, the Pulsar ecosystem has 100-plus source connectors, sink connectors, and all this stuff. Right? We're not sure that the quality of all of those connectors is really up to the standard. So we only support a limited set of connectors based on what our customers have asked for. So if you need to connect to an Oracle database for whatever reason, you won't be able to do that in our system. We don't have that. Those are the cases, I would say: if we don't have the features that you need, then of course you're not gonna use Astra Streaming for that. But if you know that you need Pulsar, I think that Astra Streaming is the fastest way to get started, for sure.
[00:56:27] Unknown:
You've mentioned a few things already about what you have planned for some of the upcoming road map, but what are some of the other elements that you're excited to work towards in the near to medium term, both for the Astra project and also more broadly for the Pulsar ecosystem?
[00:56:42] Unknown:
So as a service provider, we care a lot about both the day 1 and the day 2 experience. Of course, I talked a lot about this being the fastest way to get started, but we also wanna make sure that we give you the best service for day 2 operations with respect to monitoring, alerting, failover capabilities, making sure that everything just works out of the box. I think we're just starting on that journey. We give you Apache Pulsar as a service because we wanted to consolidate your Kafka, messaging, and queuing solutions all into one. But then when you start using it, we don't want to have you do 10 more different infrastructure things to make sure this is working properly. So those are the day 2 things. Right? You will see us working a lot on those so that when you come to the platform, it is truly a good service. That's one from the usability perspective.
[00:57:30] Unknown:
The CDC work, as you mentioned, making it turnkey with the cloud database that we have, is another one. And then also having a laundry list of supported connectors that work out of the box. Alright. Well, for anybody who wants to get in touch and follow along with you and try out Astra Streaming, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:58] Unknown:
I've been thinking about that since last night, when I saw the questions. Building the data pipeline, the machine learning pipeline, is still very complex. Right? You need to know the data source, the enrichment and transformation, how to connect to the different systems, building the machine learning model on top of that. It still requires a lot of hand-holding and gluing things together. I know that there are vendors trying to consolidate it. But when it comes to the open source ecosystem, where you can connect these streaming systems and the databases and machine learning models and all this stuff, I think that is still a big challenge. I would say broadly across the industry,
[00:58:37] Unknown:
a big gap is around infrastructure that supports seamless deployment across multiple regions and multiple data centers. We're trying to close that gap on the Cassandra side and on the Pulsar side, but there's still everything else, and all the other stateful infrastructure
[00:58:57] Unknown:
that, by and large, is just starting to tackle that. Well, thank you both very much for taking the time today to join me and share the work that you're doing on Astra and contributing to the Pulsar ecosystem. It's definitely a very exciting project and one that I'll be keeping a close eye on and probably play around with a bit myself. So thank you for all of the time and effort you've put into that, and I hope you each enjoy the rest of your day. Thanks so much. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introductions and Backgrounds
Overview of Astra Streaming and DataStax Offerings
Streaming Use Cases and Integration with Cassandra
Technical Architecture and Challenges of Pulsar
Admin Interface and Security Considerations
Monitoring, Logging, and Operational Tools
Kafka Compatibility and Migration Paths
Luna Streaming Distribution and Features
Pulsar Functions and Machine Learning Integration
Future Roadmap and Ecosystem Development
Customer Use Cases and Experiences
Day 2 Operations and Future Plans
Biggest Gaps in Data Management Tools