Summary
One of the critical components of modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss their work on Pulsar, which supports both models, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- A few announcements:
- There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
- The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
- If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
- Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Pulsar is and what the original inspiration for the project was?
- What have been some of the most challenging aspects of building and promoting Pulsar?
- For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
- What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
- What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
- The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
- One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?
- When is Pulsar the wrong tool to use?
- What are some of the improvements or new features that you have planned for the future of Pulsar?
Contact Info
- Matteo
- Rajan
- @dhabaliaraj on Twitter
- rhabalia on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pulsar
- Publish-Subscribe
- Yahoo
- Streamlio
- ActiveMQ
- Kafka
- Bookkeeper
- SLA (Service Level Agreement)
- Write-Ahead Log
- Ansible
- Zookeeper
- Pulsar Deployment Instructions
- RabbitMQ
- Confluent Schema Registry
- Kafka Connect
- Wallaroo
- Kinesis
- Athenz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. We've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California, happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets.
The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business; go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey, and today I'm interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system. So, Matteo, could you start by introducing yourself?
[00:02:34] Unknown:
Sure. Hi, Tobias. My name is Matteo Merli. I work on Pulsar, and I used to work at Yahoo for several years on pub-sub messaging, distributed database replication, and various infrastructure tasks. I continue to work on the project and try to bring it forward with new features and, of course, new bugs.
[00:02:59] Unknown:
And, Rajan, how about yourself?
[00:03:01] Unknown:
Yeah, sure. My name is Rajan Dhabalia, and I'm working on Pulsar at Yahoo. Before joining Yahoo, I worked with different teams on platform infrastructure and on building various distributed systems. A few years back I joined Yahoo, and since then I've been working on Apache Pulsar. I'm also working here on a distributed storage system, which stores distributed key-value pairs.
[00:03:33] Unknown:
And going back to you again, Matteo, do you remember how you first got involved in the area of data management?
[00:03:39] Unknown:
When I started working at Yahoo, I started working on database replication. Yahoo has a massive database that originated from the old PNUTS paper, and one of the critical parts of that system is moving data between different regions. You have a database in which updates can come from different regions and have to be reconciled, so data has to be moved across different data centers at a massive scale. That was the first project I started working on, and that's where I started seeing copious amounts of data and moving that data around.
[00:04:23] Unknown:
And, Rajan, how about yourself? Do you remember how you first got involved in the area of data management?
[00:04:27] Unknown:
Yeah. Before joining Yahoo, I had been working on various distributed systems that strongly required messaging and database systems. On the messaging side, we worked with different messaging systems at a very large scale of data, and on the database side, we also had a big requirement for building a system that replicated data between different clusters. Then I joined Yahoo, where it caught my eye that Yahoo was building a distributed messaging system, which is now known as Pulsar. That's where I started working on Pulsar. At the time it was just getting started, and it has matured over time, with many different teams onboarding onto Pulsar to satisfy their asynchronous and pub-sub requirements.
And now a large portion of Yahoo has started using Apache Pulsar on a very
[00:05:36] Unknown:
heavy basis. Yeah. So can you explain what the Pulsar project is and what the original inspiration for it was, particularly given the fact that there were existing publish-subscribe systems?
[00:05:50] Unknown:
The main reason was that we at Yahoo realized that, even though many groups were using pub-sub on their own, there was no central system at Yahoo for messaging. Everything was on its own, and every team chose a different solution; it could be ActiveMQ, it could be Kafka, it could be many different things. The problem was that each team didn't have the necessary operational expertise to operate these systems for critical projects, and there was also a lot of duplicated effort around monitoring, around how to operate them, and around what to do when things fail. So there was a push to have a single service for the whole company. When we started that, we evolved it from other proprietary Yahoo systems and tried to figure out whether we could leverage anything that was out there in open source, and what the best solution and approach for the task would be. We tested out all the available solutions, but we couldn't find anything that was able to scale to the point where we could support the internal use cases all in a single cluster, or even some of the most demanding use cases, especially with the requirements in terms of durability, in terms of the kind of latency we could offer, some kind of SLA, and in terms of operating a very large cluster. That's when we sat down and started designing from the ground up. We took BookKeeper, which was another Yahoo project, for the storage part, and eventually we built our own broker on top of it: a very lightweight broker component that could leverage the horizontally scalable storage of BookKeeper.
[00:07:57] Unknown:
Actually, yeah, I would like to add a few more things about why we saw Pulsar as the right system to bring into Yahoo. After learning most of the teams' requirements, we realized that we had three fundamental requirements at Yahoo that were not available in the other pub-sub messaging frameworks on the market. The first one was throughput and performance. In Yahoo we have many applications that require strict SLAs in terms of publishing messages.
We have one of our biggest systems, Sherpa, which requires 5 milliseconds of publish latency. As Matteo mentioned, Pulsar has two major components: one is the broker and the second is Apache BookKeeper. BookKeeper is the persistent storage system that persists all the published messages, and its architecture fulfills the throughput requirement. In BookKeeper, each individual node is known as a bookie, and the bookie's architecture helped us isolate the IO operations for reads and writes, because in a bookie, reads and writes happen on different execution paths. Whenever a message is published and needs to be persisted, it goes through the bookie's journal.
Each bookie node has a single journal, which is a write-ahead log, and it appends every write to it. Those writes are then periodically flushed to a dedicated, separate storage device in the background, and whenever a read happens, it is served from that separate storage device. So if any consumer tries to read as fast as possible, it can saturate the IO of the storage device, but it won't impact the journal device we are writing to. If consumers are reading as fast as possible, even if that IO is saturated, it won't impact the publish latency. So publish throughput and latency was the first use case. The second was scalability.
Right now in Yahoo, one of our clusters has more than 1,000,000 topics; Pulsar is serving more than a million topics in a single cluster. That scalability can be achieved in Pulsar because, if you consider another pub-sub framework, let's say Kafka, each partition in Kafka can be read by one consumer, so the number of partitions you create is bound to the number of consumers. Also, if every topic has its own dedicated file or directory, then all the IO gets scattered across many different directories, which directly impacts IO throughput because each of them has to be flushed periodically. In Pulsar we use Apache BookKeeper, which aggregates messages from multiple topics into one large log file. That prevents the IO scattering and keeps the IO throughput high. The third main use case was geo-replication, where for most applications at Yahoo we need to replicate messages from one cluster to multiple different regions. In Pulsar, the broker itself takes care of replicating from one geo cluster to the other geo locations.
So it doesn't require any user intervention or attention while replicating the messages, because it is taken care of on the server side by the broker. These were the three fundamental use cases that were strong requirements at Yahoo and that motivated us to build Pulsar.
[00:12:28] Unknown:
And there are a few questions that I have coming out of that. One of them is about the geo-replication. You mentioned that the Pulsar layer is what's responsible for handling the distribution of the messages to those distributed clusters, so I'm curious if it's essentially just replaying the message as a publish event to the other clusters. It seems that, that way, the underlying BookKeeper layer doesn't have to have any understanding of the separate clusters as far as distributing the data in that manner. Is that accurate?
[00:13:00] Unknown:
That is correct. The main kind of replication is async replication, and it is handled broker to broker. As Rajan was mentioning, it's a very core feature of the system. It's not some tool that you have to run; it's a property that you configure. You say, I want my topic to be replicated in regions a, b, and c, and the system will take care of it. The other kind of replication, sync replication, you can still do by placing storage nodes in different clusters and using a rack-aware placement policy to make the data present in both clusters. That would be synchronous replication, and it's actually a single logical cluster with data scattered across multiple regions, so that's more of a deployment configuration choice.
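As a concrete illustration of that per-namespace replication property, here is a minimal sketch using the Pulsar Java admin client. The admin URL, tenant, namespace, and cluster names are made up for the example, and the exact method signatures vary somewhat between Pulsar releases.

```java
import java.util.Arrays;
import java.util.HashSet;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class GeoReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the admin endpoint of any cluster in the Pulsar instance.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Declare that every topic under this namespace should exist in three regions.
        // The brokers then replicate published messages between the clusters themselves;
        // no external mirroring tool has to be run.
        admin.namespaces().setNamespaceReplicationClusters(
                "my-tenant/my-namespace",
                new HashSet<>(Arrays.asList("us-west", "us-east", "eu-central")));

        admin.close();
    }
}
```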
[00:13:49] Unknown:
And then on the persistence side, is it tunable as far as the level of consistency that's required for a write to be acknowledged, where you submit the publish event to Pulsar and then it requires some number of nodes to acknowledge the write before it will return back to the person doing the publish that it was successful?
[00:14:14] Unknown:
Yes, that is correct. In BookKeeper there are different tunables that you can configure. The replication model for BookKeeper is very different from what it is for Kafka, and it has several advantages with BookKeeper and Pulsar. As we said, we have two different layers: one is the broker, one is the storage. That gives you a complete decoupling between where the data is stored and where the data is served, and that is much more flexible. Yes, you can configure the level of replication that you want.
And you can also configure different ways to distribute the data across different storage nodes. For example, you can say that I want two copies, or I want three guaranteed copies but I want to write one more copy and wait for three acks or two acks. So you can trade more writes for lower latency, and that's something that can help a lot
[00:15:21] Unknown:
in reducing the latency, because it can prune out the slowest node. We can also tune the replication factor at the tenant level. In Pulsar, every tenant can have groups of topics, which we call namespaces, and the replication factor can be tuned per namespace. So it can be configured as per each tenant's requirements.
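A sketch of what that tuning looks like through the admin API, assuming the Java admin client and illustrative names; the three integers map to the ensemble size, the write quorum (copies written), and the ack quorum (acks waited for) described above.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PersistenceTuningSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Stripe each segment over 4 bookies, write 3 copies of every entry, and treat
        // the write as successful once 2 bookies acknowledge it. Writing one more copy
        // than the acks we wait for lets the broker ignore the slowest bookie, which is
        // the latency trade-off described above. The last value is the mark-delete rate
        // limit (0.0 means unthrottled).
        admin.namespaces().setPersistence(
                "my-tenant/my-namespace",
                new PersistencePolicies(4, 3, 2, 0.0));

        admin.close();
    }
}
```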
[00:15:51] Unknown:
And do you provide a general-purpose set of defaults when you install the cluster, so that you don't have to worry about working with all those different parameters when you first start working with the software? Because the general issue of usability versus configurability is a spectrum: if you have too many knobs to tune, then you're not quite sure which ones to tune, and you can actually make things worse than they started off.
[00:16:16] Unknown:
Yes. The default is to have two copies of the data, so write to two nodes and wait for two acknowledgments. That gives you two-copy durability, and that's the default. And, of course, it's tunable, so you can change the default for the whole cluster or you can change the setting for a particular group of topics.
[00:16:37] Unknown:
And what have been some of the most challenging aspects of building the Pulsar project, as well as promoting it outside of Yahoo and gaining interest in the project itself?
[00:16:48] Unknown:
Yeah, these are very different challenges. For building the system, I would say that the main challenges were on the scalability part: how to design a system that is able to scale along very different axes, in throughput on a single topic, in number of topics, and in adding new nodes. The operational part is also very critical. Once you reach a certain scale of operations, all the tools and the reliability of those tools and operations need to be very solid. And also, how does the system degrade when you put different kinds of workloads in the same cluster?
How do you make sure that one tenant doesn't affect the workloads of the other tenants? Rajan, do you have other things to add?
[00:17:40] Unknown:
Oh, yeah, I think we covered everything there. For catching more attention outside of Yahoo, from a larger community, I think the project has to have solid infrastructure in terms of making the system more reliable, providing monitoring APIs, and also providing an easier way to deploy the service onto various cloud providers. Many different contributors in the Apache Pulsar community have contributed a lot in terms of making Pulsar more reliable and providing different solutions so that people can easily deploy it on various cloud providers like AWS and Google Cloud and get started with Pulsar very quickly. That gives users a fair idea of how quick it is to deploy, and then once they start using it, with the throughput and the scalability that Pulsar provides, they start realizing the main idea behind using Pulsar as a pub-sub messaging framework for their use cases. A lot of things have been done, and we are putting in consistent effort to take it further forward.
[00:19:04] Unknown:
Yes, on the adoption part, I would just add that one of the main challenges there is that this is a critical part of infrastructure, so it's not a project that anyone can take on lightly. There's a lot of work to convince people that this is a viable alternative to other established products, and that it is worth giving it a try, because we believe that many of the operational and scalability issues are not issues with Pulsar, and that there are several advantages.
[00:19:45] Unknown:
And how much has the overall system design and architecture evolved over the course of the project's lifetime? And are there any decisions that were made early on that have proven to be problematic for being able to grow and evolve the project as new use cases or requirements are introduced?
[00:20:05] Unknown:
I don't recall any big obstacle. I think the architecture is very solid in the sense that the main idea with Pulsar is that you have a pub-sub layer that is based on a distributed log. That's a very flexible architecture, and it lets us support very different use cases. For example, with Pulsar you can have different types of consumers. You can consume your data in order, like you do in Kafka: you have a topic, you have partitions, and you can have a consumer per partition that consumes all the data in order. But you can also use the more traditional queue approach, in which you have a single shared queue with multiple consumers that are all taking items out of the same queue. We can support this because we have this log underneath and then we have this serving layer, the broker, on top of it. That has not changed since inception.
We've just been adding a few more features here and there. We are adding more components, but not really any dramatic changes from the beginning.
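To make the two consumption models concrete, here is a minimal sketch with the Pulsar Java client; the service URL, topic, and subscription names are illustrative. A Failover subscription gives the ordered, streaming-style behavior, while a Shared subscription on the same topic behaves like a work queue.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SubscriptionModesSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // illustrative broker address
                .build();

        // Streaming style: a Failover subscription keeps a single active consumer per
        // topic (or partition), so messages are processed strictly in order.
        Consumer<byte[]> ordered = client.newConsumer()
                .topic("persistent://my-tenant/my-namespace/events")
                .subscriptionName("ordered-processor")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Queuing style: a Shared subscription fans messages out across any number of
        // consumers attached to the same subscription, like a work queue.
        Consumer<byte[]> worker = client.newConsumer()
                .topic("persistent://my-tenant/my-namespace/events")
                .subscriptionName("worker-queue")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // ... receive and acknowledge messages, then clean up.
        ordered.close();
        worker.close();
        client.close();
    }
}
```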
[00:21:28] Unknown:
And for somebody who wants to start using Pulsar in their own environment, what are some of the infrastructure and network requirements that they should be considering, and what's involved in deploying those various components? There are
[00:21:42] Unknown:
no special requirements. The minimum requirements are very usual; I would say at least three nodes if you want to have the data replicated. In terms of deployment, we have instructions on our website for deploying in very different environments, from bare metal with Ansible scripts, to AWS with Terraform, and so on. With Kubernetes we also have recipes already available, so in just a few commands you can spin up a cluster in the cloud or in a Kubernetes cluster and have Pulsar running. Of course, a massive deployment will probably require some care in choosing the right hardware, but that is just a matter of doing some capacity planning and deciding how much bandwidth you need and what kinds of disks are better for these use cases.
[00:22:48] Unknown:
Yeah, I'll add a very quick overview of the deployment. Pulsar has three major subcomponents that need to be deployed. Two we already talked about: the Pulsar broker, which is the serving layer, where clients connect to the broker and it passes messages through from producers to consumers; and, for the persistent storage, Apache BookKeeper, where we can start with at least three bookie nodes, as Matteo mentioned, if we want to keep two copies of the data. And there's a third component that Pulsar uses, which is ZooKeeper.
Pulsar uses ZooKeeper to store the cluster metadata and the topic metadata. It stores the configuration information in ZooKeeper, and also uses it for the ownership and synchronization of namespaces and topics. So Pulsar requires these three subcomponents to be deployed, and as Matteo mentioned, there are instructions on the website for deploying onto a bare metal cluster, onto Kubernetes, onto Google Cloud with Kubernetes, or onto AWS.
Those instructions and recipes are already available, so with a few minutes of effort you can deploy it very easily. Then, based on the capacity and the traffic, those nodes can be adjusted.
[00:24:40] Unknown:
And as you're scaling the cluster, whether it's vertically or horizontally or across geographically distributed regions, what are the different factors along which you can scale, and where do things start to break down that people should pay special attention to as they start to increase the overall capacity of the cluster?
[00:25:02] Unknown:
One of the distinctive features of Pulsar, since we have these two layers, is that it's very easy to expand the capacity of a running cluster. That is because the data for a single topic, or a partition of a topic, is not stored on any one node in its entirety; the data is split into segments over time, and that lets us expand the capacity. If we need more storage capacity, we can just add more storage nodes over time, and the same thing can be done for brokers. If you see that the CPU or the bandwidth of the brokers is running high, you can just add more brokers. There's a load manager component that automatically takes care of assigning topics to brokers, and it will also use the load information of each broker to offload some of the topics to the less loaded brokers. These basic administration operations happen automatically, so you can just spin up new nodes and the traffic will move there.
[00:26:11] Unknown:
And in the event of a network partition, where does Pulsar fall in terms of the CAP theorem? Does it continue to operate in a partially degraded mode, or does it cease to accept new inputs until the network partition is healed?
[00:26:29] Unknown:
So for the consistency model, brokers own topics, and they acquire that ownership through ZooKeeper, so the consistency level is granted by ZooKeeper. That's what gives a broker the ownership, and that ensures the consistency. The consistency is also validated at the BookKeeper level: there is a fencing mechanism so that if a new broker takes over for any reason, it is able to fence off the old writer. So consistency is guaranteed; this is more like a CP system.
But, again, there are different trade-offs. ZooKeeper is a CP system, but you also have multiple nodes to give you availability, so if you have a bigger quorum, you have more chances of availability and you don't have a single point of failure.
[00:27:32] Unknown:
And over the course of our conversation, we've mentioned Kafka a number of times as being a somewhat comparable system in terms of the way that people might use the two projects. So I'm wondering if there are any other projects or services that you would consider to be competitors to Pulsar, and if you can do a bit of compare and contrast as you introduce some of those different projects.
[00:27:57] Unknown:
Sure. Setting aside all the reasons that we enumerated at the beginning — the scalability, the geo-replication, the features that were not there in any other system — I think that interface-wise and messaging-model-wise there is certainly similarity with Kafka and other systems; I would also mention RabbitMQ. I would say there are two main families of messaging systems. One comes from log aggregation and eventually evolved into streaming: Kafka was designed initially as a log aggregation system, and that is reflected a lot even in its current state.
Eventually it evolved into streaming and is a streaming platform now. Then you have the more traditional message queues like ActiveMQ and RabbitMQ; these are the traditional model of pub-sub, with a much more flexible API and a richer set of features for how to route messages and how to consume them. You don't have to keep track of your offset in a pub-sub system, typically. The main issue with all of these traditional pub-sub systems is that they were always lacking in the storage part: they didn't have a very solid storage and replication story. So you get a very flexible API, but the persistence guarantees were a bit weak on that front. On the other end, you have the log-based systems, with replication and so on, but they're a bit less flexible. For example, it's not trivial to build a queue on top of Kafka, because in Kafka each partition has one consumer; you cannot attach multiple consumers, you can only add more partitions, and that makes building a worker queue a very tricky operation.
In Pulsar — to cut it short — since we have both the streaming and the queuing models on top of our log, we can offer the user a very high-level, easy-to-use API that supports both queuing and streaming, backed by a very solid, horizontally scalable storage layer.
[00:30:16] Unknown:
And is it possible for you to have two different consumers that are consuming the same topic in different fashions, where one of them is using the more log-oriented approach, tracking an offset and consuming the data in a streaming fashion, and then you have other consumers that are treating it as just a shared queue and pulling the most recent messages off of it? Yes. That's correct. So
[00:30:40] Unknown:
the idea there is that when you publish to the topic, you don't have to worry about how it is going to get consumed. On the consumer end, you can have different subscriptions, and each subscription is independent from the others.
[00:30:58] Unknown:
And I know, too, that there's an API compatibility layer for people who are using Kafka. They can ostensibly just drop Pulsar into their system, add in that API layer, and continue to use their consumers and producers as if nothing had changed. I'm wondering if you have any other similar compatibility layers, for example for being able to provide an AMQP interface.
[00:31:27] Unknown:
Not right now for AMQP; that's something that we are thinking of. The thing about Kafka is that most people are tied to the API, so it's very easy for us to build this wrapper, and the wrapper is just using the Pulsar client library underneath. With AMQP, everyone can use a different client library; there's no single official API. So on that front it would probably make more sense for us to build a service that speaks AMQP, but that's a lot of work to do, so that's still up in the air.
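For context, the idea of the wrapper is that existing Kafka producer and consumer code keeps compiling against the familiar Kafka classes while the Pulsar client does the work underneath. A rough sketch, assuming the pulsar-client-kafka compatibility artifact is swapped in for the regular Kafka client jar; the service URL and topic name are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaWrapperSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The only code-level change: the "bootstrap" URL points at a Pulsar broker.
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Same Kafka API as before; under the hood it is the Pulsar client publishing.
        Producer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("my-topic", "key-1", "hello from the Kafka API"));
        producer.close();
    }
}
```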
[00:32:08] Unknown:
Yeah, the overall flexibility of the AMQP protocol, I can imagine, would cause a lot of complexity in terms of being able to provide a wrapper that fills all the various different use cases, and you'd probably be better suited to building specific implementations where you want to, for instance, replace RabbitMQ for interfacing with somebody who might be using Celery with Python or some of the other task-based systems.
[00:32:34] Unknown:
Yeah, that's correct. AMQP is a very big specification, and implementing the full spec would be a lot,
[00:32:44] Unknown:
a good good amount of work. And going back to the Kafka compatibility, I'm wondering if that goes so far as to also support some of the other tooling and plugins that are built around the Kafka ecosystem such as the confluence schema registry or the Kafka connect, implementations?
[00:33:07] Unknown:
So right now we don't support the schema registry and the tooling around that, but I think that shouldn't be a problem, because the schema registry goes out of band from the messaging protocol; it's not something that happens in line with the messaging, so it can be done. On the other hand, we are working on having a schema registry in Pulsar itself. That's something that we want to integrate within our messaging protocol to ensure that the schema can be validated and enforced. So one option would also be to use our own schema registry to validate that as well.
[00:33:54] Unknown:
And again on the subject of compatibility with Kafka, at least in terms of the conceptual models people have about it, is the idea of message persistence. I know you mentioned that Pulsar has the ability to maintain the log of messages that have been published to a topic, and when I was looking through the documentation, it looks like it's configurable as to how long those individual messages are kept around. So I'm wondering what the long-term storage capacity for a Pulsar system looks like, as compared to how somebody might use Kafka for something like an event-oriented system where they can replay the entire sequence of events from beginning to end, and how long those events are persisted
[00:34:41] Unknown:
for? Sure. I will say that one big difference from Kafka is that in Pulsar, the broker actually knows the position of each consumer. Because of that, the default, for example, is to delete the data as soon as all the subscriptions have consumed it, so you don't need to have time-based retention. And if some consumer is falling behind because it's down for several hours, you will not lose data based on retention, because the retention time is in addition to the subscription-based retention. You can say that I want to keep my data for at least 6 or 12 hours or 3 days, or whatever time; it just depends on disk space. The other interesting difference on the storage part is that with BookKeeper — since, as I keep repeating, the data for a single topic is not stored on a single node — that is very important, because in Kafka a partition has to be stored in its entirety on a single Kafka broker. So you basically cannot retain data for more than a certain amount of time, because once you fill up the single disk of a machine, the partition cannot grow anymore.
With Pulsar and BookKeeper, we can keep adding new segments, and these segments are placed on different storage nodes. So if you fill up all your disk space, you can just add more nodes and keep writing into those nodes. Your storage capacity is not limited by a single node's capacity.
[00:36:28] Unknown:
Yeah, and one more thing. As Matteo mentioned, the biggest difference between Pulsar and Kafka is that the Pulsar broker knows exactly what the read position of each consumer subscription is, so as soon as every subscription has read the data, the Pulsar broker can purge it immediately. The second thing is retention, where we can define how long we want to keep messages. And the third thing Pulsar supports is message TTL, where you can configure at the namespace level how long you want to keep your messages; once the time-to-live of a message has expired, it will be deleted whether or not a subscription has read it.
So for subscriptions that are building up a backlog because they aren't able to consume messages for a long time, which grows the storage, we have two options. One is to set the message TTL at the namespace level, so the data is deleted after a certain duration. The second is to force publishers to stop producing messages until the consumers have caught up. Let's say we have a limit at the tenant level that a specific tenant can consume a certain percentage of the storage of the entire cluster, say 50 gigabytes. Then we can enforce that publishers can only publish up to 50 gigabytes of data, and if consumers are not consuming the messages, we don't want to lose data, so we force the publishers to stop publishing.
So it supports various kinds of use cases where, based on the retention requirement, you can either stop publishers from publishing messages or expire the messages, so your publishers can keep publishing without hitting your actual storage limitation.
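Those three knobs — retention of acknowledged data, message TTL, and a backlog quota that pushes back on publishers — are all namespace-level policies. A hedged sketch with the Java admin client; the names and limits are illustrative, and the exact admin API classes have shifted a bit across releases (newer ones build BacklogQuota through a builder rather than a constructor).

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BacklogQuota;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class RetentionAndQuotaSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();
        String ns = "my-tenant/my-namespace";

        // Keep acknowledged messages for up to 3 days or 100 GB, whichever comes first,
        // on top of the default "delete once every subscription has consumed it" behavior.
        admin.namespaces().setRetention(ns, new RetentionPolicies(3 * 24 * 60, 100 * 1024));

        // Expire unacknowledged messages after 24 hours regardless of subscriptions (message TTL).
        admin.namespaces().setNamespaceMessageTTL(ns, (int) TimeUnit.HOURS.toSeconds(24));

        // Cap the backlog at 50 GB and hold further publishes until consumers catch up,
        // instead of silently dropping data.
        admin.namespaces().setBacklogQuota(ns,
                new BacklogQuota(50L * 1024 * 1024 * 1024,
                        BacklogQuota.RetentionPolicy.producer_request_hold));

        admin.close();
    }
}
```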
[00:38:44] Unknown:
And also, having the ability to expire messages based on the overall amount of time that a message has spent in the queue can be valuable when the data has some sort of expiration period inherent to it. If somebody wants to make a reservation for a particular time and that time has gone by, then it doesn't matter anymore for it to be processed, so you can just purge that data without ever having to use up the CPU cycles on the consumer side. It's actually a use case that was mentioned when I was speaking with Sean T. Allen about Wallaroo.
That was a use case that he mentioned people requesting as well, again in a streaming context, where if the data has some inherent life cycle and you can't process it before the end of that life cycle, then you don't want to waste the computation on processing it when it's not necessary anymore.
[00:39:41] Unknown:
Yes, yeah. Each use case has very different trade-offs, and it really depends on the application. There are very different properties of the data and the kinds of events that are being pushed into topics, and we try to be as flexible as possible to accommodate all kinds of use cases and make everything work in a single cluster.
[00:40:09] Unknown:
And given the fact that Pulsar is so scalable and adaptable and tunable, it definitely seems like there is a large variety of use cases that it can be applied to. But when is Pulsar the wrong tool to use, or what are some use cases where you think that somebody should use something besides Pulsar?
[00:40:29] Unknown:
I think we have seen two use cases at Yahoo where people tried to use Pulsar and we realized it's not a good match. The first is that Pulsar should not be used as a database, where people publish messages and then want the messages to be stored and treated like a database; that's definitely not a good use case. The second is that sometimes people wanted to use it as a scheduler system, where you publish a message and you want to consume it at a particular, specifically defined time.
Since you would basically be building a scheduler on top of Pulsar, I think, again,
[00:41:10] Unknown:
Pulsar might not be the right system there. Although I would correct that slightly: both of us have seen very bad scheduler implementations on top of Pulsar, but I think it is possible to do it in the right way. It's just that we have seen various bad implementations.
[00:41:30] Unknown:
Yeah. Anything else you would like to add, Matteo, about where Pulsar might not be the right tool? Yeah, so,
[00:41:36] Unknown:
I think the strong points are — I mean, it's a flexible system that you can adapt to very different things. The strong points are in the data durability guarantees, the storage, the low latency, and the high throughput. And another thing that we haven't mentioned is that in BookKeeper the guarantee is that the data has been flushed to disk. So we are actually able to offer very low SLAs on publish latency, and high throughput, while making sure the data is actually flushed to disk. In other systems, like Kafka, the default is to let the page cache handle the write, so the write is actually in memory, in the OS memory, and then the OS will flush it. So when you get the acknowledgment from Kafka, what you have is the data in the machines' memory, and you still have a lot of chances of losing that data if you have a power outage or those kinds of scenarios.
[00:42:37] Unknown:
And are there any particular new features
[00:42:41] Unknown:
or overall system improvements that you have planned for future releases of Pulsar? I would say that the main ones from our end are adding this schema registry, having a way to enforce the format of the data, and having a way to also discover it, so that you can build tools that will discover what kind of data was published on a topic, as well as validating and enforcing that information. Also, one of the bigger features is topic compaction, which basically gives you a snapshot of the topic based on keys. And Rajan probably has other features that he's working
[00:43:19] Unknown:
on. Yeah. So I think those are the main features, and the other feature we are trying to build is a replication service that can replicate from Pulsar to any other system. It should be pluggable. Let's say a user is publishing messages into the local cluster and would also like to replicate the same messages to a different system — say Kinesis on AWS, or some other non-messaging system. Right now, users need to write their own implementation and do dual publishing: they publish to Pulsar, and they also publish to the other system. So what if we could provide a feature similar to what geo-replication does within Pulsar? In geo-replication, the replication happens from cluster A to cluster B within the Pulsar broker; the same could be done for other systems by adding a plugin framework within the Pulsar replication service. That way it reduces the load on the client side, and the client doesn't need to pay any attention to, or worry about, the replication to any other system.
So we are building that pluggable framework, where we can plug in a replication connector that talks to a different end system and reliably replicates the messages. That is the other big thing we are working on right now to help our users.
[00:45:01] Unknown:
Are there any other topics or aspects of the project that we didn't cover yet that you think we should talk about before we start to close out the show? I was thinking
[00:45:12] Unknown:
we could cover the multi-tenancy within Pulsar, which covers security and probably other features like isolation and resource quotas.
[00:45:27] Unknown:
And one of the other features that we didn't talk about yet is the fact that Pulsar was designed to provide for a multi-tenant deployment, where you can have isolation between different users of the system. So I'm wondering if you can talk a bit about how that's implemented at the system level, and some of the security capabilities that you have built in for operators of the cluster to ensure that you don't have data leakage between the different tenants of the system, as well as some of the use cases that it provides for in a hosting context, for example, that would be impractical for some of the other systems that somebody might use in that kind of environment?
[00:46:07] Unknown:
So in Pulsar we have added the multi-tenancy feature at multiple layers. Right now, Pulsar provides authentication and authorization capabilities, where we can plug in any authentication framework. By default, Pulsar provides the Athenz authentication framework, which is another open source project from Yahoo. So authentication is a pluggable framework that Pulsar provides, which authenticates the user and gets the principal role, and based on that role, Pulsar does the authorization. Right now the authorization is central.
It's not pluggable, though. So it provides tenant identification and then gives the user access to the cluster for publishing and consuming messages. Another thing is resource isolation, where it tries to protect or isolate resources in terms of storage and in terms of CPU. For storage, as we discussed earlier, in Pulsar we can configure a resource quota for how much storage space a tenant can acquire. At the namespace level, we can define the maximum storage capacity that a tenant can use, and beyond that, not allow the tenant to publish more messages or store more data on disk. In that way, it protects against a specific tenant acquiring all the storage space within Pulsar. Regarding CPU isolation, in Pulsar we have introduced message consumption throttling.
We can configure, for any namespace or any particular topic, how many messages a consumer can receive. Sometimes many different consumers try to consume as fast as possible, and on a particular topic where the message throughput is pretty high, that specific set of consumers could eat up all the network and CPU resources of the cluster. To prevent that, we can define a throttling limit per consumer or per topic, which protects against those specific consumers taking up the cluster's CPU and network bandwidth. Those are the main multi-tenancy features Pulsar provides.
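As one concrete example of that consumption throttling, dispatch rates can be set per namespace through the admin API. A sketch assuming the Java admin client and the constructor-style DispatchRate that older releases expose (newer ones use a builder); the numbers and names are illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.DispatchRate;

public class DispatchThrottlingSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Cap dispatch on every topic in the namespace to 1,000 messages per second or
        // 10 MB per second, whichever is hit first, measured over a 1-second window,
        // so a fast-draining consumer cannot monopolize broker CPU and network.
        admin.namespaces().setDispatchRate(
                "my-tenant/my-namespace",
                new DispatchRate(1000, 10 * 1024 * 1024, 1));

        admin.close();
    }
}
```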
[00:48:45] Unknown:
Matteo, would you like to add anything else? Yeah, just to mention that these are all soft isolation, software-level protections that prevent tenants from affecting each other. But the main core feature that enables us to provide multi-tenancy is the IO isolation that happens at the storage node level. That happens because the read path and the write path are separated. You have different devices: a journal device that the writes are written to (and then cached in memory), and a storage device, which is a separate physical disk, or multiple disks. What that gives you is that all the reads come from the storage device, while writes are only synced on the journal device. So the two IO paths are separated, and if some reader is causing a lot of disk IO, it won't impact the writers. It won't affect the SLA of publish latency or throughput for other tenants, and that's the feature that enables us to offer this as a service.
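That device separation is a bookie-side setting. A minimal sketch of the idea using BookKeeper's server configuration class; the paths are illustrative, and in practice the same settings usually live in bookkeeper.conf as journalDirectory and ledgerDirectories rather than being set in code.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class BookieIoIsolationSketch {
    public static void main(String[] args) {
        ServerConfiguration conf = new ServerConfiguration();

        // Write path: every write is appended and synced to the journal on a fast device.
        conf.setJournalDirName("/mnt/fast-ssd/bookkeeper/journal");

        // Read path: entries are periodically flushed to ledger storage on separate disks,
        // and reads are served from here, so catch-up readers never touch the journal device.
        conf.setLedgerDirNames(new String[] {
                "/mnt/disk1/bookkeeper/ledgers",
                "/mnt/disk2/bookkeeper/ledgers"
        });
    }
}
```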
[00:49:55] Unknown:
So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final parting question, to give the listeners one more thing to think about: from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:50:19] Unknown:
That's a good question. I would say that there are very different systems with very different trade-offs, and there are not many truly scalable systems. This is something that, when you choose a system, people typically don't worry about that much, and all the pain points that you have when scaling you only see after a while, after you get acquainted with a certain system and figure out all the operational pain points, or when things get more complex due to scale. So I think this is something that Pulsar has done a lot of work on, and I think we have closed the gap in this particular area of messaging — not broad data management, but having a single platform for sharing data across different components in a very solid and scalable way. Rajan, do you have any thoughts on that topic?
[00:51:27] Unknown:
Yeah, I have a similar belief. Whenever we build such a scalable, distributed system, it includes multiple other components; in the example of Pulsar, it depends on two other components, Apache BookKeeper and ZooKeeper. At Yahoo we operate at a very large scale of data and traffic, in terms of the number of messages going in and out, and as it scales, we find many different pain points, which could be within the system or within the different subcomponents. At some point you hit a particular limit where you need some big change or modification to scale further. It could be within Pulsar itself; it could be within ZooKeeper, where it can only support a certain number of nodes and you cannot go beyond that; or it could be in the storage. You really have to come back to the drawing board and think about what could be done for the next step, to build the next level of scalability.
But those things are always a good learning curve: you hit the first milestone of your scalability limit, and then you try to improve it, make it more scalable, and take it further. It's always a learning opportunity when you hit that limit.
[00:53:09] Unknown:
Alright. Well, I want to thank the both of you for taking the time out of your day to join me and discuss the work that you're doing with Pulsar. It's definitely a very interesting project and one that appears to solve a lot of interesting problems, so I'm definitely going to be keeping an eye on how it evolves over the coming months and years. So thank you again for your time, and I hope you each enjoy the rest of your day. Thank you for having us and for all the interesting questions. Thank you.
Introduction to Guests and Pulsar
The Origin and Motivation Behind Pulsar
Geo-Replication and Consistency in Pulsar
Challenges in Building and Promoting Pulsar
System Design and Evolution
Comparing Pulsar with Other Messaging Systems
Message Persistence and Storage Capacity
Use Cases Where Pulsar Excels and Falls Short
Future Features and Improvements
Multi-Tenancy and Security in Pulsar
Biggest Gaps in Data Management Tooling
Closing Remarks and Future Outlook