Summary
One of the critical components of modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss their work on Pulsar, which supports both models, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- A few announcements:
- There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
- The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
- If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
- Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Pulsar is and what the original inspiration for the project was?
- What have been some of the most challenging aspects of building and promoting Pulsar?
- For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
- What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
- What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
- The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
- One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?
- When is Pulsar the wrong tool to use?
- What are some of the improvements or new features that you have planned for the future of Pulsar?
Contact Info
- Matteo
- Rajan
- @dhabaliaraj on Twitter
- rhabalia on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pulsar
- Publish-Subscribe
- Yahoo
- Streamlio
- ActiveMQ
- Kafka
- Bookkeeper
- SLA (Service Level Agreement)
- Write-Ahead Log
- Ansible
- Zookeeper
- Pulsar Deployment Instructions
- RabbitMQ
- Confluent Schema Registry
- Kafka Connect
- Wallaroo
- Kinesis
- Athenz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. We've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California, happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets.
The O'Reilly AI Conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business; go to dataengineeringpodcast.com/aicon-new-york to register and save 20%. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey, and today I'm interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system. So, Matteo, could you start by introducing yourself?
[00:02:34] Unknown:
Sure. Hi, Tobias. My name is Matteo Merli. I work on Pulsar, and I used to work at Yahoo for several years on pub-sub messaging, distributed database replication, and various infrastructure tasks. I continue to work on the project and try to bring it forward with new features and, of course, new bugs.
[00:02:59] Unknown:
And, Rajan, how about yourself?
[00:03:01] Unknown:
Yeah, sure. My name is Rajan Dhabalia, and I'm working on Pulsar at Yahoo. Before joining Yahoo, I worked with different teams on platform infrastructure and on building various distributed systems. A few years back I joined Yahoo, and since then I've been working on Apache Pulsar. I'm also working here on a distributed storage system, which stores distributed key-value pairs.
[00:03:33] Unknown:
And going back to you again, Matteo, do you remember how you first got involved in the area of data management?
[00:03:39] Unknown:
When I started working at Yahoo, I started working on database replication. Yahoo has a massive database that originated from the old PNUTS paper, and one of the critical parts of that system is moving data between different regions. You have a database in which updates can come from different regions and have to be reconciled, so data has to be moved across different data centers at a massive scale. That was the first project I started working on, and that's where I started seeing copious amounts of data and moving that data around.
[00:04:23] Unknown:
And, Rajan, how about yourself? Do you remember how you first got involved in the area of data management?
[00:04:27] Unknown:
Yeah. Before joining Yahoo, I had been working on various distributed systems that strongly required messaging and database systems. On the messaging side, we worked with different messaging systems at a very large scale of data, and on the database side, we also had a big requirement for building a system that replicated data between different clusters. Then I joined Yahoo, where it caught my eye that Yahoo was building a distributed messaging system, which is now known as Pulsar. That's where I started working on Pulsar. At the time it was just getting started, and it has matured over time, with many different teams onboarding onto Pulsar to satisfy their asynchronous and pub-sub requirements.
And now a large portion of Yahoo has started using Apache Pulsar on a very
[00:05:36] Unknown:
heavy basis. Yeah. So can you explain what the Pulsar project is and what the original inspiration for it was, particularly given the fact that there were existing publish-subscribe systems?
[00:05:50] Unknown:
The main reason was that we at Yahoo realized that, even though many groups were using pub-sub on their own, there was no central system at Yahoo for messaging. Everything was on its own, and every team chose a different solution; it could be ActiveMQ, it could be Kafka, it could be many different things. The problem was that each team didn't have the necessary operational expertise to operate these systems for critical projects, and there was also a lot of duplicated effort around monitoring, around how to operate them, and around what to do when things fail. So there was a push to have a single service for the whole company. When we started that, we evolved it from other proprietary Yahoo systems and tried to figure out whether we could leverage anything that was out there in open source, and what the best solution and approach for the task would be. We tested out all the available solutions, but we couldn't find anything that was able to scale to the point where we could support the internal use cases all in a single cluster, or even some of the most demanding use cases, especially with the requirements in terms of durability, in terms of the kind of latency we could offer, some kind of SLA, and in terms of operating a very large cluster. That's when we sat down and started designing from the ground up. We took BookKeeper, which was another Yahoo project, for the storage part, and eventually we built our own broker on top of it: a very lightweight broker component that could leverage the horizontally scalable storage of BookKeeper.
[00:07:57] Unknown:
Actually, yeah, I would like to add a few more things about why we saw Pulsar as the right system to bring into Yahoo. After learning most of the teams' requirements, we realized that we had three fundamental requirements at Yahoo that were not available in the other pub-sub messaging frameworks on the market. The first one was throughput and performance. In Yahoo we have many applications that require strict SLAs in terms of publishing messages.
We have one of our biggest systems, Sherpa, which requires 5 milliseconds of publish latency. As Matteo mentioned, Pulsar has two major components: one is the broker and the second is Apache BookKeeper. BookKeeper is the persistent storage system that persists all the published messages, and its architecture fulfills the throughput requirement. In BookKeeper, each individual node is known as a bookie, and the bookie's architecture helped us isolate the IO operations for reads and writes, because in a bookie, reads and writes happen on different execution paths. Whenever a message is published and needs to be persisted, it goes through the bookie's journal.
Each bookie node has a single journal, which is a write-ahead log, and it appends every write to it. Those writes are then periodically flushed to a dedicated, separate storage device in the background, and whenever a read happens, it is served from that separate storage device. So if any consumer tries to read as fast as possible, it can saturate the IO of the storage device, but it won't impact the journal device we are writing to. If consumers are reading as fast as possible, even if that IO is saturated, it won't impact the publish latency. So publish throughput and latency was the first use case. The second was scalability.
Right now in Yahoo, one of our clusters has more than 1,000,000 topics; Pulsar is serving more than a million topics in a single cluster. That scalability can be achieved in Pulsar because, if you consider another pub-sub framework, let's say Kafka, each partition in Kafka can be read by one consumer, so the number of partitions you create is bound to the number of consumers. Also, if every topic has its own dedicated file or directory, then all the IO gets scattered across many different directories, which directly impacts IO throughput because each of them has to be flushed periodically. In Pulsar we use Apache BookKeeper, which aggregates messages from multiple topics into one large log file. That prevents the IO scattering and keeps the IO throughput high. The third main use case was geo-replication, where for most applications at Yahoo we need to replicate messages from one cluster to multiple different regions. In Pulsar, the broker itself takes care of replicating from one geo cluster to the other geo locations.
So it doesn't require any user intervention or attention while replicating the messages, because it is taken care of on the server side by the broker. These were the three fundamental use cases that were strong requirements at Yahoo and that motivated us to build Pulsar.
[00:12:28] Unknown:
And there are a few questions that I have coming out of that. One of them is about the geo-replication. You mentioned that the Pulsar layer is what's responsible for handling the distribution of the messages to those distributed clusters, so I'm curious if it's essentially just replaying the message as a publish event to the other clusters. It seems that, that way, the underlying BookKeeper layer doesn't have to have any understanding of the separate clusters as far as distributing the data in that manner. Is that accurate?
[00:13:00] Unknown:
That is correct. The main kind of replication is async replication, and it is handled broker to broker. As Rajan was mentioning, it's a very core feature of the system. It's not some tool that you have to run; it's a property that you configure. You say, I want my topic to be replicated in regions a, b, and c, and the system will take care of it. The other kind of replication, sync replication, you can still do by placing storage nodes in different clusters and using a rack-aware placement policy to make the data present in both clusters. That would be synchronous replication, and it's actually a single logical cluster with data scattered across multiple regions, so that's more of a deployment configuration choice.
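As a concrete illustration of that per-namespace replication property, here is a minimal sketch using the Pulsar Java admin client. The admin URL, tenant, namespace, and cluster names are made up for the example, and the exact method signatures vary somewhat between Pulsar releases.

```java
import java.util.Arrays;
import java.util.HashSet;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class GeoReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the admin endpoint of any cluster in the Pulsar instance.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Declare that every topic under this namespace should exist in three regions.
        // The brokers then replicate published messages between the clusters themselves;
        // no external mirroring tool has to be run.
        admin.namespaces().setNamespaceReplicationClusters(
                "my-tenant/my-namespace",
                new HashSet<>(Arrays.asList("us-west", "us-east", "eu-central")));

        admin.close();
    }
}
```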
[00:13:49] Unknown:
And then on the persistence side, is it tunable as far as the level of consistency that's required for a write to be acknowledged, where you submit the publish event to Pulsar and then it requires some number of nodes to acknowledge the write before it will return back to the person doing the publish that it was successful?
[00:14:14] Unknown:
Yes, that is correct. In BookKeeper there are different tunables that you can configure. The replication model for BookKeeper is very different from what it is for Kafka, and it has several advantages with BookKeeper and Pulsar. As we said, we have two different layers: one is the broker, one is the storage. That gives you a complete decoupling between where the data is stored and where the data is served, and that is much more flexible. Yes, you can configure the level of replication that you want.
And you can also configure different ways to distribute the data across different storage nodes. For example, you can say that I want two copies, or I want three guaranteed copies but I want to write one more copy and wait for three acks or two acks. So you can trade more writes for lower latency, and that's something that can help a lot
[00:15:21] Unknown:
in reducing the latency, because it can prune out the slowest node. We can also tune the replication factor at the tenant level. In Pulsar, every tenant can have groups of topics, which we call namespaces, and the replication factor can be tuned per namespace. So it can be configured as per each tenant's requirements.
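A sketch of what that tuning looks like through the admin API, assuming the Java admin client and illustrative names; the three integers map to the ensemble size, the write quorum (copies written), and the ack quorum (acks waited for) described above.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PersistenceTuningSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Stripe each segment over 4 bookies, write 3 copies of every entry, and treat
        // the write as successful once 2 bookies acknowledge it. Writing one more copy
        // than the acks we wait for lets the broker ignore the slowest bookie, which is
        // the latency trade-off described above. The last value is the mark-delete rate
        // limit (0.0 means unthrottled).
        admin.namespaces().setPersistence(
                "my-tenant/my-namespace",
                new PersistencePolicies(4, 3, 2, 0.0));

        admin.close();
    }
}
```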
[00:15:51] Unknown:
And do you provide a general-purpose set of defaults when you install the cluster, so that you don't have to worry about working with all those different parameters when you first start working with the software? Because the general issue of usability versus configurability is a spectrum: if you have too many knobs to tune, then you're not quite sure which ones to tune, and you can actually make things worse than they started off.
[00:16:16] Unknown:
Yes. The default is to have two copies of the data, so write to two nodes and wait for two acknowledgments. That gives you two-copy durability, and that's the default. And, of course, it's tunable, so you can change the default for the whole cluster or you can change the setting for a particular group of topics.
[00:16:37] Unknown:
And what have been some of the most challenging aspects of building the Pulsar project, as well as promoting it outside of Yahoo and gaining interest in the project itself?
[00:16:48] Unknown:
Yeah, these are very different challenges. For building the system, I would say that the main challenges were on the scalability part: how to design a system that is able to scale along very different axes, in throughput on a single topic, in number of topics, and in adding new nodes. The operational part is also very critical. Once you reach a certain scale of operations, all the tools and the reliability of those tools and operations need to be very solid. And also, how does the system degrade when you put different kinds of workloads in the same cluster?
How do you make sure that one tenant doesn't affect the workloads of the other tenants? Rajan, do you have other things to add?
[00:17:40] Unknown:
Oh, yeah, I think we covered everything there. For catching more attention outside of Yahoo, from a larger community, I think the project has to have solid infrastructure in terms of making the system more reliable, providing monitoring APIs, and also providing an easier way to deploy the service onto various cloud providers. Many different contributors in the Apache Pulsar community have contributed a lot in terms of making Pulsar more reliable and providing different solutions so that people can easily deploy it on various cloud providers like AWS and Google Cloud and get started with Pulsar very quickly. That gives users a fair idea of how quick it is to deploy, and then once they start using it, with the throughput and the scalability that Pulsar provides, they start realizing the main idea behind using Pulsar as a pub-sub messaging framework for their use cases. A lot of things have been done, and we are putting in consistent effort to take it further forward.
[00:19:04] Unknown:
Yes, on the adoption part, I would just add that one of the main challenges there is that this is a critical part of infrastructure, so it's not a project that anyone can take on lightly. There's a lot of work to convince people that this is a viable alternative to other established products, and that it is worth giving it a try, because we believe that many of the operational and scalability issues are not issues with Pulsar, and that there are several advantages.
[00:19:45] Unknown:
And how much has the overall system design and architecture evolved over the course of the project's lifetime? And are there any decisions that were made early on that have proven to be problematic for being able to grow and evolve the project as new use cases or requirements are introduced?
[00:20:05] Unknown:
I don't recall any big obstacle. I think the architecture is very solid in the sense that the main idea with Pulsar is that you have a pub-sub layer that is based on a distributed log. That's a very flexible architecture, and it lets us support very different use cases. For example, with Pulsar you can have different types of consumers. You can consume your data in order, like you do in Kafka: you have a topic, you have partitions, and you can have a consumer per partition that consumes all the data in order. But you can also use the more traditional queue approach, in which you have a single shared queue with multiple consumers that are all taking items out of the same queue. We can support this because we have this log underneath and then we have this serving layer, the broker, on top of it. That has not changed since inception.
We've just been adding a few more features here and there. We are adding more components, but not really any dramatic changes from the beginning.
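To make the two consumption models concrete, here is a minimal sketch with the Pulsar Java client; the service URL, topic, and subscription names are illustrative. A Failover subscription gives the ordered, streaming-style behavior, while a Shared subscription on the same topic behaves like a work queue.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SubscriptionModesSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // illustrative broker address
                .build();

        // Streaming style: a Failover subscription keeps a single active consumer per
        // topic (or partition), so messages are processed strictly in order.
        Consumer<byte[]> ordered = client.newConsumer()
                .topic("persistent://my-tenant/my-namespace/events")
                .subscriptionName("ordered-processor")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Queuing style: a Shared subscription fans messages out across any number of
        // consumers attached to the same subscription, like a work queue.
        Consumer<byte[]> worker = client.newConsumer()
                .topic("persistent://my-tenant/my-namespace/events")
                .subscriptionName("worker-queue")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // ... receive and acknowledge messages, then clean up.
        ordered.close();
        worker.close();
        client.close();
    }
}
```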
[00:21:28] Unknown:
And for somebody who wants to start using Pulsar in their own environment, what are some of the infrastructure and network requirements that they should be considering, and what's involved in deploying those various components? There are
[00:21:42] Unknown:
no special requirements. The minimum requirements are very usual; I would say at least three nodes if you want to have the data replicated. In terms of deployment, we have instructions on our website for deploying in very different environments, from bare metal with Ansible scripts, to AWS with Terraform, and so on. With Kubernetes we also have recipes already available, so in just a few commands you can spin up a cluster in the cloud or in a Kubernetes cluster and have Pulsar running. Of course, a massive deployment will probably require some care in choosing the right hardware, but that is just a matter of doing some capacity planning and deciding how much bandwidth you need and what kinds of disks are better for these use cases.
[00:22:48] Unknown:
Yeah, I'll add a very quick overview of the deployment. Pulsar has three major subcomponents that need to be deployed. Two we already talked about: the Pulsar broker, which is the serving layer, where clients connect to the broker and it passes messages through from producers to consumers; and, for the persistent storage, Apache BookKeeper, where we can start with at least three bookie nodes, as Matteo mentioned, if we want to keep two copies of the data. And there's a third component that Pulsar uses, which is ZooKeeper.
Pulsar uses ZooKeeper to store the cluster metadata and the topic metadata. It stores the configuration information in ZooKeeper, and also uses it for the ownership and synchronization of namespaces and topics. So Pulsar requires these three subcomponents to be deployed, and as Matteo mentioned, there are instructions on the website for deploying onto a bare metal cluster, onto Kubernetes, onto Google Cloud with Kubernetes, or onto AWS.
Those instructions and recipes are already available, so with a few minutes of effort you can deploy it very easily. Then, based on the capacity and the traffic, those nodes can be adjusted.
[00:24:40] Unknown:
And as you're scaling the cluster, whether it's vertically or horizontally or across geographically distributed regions, what are the different factors along which you can scale, and where do things start to break down that people should pay special attention to as they start to increase the overall capacity of the cluster?
[00:25:02] Unknown:
One of the distinctive features of Pulsar, since we have these two layers, is that it's very easy to expand the capacity of a running cluster. That is because the data for a single topic, or a partition of a topic, is not stored on any one node in its entirety; the data is split into segments over time, and that lets us expand the capacity. If we need more storage capacity, we can just add more storage nodes over time, and the same thing can be done for brokers. If you see that the CPU or the bandwidth of the brokers is running high, you can just add more brokers. There's a load manager component that automatically takes care of assigning topics to brokers, and it will also use the load information of each broker to offload some of the topics to the less loaded brokers. These basic administration operations happen automatically, so you can just spin up new nodes and the traffic will move there.
[00:26:11] Unknown:
And in the event of a network partition, where does Pulsar fall in terms of the CAP theorem? Does it continue to operate in a partially degraded mode, or does it cease to accept new inputs until the network partition is healed?
[00:26:29] Unknown:
So for the consistency model, brokers own topics, and they acquire that ownership through ZooKeeper, so the consistency level is granted by ZooKeeper. That's what gives a broker the ownership, and that ensures the consistency. The consistency is also validated at the BookKeeper level: there is a fencing mechanism so that if a new broker takes over for any reason, it is able to fence off the old writer. So consistency is guaranteed; this is more like a CP system.
But, again, there are different trade-offs. ZooKeeper is a CP system, but you also have multiple nodes to give you availability, so if you have a bigger quorum, you have more chances of availability and you don't have a single point of failure.
[00:27:32] Unknown:
And over the course of our conversation, we've mentioned Kafka a number of times as being a somewhat comparable system in terms of the way that people might use the two projects. So I'm wondering if there are any other projects or services that you would consider to be competitors to Pulsar, and if you can do a bit of compare and contrast as you introduce some of those different projects.
[00:27:57] Unknown:
Sure. Setting aside all the reasons that we enumerated at the beginning — the scalability, the geo-replication, the features that were not there in any other system — I think that interface-wise and messaging-model-wise there is certainly similarity with Kafka and other systems; I would also mention RabbitMQ. I would say there are two main families of messaging systems. One comes from log aggregation and eventually evolved into streaming: Kafka was designed initially as a log aggregation system, and that is reflected a lot even in its current state.
Eventually it evolved into streaming and is a streaming platform now. Then you have the more traditional message queues like ActiveMQ and RabbitMQ; these are the traditional model of pub-sub, with a much more flexible API and a richer set of features for how to route messages and how to consume them. You don't have to keep track of your offset in a pub-sub system, typically. The main issue with all of these traditional pub-sub systems is that they were always lacking in the storage part: they didn't have a very solid storage and replication story. So you get a very flexible API, but the persistence guarantees were a bit weak on that front. On the other end, you have the log-based systems, with replication and so on, but they're a bit less flexible. For example, it's not trivial to build a queue on top of Kafka, because in Kafka each partition has one consumer; you cannot attach multiple consumers, you can only add more partitions, and that makes building a worker queue a very tricky operation.
In Pulsar — to cut it short — since we have both the streaming and the queuing models on top of our log, we can offer the user a very high-level, easy-to-use API that supports both queuing and streaming, backed by a very solid, horizontally scalable storage layer.
[00:30:16] Unknown:
And is it possible for you to have two different consumers that are consuming the same topic in different fashions, where one of them is using the more log-oriented approach, tracking an offset and consuming the data in a streaming fashion, and then you have other consumers that are treating it as just a shared queue and pulling the most recent messages off of it? Yes. That's correct. So
[00:30:40] Unknown:
the idea there is that when you publish to the topic, you don't have to worry about how it is going to get consumed. On the consumer end, you can have different subscriptions, and each subscription is independent from the others.
[00:30:58] Unknown:
And I know, too, that there's an API compatibility layer for people who are using Kafka. They can ostensibly just drop Pulsar into their system, add in that API layer, and continue to use their consumers and producers as if nothing had changed. I'm wondering if you have any other similar compatibility layers, for example for being able to provide an AMQP interface.
[00:31:27] Unknown:
Not right now for AMQP; that's something that we are thinking of. The thing about Kafka is that most people are tied to the API, so it's very easy for us to build this wrapper, and the wrapper is just using the Pulsar client library underneath. With AMQP, everyone can use a different client library; there's no single official API. So on that front it would probably make more sense for us to build a service that speaks AMQP, but that's a lot of work to do, so that's still up in the air.
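For context, the idea of the wrapper is that existing Kafka producer and consumer code keeps compiling against the familiar Kafka classes while the Pulsar client does the work underneath. A rough sketch, assuming the pulsar-client-kafka compatibility artifact is swapped in for the regular Kafka client jar; the service URL and topic name are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaWrapperSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The only code-level change: the "bootstrap" URL points at a Pulsar broker.
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Same Kafka API as before; under the hood it is the Pulsar client publishing.
        Producer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("my-topic", "key-1", "hello from the Kafka API"));
        producer.close();
    }
}
```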
[00:32:08] Unknown:
Yeah, the overall flexibility of the AMQP protocol, I can imagine, would cause a lot of complexity in terms of being able to provide a wrapper that fills all the various different use cases, and you'd probably be better suited to building specific implementations where you want to, for instance, replace RabbitMQ for interfacing with somebody who might be using Celery with Python or some of the other task-based systems.
[00:32:34] Unknown:
Yeah, that's correct. AMQP is a very big specification, and implementing the full spec would be a lot,
[00:32:44] Unknown:
a good good amount of work. And going back to the Kafka compatibility, I'm wondering if that goes so far as to also support some of the other tooling and plugins that are built around the Kafka ecosystem such as the confluence schema registry or the Kafka connect, implementations?
[00:33:07] Unknown:
So right now we don't support the schema registry and the tooling around that, but I think that shouldn't be a problem, because the schema registry goes out of band from the messaging protocol; it's not something that happens in line with the messaging, so it can be done. On the other hand, we are working on having a schema registry in Pulsar itself. That's something that we want to integrate within our messaging protocol to ensure that the schema can be validated and enforced. So one option would also be to use our own schema registry to validate that as well.
[00:33:54] Unknown:
And again on the subject of compatibility with Kafka, at least in terms of the conceptual models people have about it, is the idea of message persistence. I know you mentioned that Pulsar has the ability to maintain the log of messages that have been published to a topic, and when I was looking through the documentation, it looks like it's configurable as to how long those individual messages are kept around. So I'm wondering what the long-term storage capacity for a Pulsar system looks like, as compared to how somebody might use Kafka for something like an event-oriented system where they can replay the entire sequence of events from beginning to end, and how long those events are persisted
[00:34:41] Unknown:
for? Sure. I will say that one big difference from Kafka is that in Pulsar, the broker actually knows the position of each consumer. Because of that, the default, for example, is to delete the data as soon as all the subscriptions have consumed it, so you don't need to have time-based retention. And if some consumer is falling behind because it's down for several hours, you will not lose data based on retention, because the retention time is in addition to the subscription-based retention. You can say that I want to keep my data for at least 6 or 12 hours or 3 days, or whatever time; it just depends on disk space. The other interesting difference on the storage part is that with BookKeeper — since, as I keep repeating, the data for a single topic is not stored on a single node — that is very important, because in Kafka a partition has to be stored in its entirety on a single Kafka broker. So you basically cannot retain data for more than a certain amount of time, because once you fill up the single disk of a machine, the partition cannot grow anymore.
With Pulsar and BookKeeper, we can keep adding new segments, and these segments are placed on different storage nodes. So if you fill up all your disk space, you can just add more nodes and keep writing into those nodes. Your storage capacity is not limited by a single node's capacity.
[00:36:28] Unknown:
Yeah, and one more thing. As Matteo mentioned, the biggest difference between Pulsar and Kafka is that the Pulsar broker knows exactly what the read position of each consumer subscription is, so as soon as every subscription has read the data, the Pulsar broker can purge it immediately. The second thing is retention, where we can define how long we want to keep messages. And the third thing Pulsar supports is message TTL, where you can configure at the namespace level how long you want to keep your messages; once the time-to-live of a message has expired, it will be deleted whether or not a subscription has read it.
So for subscriptions that are building up a backlog because they aren't able to consume messages for a long time, which grows the storage, we have two options. One is to set the message TTL at the namespace level, so the data is deleted after a certain duration. The second is to force publishers to stop producing messages until the consumers have caught up. Let's say we have a limit at the tenant level that a specific tenant can consume a certain percentage of the storage of the entire cluster, say 50 gigabytes. Then we can enforce that publishers can only publish up to 50 gigabytes of data, and if consumers are not consuming the messages, we don't want to lose data, so we force the publishers to stop publishing.
So it supports various kinds of use cases where, based on the retention requirement, you can either stop publishers from publishing messages or expire the messages, so your publishers can keep publishing without hitting your actual storage limitation.
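Those three knobs — retention of acknowledged data, message TTL, and a backlog quota that pushes back on publishers — are all namespace-level policies. A hedged sketch with the Java admin client; the names and limits are illustrative, and the exact admin API classes have shifted a bit across releases (newer ones build BacklogQuota through a builder rather than a constructor).

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BacklogQuota;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class RetentionAndQuotaSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();
        String ns = "my-tenant/my-namespace";

        // Keep acknowledged messages for up to 3 days or 100 GB, whichever comes first,
        // on top of the default "delete once every subscription has consumed it" behavior.
        admin.namespaces().setRetention(ns, new RetentionPolicies(3 * 24 * 60, 100 * 1024));

        // Expire unacknowledged messages after 24 hours regardless of subscriptions (message TTL).
        admin.namespaces().setNamespaceMessageTTL(ns, (int) TimeUnit.HOURS.toSeconds(24));

        // Cap the backlog at 50 GB and hold further publishes until consumers catch up,
        // instead of silently dropping data.
        admin.namespaces().setBacklogQuota(ns,
                new BacklogQuota(50L * 1024 * 1024 * 1024,
                        BacklogQuota.RetentionPolicy.producer_request_hold));

        admin.close();
    }
}
```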
[00:38:44] Unknown:
And also, having the ability to expire messages based on the overall amount of time that a message has spent in the queue can be valuable when the data has some sort of expiration period inherent to it. If somebody wants to make a reservation for a particular time and that time has gone by, then it doesn't matter anymore for it to be processed, so you can just purge that data without ever having to use up the CPU cycles on the consumer side. It's actually a use case that was mentioned when I was speaking with Sean T. Allen about Wallaroo.
That was a use case that he mentioned people requesting as well, again in a streaming context, where if the data has some inherent life cycle and you can't process it before the end of that life cycle, then you don't want to waste the computation on processing it when it's not necessary anymore.
[00:39:41] Unknown:
Yes, yeah. Each use case has very different trade-offs, and it really depends on the application. There are very different properties of the data and the kinds of events that are being pushed into topics, and we try to be as flexible as possible to accommodate all kinds of use cases and make everything work in a single cluster.
[00:40:09] Unknown:
And given the fact that Pulsar is so scalable and adaptable and tunable, it definitely seems like there is a large variety of use cases that it can be applied to. But when is Pulsar the wrong tool to use, or what are some use cases where you think that somebody should use something besides Pulsar?
[00:40:29] Unknown:
I think we have seen two use cases at Yahoo where people tried to use Pulsar and we realized it's not a good match. The first is that Pulsar should not be used as a database, where people publish messages and then want the messages to be stored and treated like a database; that's definitely not a good use case. The second is that sometimes people wanted to use it as a scheduler system, where you publish a message and you want to consume it at a particular, specifically defined time.
Since you would basically be building a scheduler on top of Pulsar, I think, again,
[00:41:10] Unknown:
Pulsar might not be the right system there. Although I would correct that slightly: both of us have seen very bad scheduler implementations on top of Pulsar, but I think it is possible to do it in the right way. It's just that we have seen various bad implementations.
[00:41:30] Unknown:
Yeah. Anything else you would like to add, Matteo, about where Pulsar might not be the right tool? Yeah, so,
[00:41:36] Unknown:
I think the strong points are — I mean, it's a flexible system that you can adapt to very different things. The strong points are in the data durability guarantees, the storage, the low latency, and the high throughput. And another thing that we haven't mentioned is that in BookKeeper the guarantee is that the data has been flushed to disk. So we are actually able to offer very low SLAs on publish latency, and high throughput, while making sure the data is actually flushed to disk. In other systems, like Kafka, the default is to let the page cache handle the write, so the write is actually in memory, in the OS memory, and then the OS will flush it. So when you get the acknowledgment from Kafka, what you have is the data in the machines' memory, and you still have a lot of chances of losing that data if you have a power outage or those kinds of scenarios.
[00:42:37] Unknown:
And are there any particular new features
[00:42:41] Unknown:
or overall system improvements that you have planned for future releases of Pulsar? I would say that the main ones from our end are adding this schema registry, having a way to enforce the format of the data, and having a way to also discover it, so that you can build tools that will discover what kind of data was published on a topic, as well as validating and enforcing that information. Also, one of the bigger features is topic compaction, which basically gives you a snapshot of the topic based on keys. And Rajan probably has other features that he's working
[00:43:19] Unknown:
on. Yeah. So I think those are the main features, and the other feature we are trying to build is a replication service that can replicate from Pulsar to any other system. It should be pluggable. Let's say a user is publishing messages into the local cluster and would also like to replicate the same messages to a different system — say Kinesis on AWS, or some other non-messaging system. Right now, users need to write their own implementation and do dual publishing: they publish to Pulsar, and they also publish to the other system. So what if we could provide a feature similar to what geo-replication does within Pulsar? In geo-replication, the replication happens from cluster A to cluster B within the Pulsar broker; the same could be done for other systems by adding a plugin framework within the Pulsar replication service. That way it reduces the load on the client side, and the client doesn't need to pay any attention to, or worry about, the replication to any other system.
So we are building that pluggable framework, where we can plug in a replication connector that talks to a different end system and reliably replicates the messages. That is the other big thing we are working on right now to help our users.
[00:45:01] Unknown:
Are there any other topics or aspects of the project that we didn't cover yet that you think we should talk about before we start to close out the show? I was thinking
[00:45:12] Unknown:
we could cover the multi-tenancy within Pulsar, which covers security and probably other features like isolation and resource quotas.
[00:45:27] Unknown:
And one of the other features that we didn't talk about yet is the fact that Pulsar was designed to provide for a multi-tenant deployment, where you can have isolation between different users of the system. So I'm wondering if you can talk a bit about how that's implemented at the system level, and some of the security capabilities that you have built in for operators of the cluster to ensure that you don't have data leakage between the different tenants of the system, as well as some of the use cases that it provides for in a hosting context, for example, that would be impractical for some of the other systems that somebody might use in that kind of environment?
[00:46:07] Unknown:
So in Pulsar we have added the multi-tenancy feature at multiple layers. Right now, Pulsar provides authentication and authorization capabilities, where we can plug in any authentication framework. By default, Pulsar provides the Athenz authentication framework, which is another open source project from Yahoo. So authentication is a pluggable framework that Pulsar provides, which authenticates the user and gets the principal role, and based on that role, Pulsar does the authorization. Right now the authorization is central.
It's not pluggable, though. So it provides tenant identification and then gives the user access to the cluster for publishing and consuming messages. Another thing is resource isolation, where it tries to protect or isolate resources in terms of storage and in terms of CPU. For storage, as we discussed earlier, in Pulsar we can configure a resource quota for how much storage space a tenant can acquire. At the namespace level, we can define the maximum storage capacity that a tenant can use, and beyond that, not allow the tenant to publish more messages or store more data on disk. In that way, it protects against a specific tenant acquiring all the storage space within Pulsar. Regarding CPU isolation, in Pulsar we have introduced message consumption throttling.
We can configure, for any namespace or any particular topic, how many messages a consumer can receive. Sometimes many different consumers try to consume as fast as possible, and on a particular topic where the message throughput is pretty high, that specific set of consumers could eat up all the network and CPU resources of the cluster. To prevent that, we can define a throttling limit per consumer or per topic, which protects against those specific consumers taking up the cluster's CPU and network bandwidth. Those are the main multi-tenancy features Pulsar provides.
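As one concrete example of that consumption throttling, dispatch rates can be set per namespace through the admin API. A sketch assuming the Java admin client and the constructor-style DispatchRate that older releases expose (newer ones use a builder); the numbers and names are illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.DispatchRate;

public class DispatchThrottlingSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://pulsar-admin.example.com:8080") // illustrative URL
                .build();

        // Cap dispatch on every topic in the namespace to 1,000 messages per second or
        // 10 MB per second, whichever is hit first, measured over a 1-second window,
        // so a fast-draining consumer cannot monopolize broker CPU and network.
        admin.namespaces().setDispatchRate(
                "my-tenant/my-namespace",
                new DispatchRate(1000, 10 * 1024 * 1024, 1));

        admin.close();
    }
}
```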
[00:48:45] Unknown:
Matteo, would you like to add anything else? Yeah, just to mention that these are all soft isolation, software-level protections that prevent tenants from affecting each other. But the main core feature that enables us to provide multi-tenancy is the IO isolation that happens at the storage node level. That happens because the read path and the write path are separated. You have different devices: a journal device that the writes are written to (and then cached in memory), and a storage device, which is a separate physical disk, or multiple disks. What that gives you is that all the reads come from the storage device, while writes are only synced on the journal device. So the two IO paths are separated, and if some reader is causing a lot of disk IO, it won't impact the writers. It won't affect the SLA of publish latency or throughput for other tenants, and that's the feature that enables us to offer this as a service.
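That device separation is a bookie-side setting. A minimal sketch of the idea using BookKeeper's server configuration class; the paths are illustrative, and in practice the same settings usually live in bookkeeper.conf as journalDirectory and ledgerDirectories rather than being set in code.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class BookieIoIsolationSketch {
    public static void main(String[] args) {
        ServerConfiguration conf = new ServerConfiguration();

        // Write path: every write is appended and synced to the journal on a fast device.
        conf.setJournalDirName("/mnt/fast-ssd/bookkeeper/journal");

        // Read path: entries are periodically flushed to ledger storage on separate disks,
        // and reads are served from here, so catch-up readers never touch the journal device.
        conf.setLedgerDirNames(new String[] {
                "/mnt/disk1/bookkeeper/ledgers",
                "/mnt/disk2/bookkeeper/ledgers"
        });
    }
}
```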
[00:49:55] Unknown:
So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final parting question, to give the listeners one more thing to think about: from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:50:19] Unknown:
That's a good question. I would say that there are very different systems with very different trade-offs, and there are not many truly scalable systems. This is something that, when you choose a system, people typically don't worry about that much, and all the pain points that you have when scaling you only see after a while, after you get acquainted with a certain system and figure out all the operational pain points, or when things get more complex due to scale. So I think this is something that Pulsar has done a lot of work on, and I think we have closed the gap in this particular area of messaging — not broad data management, but having a single platform for sharing data across different components in a very solid and scalable way. Rajan, do you have any thoughts on that topic?
[00:51:27] Unknown:
Yeah, I have a similar belief. Whenever we build such a scalable, distributed system, it includes multiple other components; in the example of Pulsar, it depends on two other components, Apache BookKeeper and ZooKeeper. At Yahoo we operate at a very large scale of data and traffic, in terms of the number of messages going in and out, and as it scales, we find many different pain points, which could be within the system or within the different subcomponents. At some point you hit a particular limit where you need some big change or modification to scale further. It could be within Pulsar itself; it could be within ZooKeeper, where it can only support a certain number of nodes and you cannot go beyond that; or it could be in the storage. You really have to come back to the drawing board and think about what could be done for the next step, to build the next level of scalability.
But those things are always a good learning curve: you hit the first milestone of your scalability limit, and then you try to improve it, make it more scalable, and take it further. It's always a learning opportunity when you hit that limit.
[00:53:09] Unknown:
Alright. Well, I want to thank the both of you for taking the time out of your day to join me and discuss the work that you're doing with Pulsar. It's definitely a very interesting project and one that appears to solve a lot of interesting problems, so I'm definitely going to be keeping an eye on how it evolves over the coming months and years. So thank you again for your time, and I hope you each enjoy the rest of your day. Thank you for having us and for all the interesting questions. Thank you.
Introduction to Guests and Pulsar
The Origin and Motivation Behind Pulsar
Geo-Replication and Consistency in Pulsar
Challenges in Building and Promoting Pulsar
System Design and Evolution
Comparing Pulsar with Other Messaging Systems
Message Persistence and Storage Capacity
Use Cases Where Pulsar Excels and Falls Short
Future Features and Improvements
Multi-Tenancy and Security in Pulsar
Biggest Gaps in Data Management Tooling
Closing Remarks and Future Outlook