Summary
The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Matteo Merli about Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what BookKeeper is and the story behind it?
- What are the most notable features/capabilities of BookKeeper?
- What are some of the ways that BookKeeper is being used?
- How has your work on Pulsar influenced the features and product direction of BookKeeper?
- Can you describe the architecture of a BookKeeper cluster?
- How have the design and goals of BookKeeper changed or evolved over time?
- What is the impact of record-oriented storage on data distribution/allocation within the cluster when working with variable record sizes?
- What are some of the operational considerations that users should be aware of?
- What are some of the most interesting/compelling features from your perspective?
- What are some of the most often overlooked or misunderstood capabilities of BookKeeper?
- What are the most interesting, innovative, or unexpected ways that you have seen BookKeeper used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on BookKeeper?
- When is BookKeeper the wrong choice?
- What do you have planned for the future of BookKeeper?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Apache BookKeeper
- Apache Pulsar
- StreamNative
- Hadoop NameNode
- Apache Zookeeper
- ActiveMQ
- Write Ahead Log (WAL)
- BookKeeper Architecture
- RocksDB
- LSM == Log-Structured Merge-Tree
- RAID Controller
- Pravega
- BookKeeper etcd Metadata Storage
- LevelDB
- Ceph
- Direct IO
- Page Cache
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census
[00:01:33] Unknown:
today to get a free 14-day trial. Your host is Tobias Macey, and today I'm interviewing Matteo Merli about Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads. So Matteo, can you start by introducing yourself? My name is Matteo Merli. I've been involved with BookKeeper for several years now. I mostly use BookKeeper in the context of Apache Pulsar. That's what we started to use at Yahoo back in the day, where we built a messaging platform on top of Apache BookKeeper. That's how I got into it. But, yeah, I'm on the PMC of BookKeeper, and I'm the PMC chair for Apache Pulsar too. And you recently joined the StreamNative team as well? Yes. Continuing work on Pulsar and BookKeeper in the community, and also helping companies adopt these technologies.
[00:02:18] Unknown:
And also for people who have been listening to the podcast for a long time, they've heard you before when we talked about Pulsar when you were still at Streamlio. And so for people who haven't heard you on the show before, can you just share quickly how you first got involved in the area of data management?
[00:02:33] Unknown:
Sure. So I started working at Yahoo about 10 years ago. Yahoo at that time was deeply invested in database systems, and it was a very good place to be to learn a lot. One thing that I started working on was database replication in particular. We had several different kinds of technologies, each developed in a different era with different designs and different trade-offs and so on, and we saw all the problems that we had in operating those systems at a large scale. BookKeeper was a project that came out of Yahoo Research. At that point it was more conceptual; it was not in production yet, but we thought it was a really good foundation for the data platforms we wanted to build and all this kind of
[00:03:15] Unknown:
data movement at a very large scale. So that's kind of where I come from. And so digging a bit more into BookKeeper itself, can you describe sort of what it is and some of the capabilities that it offers and some of the story behind it? You mentioned that early on, when you first started working on it, it was still conceptual. Wondering if you can just share a bit more of the history of it, and maybe how it went from a proof of concept into something that you're actually using at production scale.
[00:03:42] Unknown:
BookKeeper is very simple in what it offers. It gives you the abstraction of a replicated log. Think of a log as an append-only data structure. It's a service where you can create logs, append entries into these logs, and read those entries out. So it's very simple. At the same time, it is very powerful, because you can have many logs, lots of logs, and once you write into these logs, the data is safe. It is replicated. It is consistent. And if you crash and come back, the data will still be there. If nodes fail, the data will get copied over. This primitive of an immutable log is very powerful because it is very scalable as well.
So you can have many logs and write a lot of data, even on one log or across many logs. So that's where it comes from. The main reason BookKeeper was created, the initial use case that the creators had in mind, was to provide high availability for the HDFS NameNode. The NameNode had been a problem for many years in Hadoop, because it was a single point of failure. One of the ideas there was: if you can write the edit log updates of the NameNode into something that is replicated and durable outside the NameNode, then the NameNode will not have any local state at any point. You can reconstruct it, so you can have very quick failover for NameNodes. Eventually, there were a lot of things happening in the Hadoop community and it didn't adopt BookKeeper. It was being developed as a sub-project of ZooKeeper; the same people that did ZooKeeper also worked on BookKeeper initially.
That's when we started looking at it. Initially, we adopted it because we had ActiveMQ clusters, and we wanted higher availability for those ActiveMQ clusters, basically using BookKeeper as the storage for ActiveMQ. We implemented this within Yahoo, this ActiveMQ-with-BookKeeper service, with geo-replication and a lot of other features, and I think it still exists in some form. It was working very well, except that we had a lot of issues with ActiveMQ and JMS and so on. That really was a bad fit in terms of the developer experience with JMS and ActiveMQ. What we realized was that BookKeeper was really good. So we kept all the storage part that we had, restarted from scratch on the broker part, and we put Pulsar on top of BookKeeper.
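The replicated-log abstraction Matteo describes can be sketched in a few lines. This is an illustrative, in-memory model only, not BookKeeper's actual client API (which in reality involves a cluster, ensembles, and digests): you create a log (a "ledger"), append entries, read them back by entry id, and seal the ledger so it becomes immutable.

```python
# Hedged sketch of the BookKeeper ledger abstraction: an append-only,
# eventually-sealed log. All names here are hypothetical simplifications.

class Ledger:
    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.entries = []        # append-only: entries are never modified
        self.closed = False

    def append(self, data: bytes) -> int:
        """Append an entry; returns its entry id (its position in the log)."""
        if self.closed:
            raise ValueError("cannot append to a sealed ledger")
        self.entries.append(data)
        return len(self.entries) - 1

    def read(self, first: int, last: int):
        """Read a contiguous range of entries back out (inclusive)."""
        return self.entries[first:last + 1]

    def close(self):
        """Seal the ledger: after this it is immutable forever."""
        self.closed = True

lh = Ledger(1)
for i in range(3):
    lh.append(f"record-{i}".encode())
lh.close()
print(lh.read(0, 2))   # [b'record-0', b'record-1', b'record-2']
```

The sealing step matters: as discussed later in the episode, it's the immutability of closed ledgers that makes things like segment-by-segment offload to object storage straightforward.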
[00:06:08] Unknown:
In terms of the use cases for for Bookkeeper, you mentioned that it had a number of different inspirations as to what it was originally intended for, and it has kind of morphed into being used for a number of different cases. And, you know, 1 of the most well known use cases is as the storage layer for Pulsar, as you mentioned. And I'm wondering if you can just talk through some of the other types of systems that have been built on top of bookkeeper because of the fact that it is conceptually simple, but in terms of its capabilities, it's quite powerful due to the way that it's designed.
[00:06:39] Unknown:
Sure. One of the use cases that we have implemented with BookKeeper in Pulsar is not just storing the data; we also keep track of the cursors. The cursor contains which position in the topic your consumer is at, but also whether it has consumed messages out of order. We track that in a BookKeeper ledger, in the log, so we just keep updating that log with all the updates. Another use case is as a low-level storage API kind of system. You can use BookKeeper to build your own database, and I think several people actually did that, in open and closed source, in different forms and different ways. You can use this log, essentially, as a transaction log for the database.
Once you write into this transaction log, used as a write-ahead log, your write is safe and you can recover it. If you crash, you can always recover from the log. And, actually, we do have one implementation of that in BookKeeper itself, which is called the table service. Essentially, it offers a key-value store API, and it is built on top of BookKeeper ledgers. It has the notion of a snapshot and a log, and both are ledgers. So you can use the ledger as a log, but you can also use the ledger as a blob, in a way. You can write many entries and close it, and it's very flexible in that sense.
It's a less common use, but, yeah, you can also use it as blob storage, and in many cases it has been used as blob storage. For example, for Pulsar Functions we needed a storage layer to store the JARs. You can submit a function as a JAR, as an uber JAR, and
[00:08:13] Unknown:
these have to be stored in the system. So we just store them in BookKeeper, because it's available, and it works for that as well. Yeah. It's definitely interesting that because of its flexibility you're able to use it for all these different pieces. I know that the biggest analogous system to Pulsar and BookKeeper in the data ecosystem is Kafka, and Kafka, for all of the surrounding capabilities, has all these additional deployable pieces, whereas because Pulsar is architected on top of BookKeeper, it can take advantage of that storage layer abstraction to store the functions' code and, as you mentioned, keep track of the client positions within the streams that they're reading. That's interesting. And then in terms of the work that you're doing on Pulsar, as one of the biggest projects that is actually using BookKeeper, I'm curious if you can share some of the ways that the requirements of Pulsar have influenced the features and the direction of BookKeeper itself.
[00:09:14] Unknown:
Yes, in many ways, in the sense that we did a lot of performance work both on Pulsar and BookKeeper, and that work always crosses the boundary; it's on both sides. When you optimize the data path, it means optimizing both Pulsar and BookKeeper. Another area was expanding the number of logs that you can store in a single node. Over the years, we rewrote the whole storage engine, and how the indexing of the entries works, to make it feasible to store millions of logs per storage node. The architecture has not changed, but the implementation has improved a lot in the past years to increase the vertical scalability of each single node. And that was in reaction to the needs that we had for Pulsar deployments.
[00:09:56] Unknown:
And then in terms of the actual architecture of a BookKeeper cluster, I'm wondering if you can just talk through the different layers of the system and how it interacts with the hardware, and some of the design decisions that you made given the high-level focus of BookKeeper as this simple, abstracted, append-only log structure?
[00:10:19] Unknown:
A BookKeeper cluster is composed of bookies, which are the storage nodes, and metadata is kept in ZooKeeper or a pluggable metadata store. Most of the replication logic is in the client library itself, and that is very powerful. For each log, you have the concept of a single writer. That was a very good design choice made at the very beginning, because it simplifies a lot of things in the replication. That single writer knows which entry has been pushed, which entry has been acknowledged, and so on, and there is basically no need for the storage nodes to talk to each other. The replication is done via parallel writes. A client will issue parallel writes and say, I want to write 3 copies; it waits for 2 acknowledgments, and then the write is good. That is all tracked by the client; the nodes have no interaction with that. So the implementation of a single storage node is like a very dumb box. It just has write-entry and read-entry, and that's it. There are no other operations or interactions that clients have with the storage nodes.
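The client-driven quorum write just described can be sketched as a small simulation. This is a hedged toy model, not BookKeeper's real wire protocol: the single writer fans an entry out to a write quorum of bookies and considers the entry durable once an ack quorum responds, while the bookies themselves stay "dumb boxes" that only store what they're told.

```python
# Illustrative simulation of client-side quorum replication: write 3 copies,
# wait for 2 acknowledgments. All names and numbers are hypothetical defaults.

class Bookie:
    """A 'dumb box' storage node: store-entry is its only write operation."""
    def __init__(self, up=True):
        self.up = up
        self.entries = []

    def store(self, entry: bytes) -> bool:
        if self.up:
            self.entries.append(entry)
        return self.up   # ack only if the node is alive and stored the entry

def write_entry(bookies, entry, write_quorum=3, ack_quorum=2):
    """Send `entry` to `write_quorum` bookies; succeed on `ack_quorum` acks.
    The client tracks everything; bookies never coordinate with each other."""
    acks = 0
    for bookie in bookies[:write_quorum]:
        if bookie.store(entry):
            acks += 1
        if acks >= ack_quorum:
            return True    # enough copies acknowledged: the write is good
    return False           # quorum not reached: the write fails

bookies = [Bookie(), Bookie(up=False), Bookie()]
print(write_entry(bookies, b"entry-0"))   # True: 2 of 3 copies acknowledged
```

In the real system the writer also tracks the last-add-confirmed entry id and handles ensemble changes on bookie failure; this sketch only shows why no server-to-server replication traffic is needed.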
The implementation on the storage node can be anything, right? Now, you could use a generic key-value store to build your bookie implementation. But if you look at things like RocksDB, they're very general purpose; they can't assume anything about the data. The keys can be updated in all sorts of ways. In our case, there are a few restrictions placed on the data itself. First, the data is immutable. Second, the data is sequential. That also means we expect the data for a single log to be written sequentially and to be read sequentially, and we try to optimize for that kind of use case. It may not always be true, but in most cases it is. Therefore, we can do better than a general-purpose key-value store. Essentially, BookKeeper is kind of an LSM-style implementation, because there are logs and compaction.
But based on the fact that we know about the restrictions placed on this domain, we can avoid, for example, a lot of the write amplification that you would have in a general-purpose key-value store, and do it in a more efficient way. In terms of the overall goals of BookKeeper
[00:12:28] Unknown:
and the design of the system, I'm wondering if you can just talk through some of the ways that it has changed or evolved over time as it went from this concept of something to replace the Hadoop NameNode to where it is now, where it's being used for streaming systems, particularly given the shift in deployment topologies to being run in the cloud as opposed to commodity hardware that is likely to be put into a physical rack somewhere, where you have to deal with nodes that might come and go, and with the data replication across them?
[00:13:00] Unknown:
Conceptually, in the overall design, there are not many changes. The abstraction is all the same: create a log, append entries, read entries, close the log, and delete the log. But as I said, we did lots of work in the storage layer, and part of that work was to adapt to new environments. When we started, all the deployments we had were basically on hard disks, on spindles. You had these nodes with, like, 12 hard disks behind RAID controllers with a write cache. So we had to figure out how to use the RAID controller: put the journals onto a separate set of devices in the RAID, enable the write cache there, and so on. With that kind of model we could use spindles but at the same time have low latency.
The main idea was that when a write goes to the bookie, it has to go to the journal, be fsynced on the journal, and only then be acknowledged. We wanted that protection so as not to lose data in case of a power outage. Down the road, we had to adapt to different models. First we had deployments with SSDs, and there are some different tunables there, some different choices. Eventually, we also had deployments in the cloud, and that changes the equation, because in the cloud you can't expect, let's say, a RAID controller. You can do a soft RAID of multiple devices, but there's not really a RAID controller with a battery-backed write cache.
So you have different choices. You can either use ephemeral storage, or you can use replicated storage like EBS. There are pros and cons to each of these, and we have always had deployments on both. For example, ephemeral storage, these locally attached SSDs in the cloud, is very similar to a bare-metal deployment with real disks, except that the life cycle of these disks is not as durable as in bare metal. In bare metal, as long as you write to the disk and sync to the disk, you can recover the data unless the disk physically breaks down, whether it's a spindle or an SSD. But when you write to an ephemeral disk in the cloud, if your VM is terminated, so is your disk.
And that raises the question: what is the value of writing to a journal, of syncing on the journal? If syncing protects you from a power outage, but on a power outage your VM is gone and so is your disk, the protection isn't buying you much. Because of that, we have deployments that either don't sync on the journal, or really don't have journals at all, skipping the journal and relying just on replication. For some use cases, again, this is relaxing some of the durability guarantees. It is not for everyone and not for every use case, but it's a good model, for example, when you have a really big data volume and that data is either re-ingestable or it's okay to lose some portion of it. Alternatively, we can deploy on EBS. We did some work to optimize for that and all the tunables that come with EBS, and that falls in the same durability profile as before. Actually, now with EBS you can also count on it
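The journal-first write path Matteo describes, fsync before acknowledging, with the option to relax it on ephemeral cloud disks, can be sketched as follows. This is a hedged simplification of a bookie's internals, not the actual implementation; the length-prefixed record format is an assumption for illustration.

```python
# Sketch of a durable journal append: the entry is written, flushed, and
# fsync'd to disk BEFORE the bookie acknowledges, so an acknowledged write
# survives a power loss. With sync=False (the ephemeral-disk trade-off
# discussed above), durability relies on replication instead.
import os
import tempfile

class Journal:
    def __init__(self, path, sync=True):
        self.f = open(path, "ab")
        self.sync = sync   # disabled in some cloud/ephemeral deployments

    def append(self, entry: bytes) -> bool:
        # length-prefix the entry so it can be replayed after a crash
        self.f.write(len(entry).to_bytes(4, "big") + entry)
        self.f.flush()
        if self.sync:
            os.fsync(self.f.fileno())   # durable on disk before we ack
        return True                     # only now acknowledge to the client

path = os.path.join(tempfile.mkdtemp(), "journal.0")
journal = Journal(path)
journal.append(b"entry-0")
print(os.path.getsize(path))   # 11: 4-byte length prefix + 7-byte entry
```

A real bookie batches many entries per fsync (group commit) to amortize the sync cost; the single-entry version here just makes the ordering (write, sync, ack) explicit.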
[00:16:16] Unknown:
for more availability as well. Another interesting design choice in the way BookKeeper is structured is the fact that it's a record-oriented storage medium, whereas various other distributed storage systems rely more on replicating byte arrays and byte ranges and don't necessarily have any concept of the record itself. And I'm wondering if you can talk through some of the impact, the benefits and trade-offs, that that record-oriented storage layer has in BookKeeper, particularly in terms of how it factors into data distribution and allocation across the cluster, given that the different records might be of varying sizes?
[00:16:55] Unknown:
At the end of the day, yes, each system will write bytes down on the disk. But one of the additional ideas in BookKeeper is that you can also do striping of entries across multiple nodes. For example, there's the concept of ensemble size: which bookies should I use for this log, essentially. I could choose to write, say, 3 copies of my data, so that each piece of data is replicated 3 times, but at the same time use, say, 5 bookies. At that point we're doing round-robin striping across the 5 bookies. A single node will not have all the data, but the client knows, for each entry, which 3 nodes out of the 5 are supposed to have it. One of the things this feature achieves is that your throughput on a single log is not limited by a single node's throughput; you can exceed that amount very easily. This is something you can do when you have this logical breakdown of a stream into entries and you know where they are. The challenge with varying entry sizes is that the write side is typically fine; it's on the read path that we don't know upfront how many entries to read. Say the client wants at most 500 KB of buffer.
I don't want to read 5 megabytes and then have to allocate far too much memory. But we don't know upfront how big these entries will be until we read them. We struggled a bit with this for a while, and a while ago we added an implementation that estimates averages and corrects those averages over time. Essentially, we know how many bytes per entry a particular log has, and we use that as a heuristic when we want to read a batch of entries from BookKeeper. That has worked very well, so I don't think it's a big issue at this point, but it is definitely something you have to think about upfront: how much memory am I going to need? Especially if you're not reading from one single log, but in your single process, your single broker, you're reading from many, many logs.
Each of them can be issuing batch reads, and you have to basically account for the max memory.
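Both ideas from this answer, deterministic round-robin placement over an ensemble, and the running-average heuristic for sizing batch reads, can be sketched briefly. This is an illustrative model, not BookKeeper's actual placement code; the 5-bookie ensemble and 3-way write quorum match the numbers used in the conversation.

```python
# Hedged sketch of round-robin striping: with an ensemble of 5 bookies and
# 3 copies per entry, the client derives an entry's location purely from its
# entry id, so no single node needs to hold the whole log.

def bookies_for_entry(entry_id, ensemble, write_quorum=3):
    """Return the bookies holding `entry_id` (round-robin over the ensemble)."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]

ensemble = ["bookie-A", "bookie-B", "bookie-C", "bookie-D", "bookie-E"]
print(bookies_for_entry(0, ensemble))  # ['bookie-A', 'bookie-B', 'bookie-C']
print(bookies_for_entry(1, ensemble))  # ['bookie-B', 'bookie-C', 'bookie-D']
print(bookies_for_entry(4, ensemble))  # ['bookie-E', 'bookie-A', 'bookie-B']

# Read-path heuristic from the discussion: track an average entry size per
# log and use it to decide how many entries fit in a fixed read buffer.
def entries_to_read(avg_entry_bytes, buffer_bytes=500_000):
    return max(1, buffer_bytes // max(1, avg_entry_bytes))

print(entries_to_read(1024))   # 488 entries fit a ~500 KB budget
```

The key property of the striping function is that it is stateless: any client can recompute an entry's placement from the ledger metadata alone, which is why reads can fan out across the ensemble without coordination.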
[00:19:03] Unknown:
RudderStack's smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Another interesting capability of Bookkeeper is that it has built in capabilities for life cycling data into object storage and cold storage so that you can have a single interface to your entire storage capacity without having to keep everything on disk. And I'm wondering if you can just talk through some of the implications that that has on the ways that bookkeeper can be used and some of the design challenges that you've run across in terms of managing those data life cycles and being able to swap data off of disk, but still be able to read it into memory or be able to pass it through to the consumer for being able to access it after it's been life cycled off of the hot storage?
[00:20:19] Unknown:
So, actually, this is implemented on top of BookKeeper; it is more of a Pulsar feature, if you want. I mean, the line is blurred a bit, but let's say it sits on top of BookKeeper rather than in BookKeeper itself. When we wanted to implement this second tier of storage, we thought a lot about whether it should go into BookKeeper or on top of BookKeeper. The main reason we did it on top of BookKeeper was that it's actually much easier to implement without revolutionizing how BookKeeper stores the data. There would still be advantages, potentially, to doing it inside BookKeeper, because that changes the equation of having the storage nodes be stateful versus mostly stateless. That would be a very interesting thing; we keep playing with that idea, but we haven't nailed it down into a concrete form yet. So, just to recap: the way we do tiered storage right now, there's a component in Pulsar itself that decides that a particular ledger has crossed a threshold, say, after 24 hours, or after x amount of storage.
At that point we want to take it out of the hot storage, which is BookKeeper, and put it in colder storage like S3 or GCS. What happens is that we do a scan of this ledger and dump it into S3. Then we delete the ledger from BookKeeper, and we mark in the Pulsar metadata that this ledger is now in S3 at this object ID, and so on. From a Pulsar user's perspective, that is completely transparent, and the data is offloaded out of BookKeeper, so BookKeeper has no tracking of that data. This works well and is conceptually easy because you have this concept of ledgers that are closed. A Pulsar topic is not one single BookKeeper ledger; each ledger is a segment.
And these segments are closed and sealed and never touched again, so we can just take one, dump it, and change the metadata. It is clean and easy. There would be advantages in doing this within BookKeeper itself. The main thing is that, if you look at BookKeeper using EBS and a journal for high durability, one of the main bottlenecks easily becomes the bandwidth that a single VM has to EBS. VMs are typically capped in the bandwidth they have, depending on your VM type and so on; typically EC2 VMs are around 700 or 800 megabytes per second, some of the bigger VMs a bit more. But irrespective of how many EBS volumes you have, you're capped by that bandwidth. So if you write twice to EBS, once for the journal and once for storage, you're basically halving the max throughput that you can put on a single bookie.
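The offload flow just recapped, seal a segment, copy it to object storage, repoint the metadata, then drop the hot copy, can be sketched as a toy simulation. Everything here is hypothetical (the class names, the `("s3", object_id)` metadata shape), not the real Pulsar offloader API; the point is the ordering of the three steps, which is safe precisely because sealed segments never change.

```python
# Hedged sketch of tiered-storage offload for one sealed topic segment.

class Segment:
    """Stand-in for a sealed BookKeeper ledger backing one topic segment."""
    def __init__(self, ledger_id, entries):
        self.ledger_id = ledger_id
        self.entries = list(entries)

class ObjectStore:
    """Stand-in for S3/GCS: put a blob, get back an object id."""
    def __init__(self):
        self.objects = {}
    def put(self, key, data):
        object_id = f"obj-{key}"
        self.objects[object_id] = list(data)
        return object_id

def offload(segment, store, topic_metadata):
    # 1. scan the sealed segment and dump it as one object in cold storage
    object_id = store.put(segment.ledger_id, segment.entries)
    # 2. mark in the topic metadata where this segment now lives
    topic_metadata[segment.ledger_id] = ("s3", object_id)
    # 3. only then drop the hot copy from BookKeeper
    segment.entries = []
    return object_id

metadata = {}
store = ObjectStore()
segment = Segment(7, [b"a", b"b"])
print(offload(segment, store, metadata))   # obj-7
print(metadata[7])                         # ('s3', 'obj-7')
```

Doing step 2 before step 3 is what makes the move transparent to readers: the metadata always points at a complete copy of the segment, whether hot or cold.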
So one idea there would be to swap the storage out so that we use the bookie as a short-term journal, but instead of eventually writing to EBS, write to cloud storage like S3 for all the data. That changes the picture of the bookie a bit: the state becomes very short-lived, and it removes this EBS bottleneck. The main problem is that, in the way BookKeeper replication is done, the client writes to 2 or 3 bookies at the same time, and you'd end up with this data getting into multiple S3 objects, which does not make any sense. So there has to be some deeper change there to make it a better fit. We're still kind of
[00:23:45] Unknown:
iterating on that idea. Yeah. It's interesting that it's at a separate layer, because I know that the Pravega tool, for instance, is built on top of BookKeeper and also has that lifecycle capability that it takes advantage of. So I had assumed that it was built into BookKeeper, but I'm wondering if they have their own implementation separate from Pulsar or if they're using a shared library between the 2 projects.
[00:24:06] Unknown:
No, there's no shared library. I think we have invented the same thing over again, I guess. It's interesting how ideas get built independently at about the same time, in similar ways, without any collaboration between the different teams. Yeah. Sometimes it's a waste of engineering resources, I guess. At the same time, you can move faster if you don't have to coordinate; sometimes you start with something and then you converge to the same idea, but you approach it from a slightly different point of view. And as far as the operational considerations,
[00:24:41] Unknown:
you've talked through a lot of the different aspects, things like the bandwidth to EBS volumes and whether to run with the journal or without it. But for somebody who's interested in deploying a BookKeeper cluster, either because they want to use it as the backing store for a Pulsar cluster or because they want to build their own system on top of it, what are some of the things that they should be thinking of as they're building out their deployment logic for setting up a BookKeeper cluster and maintaining it over its lifetime?
[00:25:09] Unknown:
One of the most common things is making sure that the disks have enough IO bandwidth. Typically, people will create smaller volumes because, oh, I don't want to retain lots of data, but in a cloud environment that will also cap your IO bandwidth. So you'll not be able to write fast enough, maybe. Or maybe the volumes are small, and then the bookie is gonna fill up. Those are the most common issues that many people have: how to evaluate whether we have enough bandwidth on disk or not. And the other part, one of the things that I've seen over and over, is that people have been scaling a BookKeeper cluster up and down because it's easy in the cloud. So, okay, I start with 3 bookies. Oh no, I have a traffic spike, I will ramp it up to, say, 9 bookies now. And the traffic is gone, so I shrink it down to 3 bookies again.
And in the meantime, you lost data, because the data that was on those bookies that were spun up and then turned down is not there anymore. There are tools to re-replicate that data, but you need to be careful when scaling a system up and down anyway, right? You have to wait until the data gets replicated when you scale down your cluster. And I've seen it happen many times: okay, there's data missing, why is it missing, and that's the reason. Yep. Definitely a big oopsie.
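The scale-down hazard described here can be sketched with a toy placement model. This is purely illustrative: the bookie names, the 3-replica setting, and the random placement are assumptions, not BookKeeper's actual placement policy:

```python
import random

# Toy model of the scale-down hazard described above: entries land on
# whichever bookies exist at write time, so decommissioning bookies before
# their data is re-replicated loses entries. Names, the 3-replica setting,
# and random placement are illustrative assumptions.

random.seed(7)
REPLICAS = 3

def write(entries, bookies, placement):
    for e in entries:
        placement[e] = set(random.sample(bookies, REPLICAS))

placement = {}
# Steady state: 3 bookies, every entry is replicated to all of them.
write(range(0, 100), [f"bookie-{i}" for i in range(3)], placement)
# Traffic spike: scale up to 9 bookies; new entries spread across all 9.
write(range(100, 200), [f"bookie-{i}" for i in range(9)], placement)

# Shrink back to 3 bookies WITHOUT waiting for re-replication:
survivors = {f"bookie-{i}" for i in range(3)}
lost = [e for e, replicas in placement.items() if not (replicas & survivors)]

# Some spike-era entries lived only on the 6 removed bookies.
print(len(lost))
```

Waiting for auto-recovery to re-replicate each bookie's ledgers before removing it is what keeps `lost` empty in the real procedure.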
[00:26:26] Unknown:
Yes. And as far as your own experience of working with BookKeeper and building systems on top of it, what are some of the most interesting or compelling features from your perspective, as somebody who both builds the system and uses it? Most compelling, I think, is really the replication model, the single-writer model, which is very effective and very
[00:26:45] Unknown:
in a way. But the other thing that I think is very important is the fencing mechanism. I think this is often overlooked and not understood, but it is very important. In these systems, typically, you have the concept of leadership, right? So I can grab a lock on ZooKeeper, for example, and I can be the leader for this resource. The problem with this is that when you lose that leadership, you need something to validate your actions. So in BookKeeper, what you get is that if you are the leader, for example, I own this topic, so I am the leader for this ledger, therefore I can be the single writer.
If someone else tries to do something with that ledger, I'm automatically fenced off. So if someone else tries to read from that ledger and triggers a recovery, I am the writer, and I still think that I'm the leader and can keep writing, but someone else has a different opinion. It triggers a recovery, and I'm automatically fenced off. Then I can realize that, oh, by the way, maybe I should double-check whether I'm still the leader, because there is no hard barrier otherwise. With this leader-election mechanism, you need some kind of fencing at the storage level to validate your actions, and that's what BookKeeper offers.
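The fencing flow described here can be sketched as a toy state machine. This mimics the semantics only; the class and method names are made up for illustration, not the real BookKeeper client API:

```python
# Toy model of ledger fencing as described above: once a new client opens
# the ledger for recovery, the ledger is fenced, and the old writer's next
# append fails — which is exactly how a deposed leader finds out it lost
# leadership. Names and methods are illustrative, not BookKeeper's API.

class FencedError(Exception):
    pass

class Ledger:
    def __init__(self):
        self.entries = []
        self.fenced = False

    def append(self, entry):
        if self.fenced:
            raise FencedError("ledger fenced: another client took over")
        self.entries.append(entry)

    def open_for_recovery(self):
        # A new client opening the ledger for recovery fences it first,
        # then reads back whatever was durably written.
        self.fenced = True
        return list(self.entries)

ledger = Ledger()
ledger.append("a")                        # old leader writes happily
recovered = ledger.open_for_recovery()    # new leader fences, then recovers
try:
    ledger.append("b")                    # old leader's next write fails...
except FencedError:
    print("old writer fenced off")        # ...so it learns it lost leadership
print(recovered)  # ['a']
```

The key property is that the storage layer, not the lock service, is what ultimately rejects the stale writer.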
We actually think that that is very compelling and can be used for many things. Recently, in the new version of Pulsar, we wanted to expose that to the user of Pulsar as well, so that they can take advantage of this fencing through the concept of an exclusive producer. As a Pulsar user, you can specify that you want to be the only writer for a topic, which exposes the same fencing mechanism to the Pulsar user at that point. You can use that to validate your leadership actions by writing on the Pulsar topic as well. And another interesting aspect of the BookKeeper architecture
[00:28:38] Unknown:
is, you mentioned that right now the default coordinator for the BookKeeper cluster is ZooKeeper, but that is actually pluggable. I'm curious if you've seen other implementations besides ZooKeeper for that coordination layer, and what your thoughts are on the effects of ZooKeeper on the ways that BookKeeper is deployed as more things move into the cloud and become more dynamic. Because I know that ZooKeeper generally requires you to statically define the members of the ZooKeeper cluster, and how does that reflect back out to BookKeeper?
[00:29:14] Unknown:
So, yeah, there is an etcd implementation for the BookKeeper metadata backend. We are also working on porting the rest of Pulsar to be independent of ZooKeeper as well, but that's a longer story. With all these kinds of consensus systems, you need a set of bootstrap nodes that you have to define, and the host identity has to be stable. I don't think there's an easy way around defining an initial set of nodes, even if you implement your own consensus stack using Raft or anything else. You always have to define an initial set of nodes, and from there you can discover the rest of the cluster. But I don't think there's an easy solution that can do better than ZooKeeper on that front.
[00:30:00] Unknown:
Yeah. It's amazing the staying power that ZooKeeper has had, even as more things move into the cloud. As long as the resources that rely on ZooKeeper are able to scale up and down, people have been able to work around ZooKeeper being more statically allocated. It's had some pretty remarkable staying power because it is a very well-engineered and well-considered building block for distributed systems.
[00:30:23] Unknown:
Yeah. There are many good things in ZooKeeper. At the same time, newer systems like etcd have been moving much quicker in a way. It can be tricky to operate ZooKeeper when you're vertically scaling it for a lot of data or a lot of IO. It can be done, but it's not very operator-friendly. To understand why it is taking so much time to do something, you really have to dig down into the nitty-gritty details, like how the election works, to understand why it's taking 2 minutes to do a leader election because the snapshot was too big, and so on. So it does work; at the same time, it's not a trivial thing to operate. In terms of
[00:31:06] Unknown:
the BookKeeper project itself, I'm wondering what you've seen as some of the most interesting or innovative or unexpected ways that it's been used in systems that have been built on top of it. So
[00:31:16] Unknown:
we've seen very different use cases, like, as I said, as a blob storage, or as a backend for a queue store. Nothing, I think, any more interesting than that. It's mostly used by people building databases or data systems anyway. I don't remember any usage outside of that, or any more
[00:31:35] Unknown:
interesting than what I mentioned already. And in terms of your own experience of working on and building BookKeeper, as well as being a consumer of it and building other systems on top of it, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:31:49] Unknown:
The most challenging lesson, I think, is that when you're using BookKeeper for messaging, the most important thing, and I think it is most of the time overlooked, most people don't do benchmarks or load tests for it, is: will I be able to read my data out? If my consumers are down for, let's say, 4 hours, I stored the data, but will I be able to read the data back while new data comes in? Right? Because otherwise, if I cannot read it faster than the data comes in, there's no point in continuing, because you just keep falling more and more behind.
The lesson there is that even in the cases where you think it would work, depending on the different patterns of the use cases and how the data gets written and read, we had to do a lot of testing and learning and tuning, and reworking some of the storage components, to make sure that we can read data faster than it's written. And it's not always the case, because the writes are always sequential but the reads can be scattered in some ways. So you have to make sure that you can turn that into a more sequential read, and have this hierarchy of caches to make sure that you can read fast, especially from disk. That's the most challenging part that we had with BookKeeper, and I'd say the same for any other streaming data system, because everything works fine until it does not anymore. When a consumer is caught up, everything comes from caches, and there's no challenge for any system. But the moment you have to read a significant amount of data from disk is where everything falls apart in many systems.
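The "can I drain my backlog?" question raised here comes down to simple arithmetic; the rates below are made-up numbers for illustration:

```python
# Back-of-the-envelope for the backlog question above: after consumers are
# down for `outage_h` hours, catch-up only terminates if the sustained read
# rate exceeds the incoming write rate. All rates are illustrative.

def drain_hours(write_mb_s: float, read_mb_s: float, outage_h: float) -> float:
    """Hours to drain the backlog; infinity if reads can't outpace writes."""
    if read_mb_s <= write_mb_s:
        return float("inf")  # falling further behind forever
    backlog_mb = write_mb_s * outage_h * 3600
    return backlog_mb / ((read_mb_s - write_mb_s) * 3600)

print(drain_hours(100, 300, outage_h=4))  # 2.0 hours to catch up
print(drain_hours(100, 90, outage_h=4))   # inf: never catches up
```

The catch is that `read_mb_s` here must be the *cold* read rate from disk, not the cached rate, which is exactly what load tests tend to miss.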
We learned some of these lessons the hard way, and then we tried to make sure that we are prepared for them. Yeah. It's definitely an interesting point that, you know, with a lot of different
[00:33:36] Unknown:
database systems, you can have a pretty good understanding of what the read versus write volumes are going to be, where some systems are going to be heavier on reads than writes, and so you optimize for the read path, and some systems are going to be write-many, read-once, and so you optimize for the write path. Whereas with BookKeeper, because it's the underlying storage layer that other systems are built on top of, you don't necessarily know ahead of time what the relative volumes of writes versus reads are going to be, where you might have one stream of data writing in, but then you have 15 different consumers all reading back out that same stream for different purposes. But at the same time, you might have it the other way, where you have 15 different streams of data being written, but only one consumer. And so you have to be able to make it tunable for optimizing to handle both of those extremes of read versus write.
[00:34:24] Unknown:
Yeah. And it has to be tunable, but it also needs to be able to separate reads and writes, because these 2 modes can be in the same system. At some point you just have writes, because all the reads, for example, are served from the brokers' caches. But then at some point these readers will come up and start to read the cold data, and now you have all these reads in addition. So it is not always the same; it will spike up and down depending on the consumers.
[00:34:48] Unknown:
And are there any interesting or unexpected failure modes that you've come across, or edge cases that you've hit in the process of building systems on top of BookKeeper, and some interesting war stories that have come out of it? Yeah. I mean, definitely a lot of war stories.
[00:35:04] Unknown:
In the beginning, we tried to do the indexing of the entries on disk using LevelDB, back in the day when it first came out. The interesting thing was that it would just sit there, and all your reads would just time out. You're trying to read, and it's spinning one thread at 100%. So we had this situation in which, at some point, some of the nodes would stop working and all the reads would time out. We had to take the node offline, rebuild the index, and put it back online. So we got around it with some workarounds initially, and then by having these tools to rebuild the indexes very quickly. Later, we moved to RocksDB when it was coming out, and made sure to use it for the indexes.
Another war story was from when we switched to SSDs for the first time. We had these early models, and those SSDs had a limited lifespan: the moment you do a certain number of IOPS, the disk will just start getting IO errors and so on. And what happened is that since, in BookKeeper, the traffic is very well spread across all the nodes, we brought these new machines up with all new volumes. After several months, we started seeing one failure, one machine. Okay, that's fine, disks fail, take it down, we have the other 40 machines, that's not a problem. And after 12 hours, another machine starts failing. Okay, that's okay. And then, eventually, all of them were failing, because they all reached the same number of IOPS at the same time. So that was a bit stressful, but we rushed and did some tricks, like writing the journal somewhere else and so on, before everything went down.
[00:36:47] Unknown:
That's definitely an interesting firefighting scenario, because particularly with physical machines, where you have a lead time for getting new disks in, that's a pretty scary situation.
[00:37:00] Unknown:
It is more. It is not the app powers. Right? So
[00:37:03] Unknown:
Alright. And so, for people who are interested in building a real-time-oriented distributed storage system, or looking for something that will work well as the underlying storage layer, what are the cases where BookKeeper might be the wrong choice, and they're better suited with either a more traditional database or something like a distributed file system such as Ceph?
[00:37:29] Unknown:
If you need modifiable data, then it's not a good choice, because you can't modify it. Same thing if you need point access: barebones BookKeeper is not a good fit, because it's logs, right? The table service that I talked about before is more like a key-value store, but that's like building a database on top of BookKeeper; at that point, if you need key-value access, a database is what you need. The other thing is that you can create many ledgers, but the number of ledgers is not infinite. You can't have billions of ledgers, for example; it's not designed for that. You can have millions of them, but not hundreds of millions. So if you have that many objects that you need, it's probably not a good choice; you should use either a database or an object store like S3 to create those objects. Because a ledger is not a free concept; it is not as lightweight as a key-value pair in a key-value store, because a ledger has metadata of its own. But otherwise, as a low-level storage, I think it's very flexible. So: no rewrites, no key-value access, and not an unlimited number of ledgers.
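The "ledgers are not free" point can be made concrete with a rough metadata estimate. The per-ledger byte count below is an illustrative assumption (real metadata size depends on ensemble configuration and the metadata store), but the order of magnitude shows why millions of ledgers are fine and hundreds of millions are not:

```python
# Rough cost model for the 'ledgers are not free' point above: each ledger
# carries its own metadata record (in ZooKeeper by default), so hundreds of
# millions of ledgers implies a very large metadata store with that many
# znodes. The 500-byte-per-ledger figure is an illustrative assumption.

ASSUMED_METADATA_BYTES_PER_LEDGER = 500

def metadata_gb(num_ledgers: int) -> float:
    return num_ledgers * ASSUMED_METADATA_BYTES_PER_LEDGER / 1e9

print(metadata_gb(1_000_000))    # 0.5 GB   -> fine: millions of ledgers
print(metadata_gb(500_000_000))  # 250.0 GB -> not a fit: use a DB or object store
```

Since the metadata store (ZooKeeper by default) keeps this state in memory, the object count, not the data volume, is what caps how many ledgers a deployment can reasonably hold.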
[00:38:36] Unknown:
And as you continue to work with BookKeeper and use it for building things like Pulsar, what are some of the things that you have planned for the near-to-medium-term future of the BookKeeper project?
[00:38:46] Unknown:
One of the things that we have been working on is mostly in the area of improving performance and having more operational tooling. Also, one of the new core features that has been worked on is using direct IO for storage. That improves performance a lot, because we're no longer bound by the page cache; going through the page cache, even when we don't need it, has a lot of cost. So we want to get that out. And another area where we've been doing work is on having a way to detect data inconsistency between nodes. We already have auto-recovery, so that if a node crashes, the data will be copied from the other replicas. But this is more like: I want to run, for example, let's say, without the journal, and because of that, I might be missing some of the data.
But I want to be able to detect that in a good way, so that the window of time in which data is under-replicated is much shorter. A node might come up missing some of the data, but it will be able to check what it has locally and resync immediately with the rest of the nodes. So it's more about having flexibility in the different kinds of deployments.
[00:39:52] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that, with the different systems for streaming and batch, one of the areas that we want to work on, and this is not just BookKeeper, but Pulsar plus BookKeeper in a way, that we are thinking of, is to
[00:40:19] Unknown:
figure out how to work with data that is both real-time and historical. Bridging the gap between the batch and the streaming worlds, and having a system that can support both cases with a single API, a single platform, that can really run both. I think BookKeeper is a very good building block for both worlds, between BookKeeper and what we've been building in Pulsar on top of it, the tiered storage part, and the different ways to store the data,
[00:40:46] Unknown:
including historical data. I think that's where we want to get to, in a better way. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on BookKeeper. It's definitely a very interesting project and an interesting building block for the overall data ecosystem. So I appreciate all the time that you and all the other contributors have put into it, and I hope you enjoy the rest of your day. Thank you for having me and for the nice questions and discussions. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Matteo Merli on Apache BookKeeper
Origins and Capabilities of Apache BookKeeper
Use Cases and Applications of BookKeeper
BookKeeper Architecture and Design Decisions
Evolution and Adaptation of BookKeeper
Record-Oriented Storage and Data Distribution
Data Lifecycle Management and Cold Storage
Operational Considerations for BookKeeper Deployment
Innovative Uses and Lessons Learned
When Not to Use BookKeeper
Future Plans for BookKeeper
Closing Remarks and Contact Information