Change Data Capture For All Of Your Databases With Debezium - Episode 114

Summary

Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or with understanding when it would be useful, then this episode is for you.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Change Data Capture is and some of the ways that it can be used?
  • What is Debezium and what problems does it solve?
    • What was your motivation for creating it?
    • What are some of the use cases that it enables?
    • What are some of the other options on the market for handling change data capture?
  • Can you describe the systems architecture of Debezium and how it has evolved since it was first created?
    • How has the tight coupling with Kafka impacted the direction and capabilities of Debezium?
    • What, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?
  • What are the data sources that are supported by Debezium?
    • Given that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?
  • What is involved in deploying, integrating, and maintaining an installation of Debezium?
    • What are the scaling factors?
    • What are some of the edge cases that users and operators should be aware of?
  • Debezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?
    • What are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?
  • What are some of the common downstream systems that consume the outputs of Debezium?
    • What challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?
  • What are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?
  • What have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?
  • What is in store for the future of Debezium?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture. So Randall, can you start by introducing yourself?
Randall Hauch
0:01:48
Hi, thanks. My name is Randall. I am a software engineer at Confluent. I've been there for almost three years. I work on the Kafka Connect framework, which is part of the Apache Kafka project, and I work on Confluent's family of Kafka Connect connectors. Before that I was at Red Hat, which is where I created the Debezium project in the last few years I was there. And prior to that I spent most of my career in various forms of data integration.
Tobias Macey
0:02:19
And Gunnar, how about yourself?
Gunnar Morling
0:02:20
Yes, I work as a software engineer at Red Hat. I am the current project lead of Debezium, so I took over from Randall a few years ago. And before that I also used to work on other data related projects at Red Hat. So I used to be the spec lead for the Bean Validation 2.0 spec, and I also worked as a member of the Hibernate team for a few years. And Randall, do you remember how you first got involved in the area of data management?
Randall Hauch
0:02:47
Yeah, so before I was at Red Hat, I worked at a couple of startups that were all around data integration. We were trying to solve the disparate data problem, where you have data in multiple databases, multiple data systems. And really I've worked in that area and related areas
Tobias Macey
0:03:11
ever since. And Gunnar, do you remember how you got involved in data management?
Gunnar Morling
0:03:15
Yeah, so I think it came to be when I was looking into Bean Validation. I was, you know, working at another company back then, and I was reading up on this new spec, Bean Validation 1.0, and I was writing a blog post about it because I really liked the idea. And then the spec lead back then reached out to me and asked, hey, wouldn't you be interested in contributing to the spec and to the reference implementation? So that's what I did for a few years in my spare time. And at some point I had the chance to work on, you know, data related projects full time at Red Hat, and that's what I've been doing since then. And so before we get too far into what Debezium itself is, can we talk a bit about what change data capture is and some of the use cases that it enables? So essentially change data capture is the process of getting change events out of your database. So whenever there's a new record, or something gets updated, or something gets deleted from the database, the change data capture process will, you know, get hold of this event and stream it to consumers. Typically you do it via messaging infrastructure, such as Apache Kafka, so it's all nicely decoupled. And then you can have consumers which react to those change events, and you can enable all sorts of use cases. So you could use it just to replicate data from one database to another, or to propagate data from your database into a search index or a cache, but you also could use it for things like maintaining audit logs, for instance, or for propagating data between microservices, or for running streaming queries. So whenever something in your data changes, you get an updated query result. And so all these kinds of things you can do.
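For a concrete sense of what "consumers reacting to change events" looks like in practice, here is a minimal sketch of a plain Kafka consumer subscribed to one Debezium table topic. The topic name, consumer group, and broker address are illustrative assumptions; the actual topic naming depends on how the connector is configured.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChangeEventListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumption: local broker
        props.put("group.id", "cdc-demo");                        // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");               // replay the topic from the beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic name following a <server>.<schema>.<table> convention.
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is one change event (insert, update, or delete) for one row.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```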
Tobias Macey
0:04:55
And so in terms of Debezium itself, can you describe a bit about the problems that it solves and some of your motivation for creating it, Randall?
Randall Hauch
0:05:03
Yeah. So with change data capture, all of the different databases have varying support for letting client applications see and consume those events that Gunnar was talking about. Some databases, you know, don't even support it. And really it all comes from the fact that databases are really focused around returning and operating on the current state of data. And so if you want the history of that data, well, databases internally store that information in order to handle transactions and recover from failures, but they oftentimes don't make it available, or at least that was not really part of the typical user interface. And so all these different databases expose the ability to see these events in various ways, and if you are working with a variety of databases, it's really difficult to know and use DBMS specific features for each of these systems that you're working with. And so when I was at Red Hat, that was one of the problems that we had. We were trying to, you know, as I mentioned earlier, work with a lot of different DBMSs, and Red Hat customers, you know, had a large variety of different types of DBMSs. And we had, at the time, the desire to be able to listen to all of the events, and doing that for each DBMS specifically would be very challenging. So we wanted to create a framework that abstracted away that process, so that applications, different consumers that were interested in those events, could simply subscribe. You know, we started out with Apache Kafka, so they could subscribe to Apache Kafka and just magically see all the events, because Debezium would take care of reading those change events from each of the DBMSs and inserting them into Kafka topics.
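To make "Debezium reads the change events and inserts them into Kafka topics" concrete, here is a hedged sketch of registering a Debezium MySQL connector against a Kafka Connect cluster through its REST API. Host names and credentials are placeholders, and some property names (such as table.whitelist) reflect roughly the Debezium 1.0-era configuration and were later renamed, so treat this as a sketch rather than an authoritative reference.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterMySqlConnector {
    public static void main(String[] args) throws Exception {
        // Connector configuration posted as JSON to the Kafka Connect REST API.
        String connectorJson = """
            {
              "name": "inventory-connector",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql",
                "database.port": "3306",
                "database.user": "debezium",
                "database.password": "dbz",
                "database.server.id": "184054",
                "database.server.name": "dbserver1",
                "table.whitelist": "inventory.orders,inventory.order_lines",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.inventory"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors"))   // Kafka Connect REST endpoint (placeholder host)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In practice this registration is usually done with curl or an orchestration tool rather than hand-written Java, but the shape of the request is the same.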
Tobias Macey
0:06:58
And so I know that the general case of what people think of with change data capture is for relational systems. But when I was going through the documentation, I was noticing that there are also systems such as Cassandra and MongoDB that are supported. And I'm curious, what are some of the biggest challenges in terms of being able to provide support for all these different database engines, for being able to pull the change sets out and then provide some at least somewhat unified way of being able to process them, and what the sort of lowest common denominator is in terms of the types of information that you can expose?
Gunnar Morling
0:07:33
Right, I mean, as Randall was mentioning, it's always different APIs, right? There's one way you would get change events out of MySQL, there's a different way to get them out of Postgres, out of Cassandra, and so on. So for us as the Debezium team that means we need to do this engineering effort for each of the connectors, and then we try to keep you as a user as much as possible away from the nitty gritty details. In particular, when it comes to the relational connectors, you don't really have to think too much about, okay, does this event come from SQL Server or does it come from Postgres, let's say. So there it works in a pretty unified way. It's looking a bit different if you look at MongoDB. It's just a different way of storing data there, you have those documents, you can structure them as you want, there's no fixed schema. So the events you would get from MongoDB, they look quite a bit different. And then Cassandra is yet another case, because, well, this is a distributed store by default, so you don't have those strong guarantees when it comes to ordering of events and so on, depending on which node of the cluster we receive changes from. So for those NoSQL stores it's a bit less transparent for the consumer which particular database this event is coming from, but still we strive for as much uniformity as possible. So we try to have similar config options, similar semantics, and we try to structure things for you as a user as similarly as possible, so you don't have that much friction when moving from one connector to the other.
Tobias Macey
0:09:08
And so being able to work across these different systems is definitely valuable. I'm curious if you've seen cases where people are blending events from different data stores to either populate a secondary or tertiary data store or be able to provide some sort of unified logic in terms of the types of events that are coming out of those different systems. Or if you think that the general case is that people are handling those change sets as their own distinct streams for separate purposes.
Randall Hauch
0:09:37
Yeah. So I would say, you know, the way that I've seen Debezium used, and other connectors used, is that the primary purpose is to get those events into a streaming system. And once they're there, then you can do various kinds of manipulations on all of the events in those streams. And if you have different streams for different databases, or different streams for different tables in different databases, then you can process each of those streams in slightly different ways, depending upon the structure of those streams, the type of events, or, you know, if you want to filter certain kinds of things. So all of that can be sort of downstream. And I think one of the primary goals of Debezium, at least when we started out, was just to get these events into a streaming system that had rich mechanisms for doing additional processing on top.
Gunnar Morling
0:10:37
Yeah, absolutely. That's also something we see. So people use technologies like Kafka Streams, or let's say Apache Flink, to process those events once they are in Kafka, or whatever they are using. And that's also something we would like to facilitate more going forward. So one very common requirement, for instance, is to aggregate or to correlate multiple streams from multiple tables. One example would be a purchase order. If you think about a relational database, typically this would be persisted in at least two tables. You would have a table with the order headers, and then you would have another table with an entry for each order line of this purchase order. So if you wanted to look at the full, complete state of the purchase order, you need to look at the rows from those two tables. And this means, in the case of Debezium, you also need to look at the change events from those two topics, because you would have one topic per table, and people are trying to aggregate this data. And, well, you can do this actually quite nicely using Kafka Streams, for instance. So you could have a stream processing application there which essentially subscribes to those two topics and joins them, so it would give you one aggregated event which describes the full purchase order and all its order lines.
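As a rough illustration of the order-header-plus-order-lines aggregation described here, below is a hedged Kafka Streams sketch. The topic names, the JSON field used to re-key the order lines, and the string-based serialization are illustrative assumptions; a real implementation would use proper serdes, typed records, and handling for deletes and tombstones.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.Properties;

public class PurchaseOrderAggregator {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Change events for the order header table, keyed by order id (hypothetical topic name).
        KTable<String, String> orders = builder.table("dbserver1.inventory.orders");

        // Order line events are keyed by line id, so re-key them by the order id carried in
        // the event payload, then collect all lines per order into one aggregated value.
        KTable<String, String> linesPerOrder = builder
                .<String, String>stream("dbserver1.inventory.order_lines")
                .selectKey((lineId, event) -> extractOrderId(event))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .aggregate(
                        () -> "[]",
                        (orderId, event, agg) -> appendLine(agg, event),
                        Materialized.with(Serdes.String(), Serdes.String()));

        // Join header and lines into a single aggregated "purchase order" event per order id.
        orders.join(linesPerOrder,
                        (header, lines) -> "{\"order\":" + header + ",\"lines\":" + lines + "}")
                .toStream()
                .to("purchase-orders-aggregated");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-order-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    // Assumes the Debezium envelope carries the order id at payload.after.order_id.
    private static String extractOrderId(String event) {
        try {
            return MAPPER.readTree(event).path("payload").path("after").path("order_id").asText();
        } catch (Exception e) {
            return "unknown";
        }
    }

    // Naive string-level accumulation of a JSON array, purely for illustration.
    private static String appendLine(String agg, String event) {
        String body = agg.substring(1, agg.length() - 1);
        return "[" + (body.isEmpty() ? event : body + "," + event) + "]";
    }
}
```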
Tobias Macey
0:11:58
Given the fact that change data capture is used for so many different use cases, and that the source integration is different for each of the different engines that you're working with, I'm curious what other solutions exist as far as being able to handle those change data streams, and if they are generally more single purpose for a single database, or if there is anything else working in the same space as Debezium?
Randall Hauch
0:12:25
I mean, we try to focus mostly on the stuff we do, so I'm not really aware of any other open source project, for sure, which would aim at having such a variety of supported databases. There are always solutions, you could use another tool for, let's say, MySQL, or you could use something for this particular other database, but I don't think there's another open source project which tries to have this complete coverage of most of the relevant databases. Yeah, there are a number of open source libraries that target, you know, particular DBMSs or particular data systems. But one of the things that we tried to do when we created Debezium was to do the capture of these events well, and put those events into a system that had sufficient guarantees that we didn't lose any data, that data wasn't reordered, and that you could always pick up. Right, so if anything went wrong, the system would be fault tolerant and would pick up where it left off. And that was really the primary justification for using and building on top of Apache Kafka in the first place: Kafka is able to provide those guarantees. And when you do that, you sort of decouple the responsibility of knowing how to talk to the data system and extract those change events from storing those in a system that can be easily processed and used regardless of what went wrong, or when you're reading them, or how much lag you have, you know, how far behind consumers are from the actual Debezium connectors that are pushing data into those streaming systems.
Gunnar Morling
0:14:06
Right, I mean, an interesting development I saw lately is that more of those modern databases, which are typically distributed, like YugabyteDB and so on, they tend to come with CDC APIs out of the box. So I would say those more modern database vendors, they typically see the need for having this kind of functionality, and it's not an afterthought, as maybe used to be the case; they tend to think about this up front. That's great. Now, I was following one discussion, and they were thinking, okay, maybe we can have a way to stream our events into Elasticsearch or something like this, one particular sink, and I would not recommend doing that. So as a database, I think they should rather focus on having a good API, and maybe have a way to put the events into Kafka, but they should not be in the business of thinking about all those different sinks, because it's just something you cannot keep up with, right? There will always be new data warehouses, new search indexes, new use cases. So it's just better for the source side to focus on that aspect, get change events out, as Randall mentioned, in a consistent way, ensure the ordering isn't messed up and all this kind of stuff. And then you would use other sink connectors to get the data from Kafka into the particular systems where you would like to have your data.
Randall Hauch
0:15:26
Right, and a lot of the systems that support writing data or streams of events into particular systems like Elasticsearch, you sort of have these point to point capabilities. And, you know, as we all know, in a lot of data architectures point to point combinations get really complicated and messy, and it's much better oftentimes to have sort of a hub and spoke design, where you don't have all of the point to point connections from database A into, you know, a search index or into an object store, and you have more flexibility.
Gunnar Morling
0:16:05
Yeah, I mean, that's also something I always like to emphasize. Once we have the change events in Apache Kafka, we can keep the data there for a long time, right? Essentially as long as you have disk space. So you could keep your events for a year, or essentially for an indefinite amount of time, and you could replay topics from the beginning. So this means you even could set up new consumers far down in the future, provided you still have those change events, and then they could take the events and write them to this new data warehouse which you maybe hadn't even thought about when those change events were produced. So I think that's another big advantage of having this kind of architecture.
Tobias Macey
0:16:42
Yeah, that's definitely a huge piece of the CDC space: as you mentioned, if you don't have all of these events from the beginning of time, you do need to take an initial snapshot of the structure and contents of the database to then be able to take advantage of it going forward. And so I'm curious what you have found as far as options for people who are coming to change data capture after they've already got an existing database, and any ways that Debezium is able to help with that overall effort of being able to populate a secondary data store with all of the contents that you care about, whether it's from just a subset of the tables, or you need the entire schema, things like that.
Randall Hauch
0:17:22
Yeah, really, ever since the beginning of Debezium that was an important aspect, to be able to take an initial snapshot, and then consistently pick up where that snapshot left off and capture all of the changes from that point. And so when you're starting a Debezium connector, you very often want to begin with that snapshot and then transition to capturing the changes. For a consumer of one of those change event streams, they often don't really necessarily care about the difference. At some point they see that a record exists with some key, and then a bit later, after the changes are being captured, they would see that that record has changed and they see a change event, and then maybe the record is deleted. So most of the time the consumers don't care. And one of the interesting things when we looked at Kafka: Kafka was just introducing the ability to have compacted topics. With a compacted topic, you basically keep the last record with the same key, and so as time goes on you can sort of throw away the earlier records that have the same key. So if you're capturing change events, and you're using, let's say, the primary key as the key for the record, then, depending on how you set up your topic, you can set that up as compacted, and you can say, oh, I only want to keep, you know, the last month of events. And any consumer that begins reading that stream, let's say several months from now, they start from the beginning and they're essentially getting a snapshot. It just happens to be a snapshot where they start with the most recent events for every record in a table, let's say. And so that's a very interesting way of, again, decoupling how to take an initial snapshot and start capturing events on the production side, while on the consumption side still providing essentially the semantics of an initial snapshot, but one that isn't necessarily rooted in the particular time that the producer actually created its first snapshot.
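For reference, here is a minimal sketch of creating a log-compacted topic with Kafka's AdminClient, which is the mechanism described above for keeping only the latest event per key. The topic name, partition count, and replication factor are illustrative assumptions.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker

        try (Admin admin = Admin.create(props)) {
            // Compaction keeps only the most recent record per key; consumers reading the topic
            // from the beginning then see, in effect, a snapshot of the table plus recent changes.
            NewTopic topic = new NewTopic("dbserver1.inventory.customers", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```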
Gunnar Morling
0:19:30
It's really interesting that you are talking about snapshots, because it's quite a specific part of the story, but just right now we are looking into ways of improving this, because, well, depending on the size of your data, taking the snapshot can take a substantial amount of time. So essentially what it does is it scans all the tables. You can set up filters, so you can specify which of the tables, or collections for MongoDB, you would like to have captured, and then it will scan those tables. You could also limit the amount of data, so let's say you have some logically deleted records, you could exclude those from the snapshot. But still, it could take some amount of time. And what we're looking at right now is to parallelize those snapshots. So the idea would be, well, instead of processing one table after the other, you could also, let's say, have four, or whatever number of worker threads, and then read multiple tables in parallel. And we were just figuring out that for Postgres, for instance, we can do this quite easily in a consistent way. So we would have those parallel snapshot workers, and once they're all done, we still have the ability to continue the streaming from one consistent point in the transaction log. So that's one of the improvements we have planned, hopefully for next year, to speed up those initial snapshots.
Tobias Macey
0:20:55
And can we dig a bit deeper into how Debezium itself is architected, and some of the evolution that it's gone through from when you first began working on it through to where it is now, and how the different use cases and challenges that you've confronted have influenced the direction that it's taken?
Gunnar Morling
0:21:12
Yeah, maybe, Randall, you can talk about the beginning; that would be interesting for me too.
Randall Hauch
0:21:17
Yeah. So I was working on several different prototypes, I was looking at a variety of technologies, and it wasn't really until I came across Kafka that I really understood the requirement of having, you know, those guarantees that I mentioned earlier on the change stream. Because there was always the problem of, what happens if my change event reader, where I'm reading these change events from, what happens when that goes down? And when it comes back up, how do you guarantee all of that? So I went through several different prototypes, started out initially with MySQL only, really just trying to understand how to read the data from MySQL and how to use Kafka in a way that would persist all the information and that was fault tolerant. And I had tried that with a couple of different systems; I had studied various kinds of distributed system technologies and data management technologies before I got to Kafka. But then once all the pieces started to fit together, I really focused on, okay, how can I get this to work with MySQL? It became apparent that just starting to read the binary log from MySQL wouldn't be that useful, because, you know, what happened to all the data if I start reading this all of a sudden? I get some changes, but I may have a million records in my table before I even start doing that. And so the initial snapshot, it became apparent that that was going to be a very important part of bootstrapping any kind of practical consumption application. And once I had something working for MySQL, then we looked a little bit at Postgres. I didn't want to build a system that was designed all around MySQL and wouldn't work for anything else. And so it really just, at that point, slowly evolved into what I would call practical Kafka Connect connector implementations, and that was essentially what the Debezium GitHub repository was seeded with.
Gunnar Morling
0:23:25
Yeah. And so when I took over, I think we had connectors for MySQL, Postgres, and MongoDB, and I think Postgres was pretty much at its beginnings. So we've been evolving those, and we added SQL Server, we have some incubating support for Oracle, and there probably will be more. But really one challenge which became apparent is that we were duplicating efforts between those different connectors. So typically somebody from the community would come and they would say, I would like to have this particular feature, so could you implement it, or could I implement it, I will send you a PR for that, a pull request. And then we figured, okay, that's a great feature, but now we are actually implementing this in four different places, and, you know, that's obviously not a great way to do it. So we tried to unify more and more parts of the code base, and this is still an ongoing process. We are still not quite where we would like to be, but typically now, if this kind of situation happens, somebody comes and they would like to have a new feature, we either can implement it in a way that just works for all the connectors we have, in particular across all the relational connectors that works quite well, or we say, okay, it's fine if it's just added for this connector, but you already do this in a way so that in the future it will be possible to refactor it and extract this functionality into some more generic code. The reason for that being, typically, if somebody comes, they are on a particular database. So there's this MySQL user, or there's this Postgres user, and very often they are very eager to work on this stuff. That's really one of the things I'm very excited about in this community. So we say, okay, that's a great idea, how about you implement it yourself? And then they often say, oh yeah, I would like to do it, but then, well, they are on MySQL or Postgres or whatever database they use, and now it would be asking a little bit too much to say, okay, can you make it happen so it works for all the databases, in particular if this needs some large scale refactoring. So we have this kind of balance: we try to be as generic as possible, but still we don't want to overwhelm contributors with this kind of work.
Randall Hauch
0:25:36
Yeah, and that's a big challenge. That's something that Gunnar and the community, you know, when I transitioned project leadership to Gunnar, that was really a lot of technical debt. When we started out, you know, only having three connectors, it was very difficult. Actually, we started with MySQL, and then we added sort of a prototype for Postgres, but it really was not fully formed. And even from that point we could see that there was going to be some duplication, but it wasn't clear where those abstractions needed to be, and we just hadn't seen enough of those patterns. And I think the community under Gunnar's leadership has done a really great job of trying to refactor and find those. And it's very challenging, because with all of these databases, not only are their APIs different, but a lot of times their semantics are different. So finding common patterns among four or five very different APIs, and how to use them, can be quite challenging.
Gunnar Morling
0:26:40
Yeah, absolutely. I would say it's still an ongoing battle for us. We've made some good progress there, but still there are so many things which we would like to see done differently, so it's an ongoing effort, I would say.
Tobias Macey
0:26:53
Yeah, the fact that the different databases have differences in terms of just how they operate in general, I'm interested in how that manifests as far as the types of information that they can expose for change data capture, and what that means for the end user as far as being able to write some sort of logic that will assume certain types of events. And what happens when there are either more granular events that are available, or if there's any sort of tuning available on the Debezium side to say, aggregate all the events up to this level of granularity, or I want to be able to see every type of change event, not necessarily just at transaction boundaries, but every change within that transaction. I'm just curious whether there are any differences along those lines?
Gunnar Morling
0:27:38
Yeah, so currently you really would get a change event for each change. So for each insert, each update, each delete, you would get an event. Now, what definitely can differ, depending on your configuration, is the payload of the events. Giving one particular example, for Postgres it's a matter of configuration whether for an update event you would get the complete old row state or not. So typically, by default, as the events are designed, you would get the full new state of the record which was updated, and also the full previous state. That can be very useful, obviously. But then, depending on how the database is configured, you might not get this complete previous row state; maybe you just get the state of those columns which actually were updated in this particular statement. So let's say your table has 10 columns and you just update two out of those 10, then, depending on the configuration, you might just see the old values for those columns which are affected. That's also the case for Cassandra, for instance. So you definitely would have some differences at this level. Now, as you were mentioning transactional guarantees and awareness of transaction boundaries, that's also something which comes up very often. So typically, or as it is right now, you would get all the changes without any awareness of the transactional boundaries. And what we receive as feedback from the community is that people would like to have this awareness. So let's say you are doing five updates within one transaction. There are many use cases where you essentially would like to wait for all those five change events, maybe they're in different tables, and then process them at once, instead of processing them one by one as they arrive. So that's also something we would like to add, probably very early next year. We would have a way to expose the information about those transaction boundaries. That probably would be a separate topic which has markers: okay, now a transaction has started; then there would be another event, okay, this transaction has completed, and by the way, within this transaction these and those numbers of records in different tables have been affected. So this would allow a consumer to essentially await all the events which are coming from one transaction and then do this kind of aggregated processing. That's definitely a very common requirement, and we would like to better support it.
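To ground the discussion of event payloads, here is a hedged sketch of pulling the before and after row state and the operation type out of a Debezium change event's JSON envelope. The field names follow the commonly documented envelope layout (payload.before, payload.after, payload.op, payload.source), but whether "before" is fully populated depends on the source database's configuration, as noted above.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ChangeEventEnvelope {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns a short human-readable summary of one change event value.
    static String describe(String eventValue) throws Exception {
        JsonNode payload = MAPPER.readTree(eventValue).path("payload");
        String op = payload.path("op").asText();   // "c" create, "u" update, "d" delete, "r" snapshot read
        JsonNode before = payload.path("before");   // may be missing or partial depending on DB configuration
        JsonNode after = payload.path("after");     // absent for deletes
        String table = payload.path("source").path("table").asText();

        switch (op) {
            case "c":
            case "r":
                return "row added/read in " + table + ": " + after;
            case "u":
                return "row updated in " + table + ": " + before + " -> " + after;
            case "d":
                return "row deleted from " + table + ": " + before;
            default:
                return "unhandled operation " + op;
        }
    }
}
```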
Tobias Macey
0:30:17
Yeah. And I'm curious what you have seen as far as tension in terms of the system design and feature set of Debezium, from just being a raw pipe, these are the events that are coming through, do with them what you will, to being something that has some level of interpretation of those events, for being able to say, okay, these are the events that happened, this is what the actual outcome of them is, so that there is maybe less business logic required on the consumer end to be able to take advantage of those events, at the expense of having all of the granularity for people who either might want it, or have to have it for maybe some sort of compliance reasons.
Gunnar Morling
0:31:00
Yeah, I mean, as I mentioned, that's something we would like to better support. It's not quite clear to me yet how it will look, so the community will work towards making this a reality, but one idea I could see us doing is to have some sort of ready made stream processing applications. So it would be a service we would provide as part of the Debezium platform, and this service would be configurable. You could tell it, okay, process those topics. Coming back to this order and order line example, you could say, okay, take those two topics, the order headers and the order lines, and then, by means of this additional transaction boundary topic which we are going to create, please produce transactionally consistent events which represent the entire purchase order and all its order lines. So then you would have events on a higher level than the pure row level events. That's one of the ideas I could see us pursuing, having this kind of configurable downstream components, which could be based on Kafka Streams, for instance. And then another thing, which we actually already have, is support for the outbox pattern. So this is interesting. A concern which we sometimes see is that people feel not that comfortable about exposing the internal structure of their tables. Maybe they are in the world of microservices, and they feel not comfortable about exposing their internal purchase order table and all its columns to downstream consumers. Let's say they want to rename a column or change its type; they want to have this kind of flexibility, and they don't want to impose on any downstream consumers right away. And the outbox pattern is a nice way around that. The way it works is, as part of your transaction, your application updates its tables, so it would update the purchase order table, it would update the order lines table and so on, but at the same time it also would produce an event and write it into a separate table, the outbox table. So this really is like a messaging queue, you could say. And then you would use Debezium to capture essentially just the inserts from this outbox table. And now, for the events within this outbox table, you have full control over their structure, and they would be independent from your internal database model. So if you were to rename a column in your internal table, you would not necessarily have to rename the corresponding field in the outbox event, so you have some sort of decoupling there. And this is something we have added support for, so there's a routing component which would, for instance, allow you to take the events from this outbox table and stream them into different topics in Kafka. This is definitely something where we see a strong uptake of interest in the community, so people have blogged about it, and it's very interesting because it addresses, I think, those questions you mentioned.
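As a rough sketch of the outbox pattern described here, the snippet below writes a business table and an outbox row in one local transaction using plain JDBC; Debezium would then capture only the outbox inserts. The connection details and the table and column names (id, aggregatetype, aggregateid, type, payload) are illustrative assumptions that mirror the defaults commonly associated with Debezium's outbox event router, not a definitive schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class PlaceOrderWithOutbox {

    // Inserts the order and, in the same transaction, an outbox row describing the event.
    static void placeOrder(String orderId, String customerId, String orderJson) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/inventory", "app", "secret")) { // placeholder DB
            conn.setAutoCommit(false);
            try {
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO purchase_order (id, customer_id) VALUES (?, ?)")) {
                    ps.setString(1, orderId);
                    ps.setString(2, customerId);
                    ps.executeUpdate();
                }
                // The outbox row is what Debezium captures and routes to a Kafka topic;
                // its payload is fully under the application's control.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload) "
                                + "VALUES (?, ?, ?, ?, ?::jsonb)")) {
                    ps.setString(1, UUID.randomUUID().toString());
                    ps.setString(2, "order");
                    ps.setString(3, orderId);
                    ps.setString(4, "OrderCreated");
                    ps.setString(5, orderJson);
                    ps.executeUpdate();
                }
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```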
Randall Hauch
0:33:54
So I also think that there are a lot of different ways that people want to use change data capture, they want to capture these event streams, and people have different ways of using their data stores. There will be some users that have essentially large transactions where they're updating lots of records at once, and for those kinds of patterns it's very difficult to put all of those events into, like, create sort of a super event, simply because of the size of what that super event might be. But I think a lot of the cases are around these smaller transactions. A purchase order, I think, is a great example, where there's a relatively limited number of rows or entities that are changed in a data store. And so I think oftentimes it does make sense, it is a common pattern, to be able to reconstruct what those super events might be that are more transactionally oriented rather than sort of
Tobias Macey
0:34:52
individual entity changes. And for somebody who wants to get Debezium set up and integrated into their platform, what are the components that are required, and some of the challenges that are associated with maintaining it, keeping it running, and going through system upgrades?
Gunnar Morling
0:35:09
So the main component, or let's say the main runtime, we target is Kafka Connect. Kafka Connect is a separate project under the Apache Kafka umbrella, and it's a framework and also a runtime for building and operating connectors. You would have source and sink connectors, and Debezium essentially is a family of source connectors, so they get data from your database into Kafka. So you would have to have your Kafka Connect cluster. This typically is a cluster, so you have, again, high availability and these kinds of qualities, and you would need to deploy the connectors there. Then we expose some metrics and monitoring options, so you always can be aware of, okay, what is the current status, is the connector doing a snapshot, which table is it at in the snapshot, how many rows has it processed, or is it in streaming mode, you know, which was the last event, what is the kind of lag you would observe between the point in time when the event happened in the database and when we processed it. So you have all those monitoring options. In terms of upgrading, that's a very interesting question, because there are many moving parts, right? There's the Debezium version, maybe you do schema changes while the connector is running, consumers must be aware of all these things. So that's something you have to be careful with, and we always document the upgrading steps very carefully. Let's say you go from one Debezium version to the next, you would have a guide which describes, okay, that's the procedure you have to follow. And in general, we try to be very cautious and careful about not breaking things in an incompatible way. That's actually very interesting: when I took over from Randall, I think the current version was 0.5 or 0.4, so some early 0.x version, and you would think it was very early in the project lifecycle, but still people were already using this heavily in production, and it has only grown ever since. So now if you do an upgrade and you put out 0.8, 0.9, 0.10, well, we have lots of production users already, and of course we don't want to make their lives harder, right? So we are very cautious; if we have to change something, we deprecate options, so it's not a hard break, you have some time to adapt to certain changes and this kind of stuff. And now just very recently, yesterday literally, we went to the 1.0 version. So that's the big one, the final version we have been working towards for a long time. And now we are even more strict going forward about those guarantees, so that's the message structure and, you know, there are now guarantees in terms of how this evolves in future upgrades and this kind of thing.
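As a small operational example of the monitoring options mentioned above, the sketch below polls the Kafka Connect REST API for a connector's status. The connector name and Connect host are illustrative assumptions; Debezium additionally exposes richer metrics (snapshot progress, streaming lag) over JMX, which are not shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatusCheck {
    public static void main(String[] args) throws Exception {
        // Kafka Connect's REST API reports whether the connector and its tasks are
        // RUNNING, PAUSED, or FAILED, which is useful for dashboards and alerting.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors/inventory-connector/status"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```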
Randall Hauch
0:37:50
Yeah, Debezium, and really event streaming in general, they're complicated distributed systems, because they rely upon a number of already distributed components. So with Apache Kafka, you've got an Apache Kafka broker cluster, and that in itself is non trivial in terms of setting it up, keeping it running, and making sure you don't lose anything. There's the Connect cluster, as Gunnar mentioned, and then there are the connectors that are deployed onto that Connect cluster. And they all can be upgraded at different times and at different levels. And so, you know, the Kafka community does, I think, a really fantastic job of maintaining backward compatibility, and we try to do the same thing, encouraging the connector developers to maintain backward compatibility, so that as connectors are upgraded from one version to the next, users don't have to change a configuration in order to do the upgrade, or after they do the upgrade. And I think Debezium does a really great job of that.
Gunnar Morling
0:38:52
Yeah, and let's say I have not heard too many complaints about us breaking things, so that's a good sign. Another variable there is the topology of your database. Typically people run their database at least in some sort of HA cluster, so they can fail over from a primary to a secondary node in case there's a hardware failure, this kind of thing. So this means Debezium typically also must be able to reconnect to this new primary node, so to the earlier secondary, now newly promoted primary node. We need to be able to support that, and that's something people are already using; the community has given me feedback there, but it's still something that we also want to test more systematically ourselves with this kind of setup. And sometimes people don't want to connect Debezium to their current production database. Maybe they feel they just want to give it a try, and they don't want to connect it to the primary database right away. So there are ways, for instance with MySQL, where you could have a MySQL cluster, use the MySQL replication to replicate among the nodes within the cluster, and then use Debezium connected to one of the secondary nodes. Obviously this adds a little bit of latency, but then you have this decoupling between the primary database and Debezium, if that's what you are after. So that's, again, one particular deployment scenario which people use and which we need to test, as it's another moving piece in the puzzle.
Randall Hauch
0:40:21
And different connectors, as they talk to a particular DBMS, place different loads on that source system. So, you know, the load on a MySQL database might be a certain amount, and the load on a SQL Server database may be different, and that also affects how many connectors you can run, because each of those connectors will essentially be reading the database logs and placing a load on the database. So a little bit of that depends on how the relational database system exposes that functionality to Debezium and other CDC clients, and that's one other factor to consider. And really, you know, Debezium sounds really complicated because it is, right? When I started Debezium it was with an emphasis on solving enterprise data problems, and anytime you're doing that, you have to deal with all these fault tolerance issues and network degradation issues. So it's very important to get those guarantees right. That was sort of why we chose Kafka, and it's why the connectors are written the way they're written. And when somebody picks up Debezium, or even some of the other connectors, it can seem overly complicated when you're just, you know, hey, I've got a local database and I just want to try this out, I want to see what the streams of database changes are. And in order to get there, you sort of have to set up Kafka and Connect, you have to deploy a connector, and then you have to consume the events. At this very small proof of concept scale, it can seem a little overwhelming, but all of that infrastructure is really the best way to get enterprise scale functionality and change data capture that is tolerant of the faults and degradation throughout the system
Tobias Macey
0:42:14
that every enterprise has from time to time. And I think it's interesting that you call out the close connection between the design of Debezium and its initial build target of Kafka, and I'm curious if you have explored, or what the level of support is for, other streaming backends such as Pulsar or Pravega, or if you've looked at any other architectures or deployment substrates for the Debezium project itself.
Gunnar Morling
0:42:43
Yeah, actually we do. So definitely Kafka is currently what, I don't know, 95% of the people are using Debezium with. But actually one thing which we had, even when I took over, is what we call the embedded engine, and this allows you to use the Debezium connectors as a library within your Java application. And then, essentially, well, you don't have these guarantees like the persistence of Kafka and so on, but it's a way to consume change events in your application. So you just register a callback method. And now, interestingly, we see people using this embedded engine to connect to other kinds of substrates, as you call them. So definitely people are using this with Amazon Kinesis, I'm aware of people using it with NATS. And then, as you were mentioning Pulsar, that's again a very interesting case, because they even have Debezium support out of the box. There's a counterpart to Apache Kafka Connect, which is Pulsar IO. It's essentially the same thing, a runtime and framework for connectors, but targeting Apache Pulsar. And now, if you get Pulsar, you already get the Debezium connectors with it, and they are running essentially via Pulsar IO in this case. So that's definitely interesting to see. And it's also on our roadmap going forward to abstract a little bit more from the Kafka Connect API, so it would be even easier, or maybe more efficient, to implement those kinds of use cases where you would like to use something else.
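For context, this is roughly what the embedded engine usage described here looks like: the connector runs inside your own JVM and hands each change event to a callback. The class names and configuration follow the DebeziumEngine API as it exists in newer releases, so treat this as a hedged sketch; earlier versions used a slightly different EmbeddedEngine API, and all connection details here are placeholders.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedCdc {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-mysql");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // Offsets and schema history are stored in local files instead of Kafka when running embedded.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.history", "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");
        props.setProperty("database.hostname", "localhost");     // placeholder connection details
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "dbz");
        props.setProperty("database.server.id", "5400");
        props.setProperty("database.server.name", "dbserver1");

        // The callback receives every change event as a JSON key/value pair.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> System.out.println(event.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);   // runs until the engine is closed
    }
}
```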
Tobias Macey
0:44:23
And what are some of the most interesting or unexpected or innovative ways that you've seen Debezium used?
Gunnar Morling
0:44:29
So one thing I'm seeing is huge installations, and I find that very interesting. I know of one user where they have 35,000 MySQL databases, so 35K MySQL databases, and they stream changes out of those databases using Debezium. Typically these are multi tenancy use cases, so they would have, like, one schema with the same set of tables for each of their tenants, and that's why they have this large number of logical databases. Then the other day somebody came and said, oh, I have, you know, a couple of hundred Debezium Postgres connectors, and now I have this particular problem, so please help me. And first of all, it's always very interesting, because we don't really have a way to test with a couple of hundred connector instances, so getting this kind of feedback from those people is very valuable. In this case they said, okay, I have this huge number of tables, and this is running just fine, it's just that my connector needs 80 gigabytes of heap space, so that's a bit too much, can we do something about it? And then we took a look and we thought, okay, there's this sort of metadata structure which we maintain for each table column, and they had loads of columns, and we could optimize this a little bit, we could remove some redundancy, and they went from 80 to 70 gigs of heap space. So that's a nice improvement with essentially just one line of code change for us. So I would say having those large installations, that's very interesting to see. Oftentimes people build their own control plane. I know of people, they're also in this multi tenancy business, so they need to spin up hundreds of connectors regularly, and it's just not feasible to do this by hand, right? So they have all these automated scripts to do that, and they have some monitoring layer so they can see all their connectors in some nice sort of dashboard. And it is really fascinating to see how far people push things to the limits there. And in your own experience of helping to build and grow the Debezium project and community, what have you found to be some of the most challenging or complex or complicated aspects?
Randall Hauch
0:46:41
Randall, do you want to take it? Yeah, I think starting out, when I created the Debezium project, there was really nobody there. That's very typical with open source projects, and that's the way Red Hat sort of does its research and development. But there were people that had the same problems and they wanted to collaborate and work on a solution, and there were more people that wanted to just use a solution, whether it was available or not. So slowly the community sort of started building up, and as the community started building up, the connectors became more stable, had more functionality, had sufficient functionality for more and more use cases. And I think the great thing about the Debezium community is that it's growing quickly now, at a rate that increases every year, and there's a very large number of people that have contributed to Debezium, and probably a significantly larger number of users of the Debezium connectors. And I think over the last few years the project has done a fantastic job of engaging the community and getting people to collaborate and work together to build these connectors that are very valuable for lots of people.
Gunnar Morling
0:48:00
Yeah, absolutely. I mean, the community building part, that's also what I find the most interesting, I would say. It's always interesting to do the technical side of things and solve a very complex problem there, but really growing this community and seeing how it gets bigger and bigger, that's really fascinating. And one thing I'm very proud of: it's a very friendly community. I don't remember any incident, you know, any sort of flame wars going on or whatever. People are helpful, and what I always like to see is when somebody comes to the user chat and they have a question, and then somebody else from the community helps them out. That's really great to see, and people help out on the mailing list too. And also, as Randall mentioned, many people contribute. Right now we have about 150 different people who have contributed. Sometimes it's a small bug fix, but then there are other people who stick around. For instance, they use, let's say, the SQL Server connector, so they have a continued interest in making it better and better. And in the case of the Cassandra connector, this is even led by the community, so it's not Red Hat engineers who are leading this effort, but people from a company called WePay. They open sourced this under the Debezium umbrella and they are leading this effort. So this is great to see, how all those people from different backgrounds, different organizations come together, and we all work together in a joint effort to make this open source CDC platform a reality.
Tobias Macey
0:49:23
All right. Are there any other aspects of the Debezium project or change data capture that we didn't discuss yet that you think we should cover before we close out the show?
Gunnar Morling
0:49:32
I think we discussed everything that's relevant, I would say. If you are interested, come to debezium.io. You can find all the information there, and you can find links to the community there, and, well, we can get any discussion started.
Tobias Macey
0:49:46
All right. Well, for anybody who does want to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, starting with you, Randall, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Randall Hauch
0:50:03
I think, really, I mean, we have a lot of different components for using data in different ways. And, you know, enterprises are capturing a lot more data than ever, and being able to use that effectively, and to have sort of a collective understanding within companies of what data they have, what data processes they have, the relationships around provenance and data governance, you know, we're still making progress there, but there's just a very large ecosystem that we have to get our hands around. I think we'll be continuing to struggle with that and to make progress, maybe more slowly than we would prefer.
Tobias Macey
0:50:50
And Gunnar, how about yourself?
Gunnar Morling
0:50:51
Yeah, I would say I see two main challenges, and they're in pretty similar areas. I think if you have a large organization, there tend to be many, many data formats flying around. There are different ways data is exposed, in synchronous ways, asynchronous ways; you have APIs, you have data streams, you have Kafka, maybe different fabrics like other messaging infrastructure. So just being able to see what's going on, what are the data formats, who is owning them, who is allowed to see them, and who can see which part of the data, answering all these questions, I think those are the key points. How do formats evolve, how long should format versions stay around? So having some sort of insight into all those data formats and schemas and structures in your enterprise, having an answer for that, or having good tools for that, would be a very valuable asset to have.
Tobias Macey
0:51:46
All right. Well, thank you both for taking the time today to join me and discuss your experience of building the Debezium project and helping to grow the community around it. It's definitely a very interesting project, and one that I have seen referenced numerous times and that has been mentioned on the show a number of times. So thank you both for all of your efforts on that front, and I hope you enjoy the rest of your day.
Randall Hauch
0:52:05
Thank you very much, Tobias. Absolutely.
Gunnar Morling
0:52:07
Thank you so much for having me. It was a pleasure.
Tobias Macey
0:52:14
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.