Summary
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry
Interview
- Introduction
- How did you get involved in the area of data engineering?
- What is the schema registry and what was the motivating factor for building it?
- If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas?
- How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options?
- Conversely, what would be involved in using a storage backend other than Kafka?
- What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure?
- What are some of the biggest challenges that you faced while designing and building the schema registry?
- What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations?
- What are some of the features or enhancements that you have in mind for future work?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Kafka
- Confluent
- Schema Registry
- Second Life
- Eve Online
- Yes, Virginia, You Really Do Need a Schema Registry
- JSON-Schema
- Parquet
- Avro
- Thrift
- Protocol Buffers
- Zookeeper
- Kafka Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Ewen Cheslack-Postava about the Confluent Schema Registry. So Ewen, could you please start by introducing yourself?
[00:01:23] Unknown:
Sure. I'm Ewen Cheslack-Postava. I'm a software engineer at Confluent, where we're building a streaming platform with Apache Kafka at its core, and part of that was, and continues to be, working on the schema registry.
[00:01:38] Unknown:
And how did you, first get involved in the area of data engineering and data management?
[00:01:43] Unknown:
Sure. So actually, in a sense, I've never done data engineering and management myself. I got into distributed systems via a sort of unusual path. I started out in graphics in academia, so I actually have an academic background. And in graphics, I ended up on a project working on large scale virtual worlds, which took me into distributed systems, and then from distributed systems I found Kafka, found Confluent three years ago, and saw a lot of interesting distributed systems problems to solve.
[00:02:17] Unknown:
And obviously, with a storage system like Kafka, a lot of those are going to revolve around data engineering and management. And when you're talking about large scale virtual worlds, are you talking about things like Second Life and Eve Online, that kind of variety of virtual world?
[00:02:34] Unknown:
Yeah. Exactly. So it could vary between the more social ones, which is things like Second Life, or more game-like ones, such as Eve Online. And my research was basically targeted at how to visualize such large scale worlds more effectively.
[00:02:50] Unknown:
And what sort of, focus for visualization was that? Is it in terms of the network traffic between servers or the ways that the data is distributed geographically or is it sort of a time series visualization of interaction patterns? I'm just curious.
[00:03:05] Unknown:
Yeah. Sure. So this is mostly about the rendering aspect, but that, as you mentioned, has a variety of problems associated with it. The fundamental problem is there's just too much data to process that quickly. So I guess in a sense I've been working in big data for longer than I let on. But a lot of the problems there are basically how do we either simplify things or decide what not to display, in order to make it possible to actually download all that content and then for the GPU to actually be able to render it.
[00:03:36] Unknown:
It's very interesting stuff. And you mentioned that through the work that you were doing with all those distributed systems, you landed with Kafka, and you're now working for Confluent, and you're the primary contributor to their schema registry project. So I'm wondering if you can just briefly talk a bit about what Confluent does and what was the motivating factor for building the schema registry and how that factored into the work that you were doing with Confluent.
[00:04:01] Unknown:
Sure. So I guess to start with, Confluent is building a streaming platform around Apache Kafka. Confluent was founded by three LinkedIn engineering team members that built Kafka at LinkedIn. And the schema registry was actually one of our first products; it was included in the first release. The motivating factor for it is that the founders saw the same set of problems play out at LinkedIn around scaling up the use of this sort of centralized streaming data infrastructure. There are some really powerful things you can get out of having a distributed but centrally managed system for storing all of your streaming data, your event data. But it gets complicated when you have something that's shared by, you know, thousands of developers, to be able to do even something as simple as keep track of: I have data in what Kafka calls a topic.
That's like a stream of data, and with thousands of developers it's really hard to keep track of even what the format of the data is or what the data represents. And so the idea behind the service is that we can also centralize the management of that metadata about the data: what the format of the data is and, you know, documentation about what it represents. These days a lot of the time we consider any sort of centralization a bad thing, but in this case it's actually really powerful, and it enables three different things. The first is that you can actually validate compatibility of schemas as they change over time. Inevitably you're gonna see, you know, my first version was missing a field that I wanna store there, and I wanna be able to add it to the schema.
What you wanna make sure is that you do it in a compatible way, though, so that when you end up with both pieces of data having to be processed by the same application at the same time, it's still possible to do that processing against both of them without having to have, you know, eighteen different variations of the same code to handle the different structures. So that compatibility ensures a sort of data cleanliness, and makes it a lot simpler to write applications that are able to support processing both old data and new data. So that ensures good quality data. The second thing that it allows you to do, and this is one of the reasons Kafka becomes really powerful when you put it at the center of your streaming data pipelines, is it allows you to then discover the format of the data.
So I'm able to just go in and check with the schema registry what the actual schema is (in the case of the schema registry, this is for Avro), and I don't have to really go talk to anybody in order to do that. I'm able to just query for that information, get the schema back, and then potentially, you know, start using the data without ever having to coordinate with anybody. And that actually turns out to be super powerful, because it allows you to decouple teams from each other and removes a lot of the overhead there is in figuring out how to process data. So you can imagine that there are cases where, for example, you have one team that's writing a specific application and they want to log some events from it. So they're logging, for example, every time there's a user interaction; page views is a very easy, accessible example.
They may store that data into Kafka, store all of those events, but then frequently what we see is multiple downstream users also consuming that data. And they never have to coordinate in order to make use of that data. So it's a very powerful way of making data as accessible to anyone in the company as possible. And related to that is the third thing, which is that you no longer have to have this sort of central schema planning. It turns out, I believe this happened at LinkedIn, and I know it happens elsewhere, that if you don't have this tool in place and you don't have guarantees around compatibility and discoverability, almost universally what ends up happening is you end up with some team that's sort of the blessed team that's allowed to make changes to schemas, and you have to go through reviews with them, and basically making any change or improvement becomes really painful and slow because it requires interaction with a bunch of other people.
By having the schema registry in place, you have enough guarantees that you can feel a lot safer about independently evolving the schemas. And even when you do have those teams that are effectively coupled because they're sharing the same data, they're still able to operate and evolve the schema, so that the one team that was creating all the data is able to evolve the schema because they have these guarantees that they won't break anything downstream, as long as they, you know, evolve it in a way that the schema registry guarantees is compatible.
So that set of things is both the motivating factor for building it, because the founders saw how this played out at LinkedIn, and also just the set of benefits that you get from doing this. It allows you to scale up your data management to a very large organization without really coupling any teams together or requiring any real overhead.
[00:09:18] Unknown:
So does the schema registry enforce the schema at write time, so that if somebody is trying to submit to a given Kafka topic that has a schema associated with it and the data isn't in an appropriate format, does it reject that data and cause a write error, in order to ensure that the integrity of the topic is consistent with the schema that is associated with it at that particular state and time and version of the schema?
[00:09:43] Unknown:
Yeah. That's exactly right. So the first step is generally that when you initially publish some data to the topic, there's no schema stored, so of course you're going to be able to write that schema. But the way this works in practice is you write with the KafkaProducer, and your serializer basically goes and checks with the schema registry and tries to register the schema. If there's already a schema registered, it'll check whether it's compatible or not. And if it's not compatible, you'll get an error code from the API and then an exception back in your application. If it is compatible, you're actually allowed to just evolve the schema right there. And in terms of compatibility, we might want to get more into this later, but there are a couple different types of compatibility, and you can configure which one you want.
And so you have this sort of toggle to say how strict you are in terms of your compatibility rules.
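For readers who want to see roughly what that producer-side path looks like, here is a minimal sketch using Confluent's Avro serializer, which is the flow being described. The topic name, schema, and broker/registry URLs are invented for illustration, and exact configuration keys can vary between releases, so treat this as a hedged sketch rather than canonical example code.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
  // Hypothetical schema for illustration; any valid Avro record schema works here.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
      + "{\"name\":\"user_id\",\"type\":\"string\"},"
      + "{\"name\":\"url\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // The Confluent serializer registers/validates the schema with the registry on
    // first use and then caches the returned schema ID for subsequent sends.
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");

    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    GenericRecord record = new GenericData.Record(schema);
    record.put("user_id", "user-42");
    record.put("url", "/index.html");

    try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
      // If the registered schema for this topic's subject is incompatible with the
      // one above, the send fails with a serialization exception in the application.
      producer.send(new ProducerRecord<>("pageviews", "user-42", record));
      producer.flush();
    }
  }
}
```

The compatibility check itself happens inside the serializer, which is why an incompatible change surfaces as an exception in the producing application rather than as a broker-side error.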
[00:10:37] Unknown:
And is there any metadata that gets written into the record at the time of submitting to the topic that indicates which version of that particular schema is being used at the time of submission? So that if you are going back and reading through an entire topic from the beginning all the way through, you can determine what the schema was at the particular time that the data was written, because I know that with Kafka there are occasions where you might want to reprocess an entire event stream from the beginning.
[00:11:10] Unknown:
Absolutely. Yeah. And you need some sort of ID in the message in order to even just be able to decode it. This is sort of a quirk of the way that Avro works. Not all serialization formats work this way, but you have to have the schema that was used to write the data in order to be able to read it. That makes Avro very efficient, but it comes with the cost of having to have the original schema. So what actually gets written is a format that's very simple. It's just a single magic byte, which is there so that we can ensure that we're actually looking at the serialization format that we expect, and then it has an integer ID, which is the schema ID. Now, that doesn't actually tell you the specific version of the schema. Or sorry, I should say it uniquely identifies the schema so that you can go look it up and decode it; it doesn't tell you the version within that topic, because usually that's not actually needed, so you'd have to go through an extra step to look that up. But most applications don't need that version information. They just need the schema so that they can decode it.
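The envelope being described is small enough to sketch: one magic byte, a four-byte schema ID, then the Avro-encoded payload. The snippet below is a hand-written illustration of parsing that envelope, not the project's actual deserializer code, and it assumes the wire format as documented at the time.

```java
import java.nio.ByteBuffer;

public final class WireFormat {
  private static final byte MAGIC_BYTE = 0x0;

  /** The pieces of one serialized message: the registry ID and the raw Avro bytes. */
  public record Envelope(int schemaId, byte[] avroPayload) {}

  public static Envelope parse(byte[] message) {
    ByteBuffer buffer = ByteBuffer.wrap(message);
    if (buffer.get() != MAGIC_BYTE) {
      // The magic byte exists so the format itself can be versioned later if needed.
      throw new IllegalArgumentException("Not in the expected serialization format");
    }
    int schemaId = buffer.getInt();      // 4-byte schema ID, unique across the registry
    byte[] payload = new byte[buffer.remaining()];
    buffer.get(payload);                 // the Avro-encoded record itself
    return new Envelope(schemaId, payload);
  }
}
```

A consumer resolves the ID against the registry (and caches the result) before handing the payload to an Avro decoder.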
[00:12:15] Unknown:
And I know that with the Avro format, one of the strengths is the fact that it does allow you to colocate the schema with the data as it's being written. So I'm wondering, what are the benefits over and above the built-in Avro capabilities for being able to track and manage schemas?
[00:12:33] Unknown:
Sure. So the Avro format actually has two pieces to it. There's how do I serialize a single record into an array of bytes, and then there's the actual on-disk format. I think what you're referring to is the on-disk format, where the format of the file is essentially: in the header of the file, just write the schema, and then have a series of records that are serialized with that schema. It may be a little bit more complicated than that, but that's the basic gist. So the problem with doing something like that in Kafka is that, actually, a lot of times schemas will end up being bigger than the data itself.
And in Kafka, you can imagine you're writing a single record at a time, which means that you would have to dump the entire schema in with that record, then write that one record, then move on to the next record and do the same thing. So you're just putting a lot of redundant information in there. What having the schema ID and schema registry allows you to do is substitute the entire schema with a single integer. That's not that big a deal to be repeating; your data is probably large enough that that's not a high price to pay, and you're still able to go out and find that schema again even though you don't have the entire copy of it, you just have a unique ID. So in a sense, it's actually kind of a requirement in order to use Avro with Kafka. If you wanna be able to evolve the data over time, you need some record of what schema was used to encode it, but you're not going to want to dump the entire schema into every message, so you need some other mechanism.
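As a rough sketch of that substitution using the Java registry client: registering a schema returns the small integer ID that then stands in for the full schema text in every message. The class and method shapes shown here (CachedSchemaRegistryClient, register) reflect my reading of the client library and may differ between versions, and the subject name and schema are made up.

```java
import org.apache.avro.Schema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class RegisterSchemaExample {
  public static void main(String[] args) throws Exception {
    // Illustrative only: constructor and method signatures may vary by release.
    SchemaRegistryClient client =
        new CachedSchemaRegistryClient("http://localhost:8081", 100);

    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"user_id\",\"type\":\"string\"}]}");

    // Registering returns the integer ID that gets embedded in each message
    // in place of the full schema text; consumers look the schema back up by ID.
    int id = client.register("pageviews-value", schema);
    System.out.println("Registered schema ID: " + id);
  }
}
```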
[00:14:08] Unknown:
Yeah. I can definitely see how having the registry is a good way to decouple the data and reduce the overall storage overhead associated with keeping all that information. And I'm wondering how you settled on Avro as the serialization format and the schema format to support, and what are some of the other options that you considered, thinking in terms of things like Parquet for the actual on-disk format or things like JSON Schema for being able to define the actual schema itself?
[00:14:42] Unknown:
Sure. So we talked a lot about this during the initial product design phase. At LinkedIn, Avro made a lot of sense and they adopted it, because a lot of the systems that they were working with are big data systems, and in the Hadoop ecosystem, Avro is fairly pervasive. The trade-off is that outside that world, if you're a regular application developer, it's not nearly as well known. It has support across a variety of languages, but maybe not all of the languages that you would want to use it with. So there's definitely a trade-off here. The other thing that I would say is popularly requested is JSON Schema. It's actually kind of funny, because it's people applying schemas to an intentionally schemaless format, so it feels a little backwards, but it's a good thing that they want that support. In terms of what we built, we built what we honestly thought would best address the needs of our users, many of whom are coming from a big data background, so they are either already familiar with Avro or it fits in well with the rest of their ecosystem. That said, we're more than open to adding other support; it's just a tough thing to find time for with a long list of product features that people are looking for. I've also heard requests for other serialization formats as well, for example protocol buffers and Thrift, and there are a couple of other more fringe ones that we occasionally get questions about. We're definitely open to adding more support, and I'd like to see the JSON Schema support. To be honest, one of the other factors here is that the Avro spec has very well defined compatibility rules and a built-in understanding of how to project between different versions of schemas, and the tooling is already there as well. This is built into the Avro library; it isn't something that we had to build for the schema registry itself. You don't have that in things like JSON Schema, and JSON Schema is actually a fairly complicated spec.
It's actually more complicated than Avro, because you can apply other constraints on there, saying things like this field has to be within this range of values. And there's a bunch of complicated stuff like that that would make doing these compatibility checks a lot more complicated than it is in Avro. So I think we could extend support to it, but I couldn't say a particular time frame because it's a pretty substantial project.
[00:17:04] Unknown:
And in terms of the compatibility that you've mentioned, what are the sort of dimensions along which you can evolve a given schema in order to ensure that messages that were submitted at an earlier time can be coerced into the new schema format?
[00:17:21] Unknown:
So actually, the first thing I'd say here is I would highly encourage people who are interested to go read the Avro spec. It's a very straightforward, relatively simple, and well written design doc and specification, and it includes information about how this works. It's a very simple list of how you take two schemas and resolve them. In terms of the operations that you can do, it's mostly the obvious things that you would think of. So, I'm missing a field and I wanna add one. That would be a backwards compatible change. What backwards compatibility means is that with a newer version of the schema, I'm still able to read old data, which you can imagine is going to be a popular option because it allows you to read all historical data even on the most recent version of your application.
And so adding a field with a default value is backwards compatible, because when I load some data written with an old schema, even though that field is not there, we have a default value that we can fill in, so we still have a path to getting it into the newer format. There are some other things: removing a field is also going to be backwards compatible, because when I read an older version, I just ignore that field and that's fine. It is possible to rename fields in a backwards compatible way, but you can't do things like changing default values. Avro is also kind of unusual in that it supports a union type, by which I mean you can say this field can be either an integer or a string. If you add another type to that union, say it can be an integer or a string or a long, that's also a safe operation, because you're just adding to the set of types that you're allowed to decode there. There are probably some more of them; extending enum types in certain ways can be a safe operation as well. But the vast majority of the time, what people care about is just adding and removing fields.
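To make the add-a-field-with-a-default example concrete, here is a small sketch using the compatibility checker that ships with the Avro Java library, which is the kind of built-in tooling mentioned above. The record name and fields are invented for illustration.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

public class CompatibilityCheck {
  public static void main(String[] args) {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"user_id\",\"type\":\"string\"}]}");

    // v2 adds a field WITH a default, so a reader using v2 can fill it in
    // when decoding data written with v1: a backwards compatible change.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"user_id\",\"type\":\"string\"},"
        + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"\"}]}");

    // Can a reader holding v2 read data that was written with v1?
    SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
    System.out.println(result.getType()); // expected: COMPATIBLE
  }
}
```

Dropping the default from the new field, or changing the field's type incompatibly, flips the result to INCOMPATIBLE, which is essentially the check the registry runs (per its configured compatibility mode) before accepting a new version.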
[00:19:26] Unknown:
And for the schema registry itself, right now, it's fairly I don't necessarily wanna say tightly coupled, but it's very closely associated with Kafka as the storage back end for being able to propagate those schemas and, you know, associate the schemas with the data being written. Are there any other storage back ends that you've considered expanding support to with the schema registry? Or for somebody who isn't using Kafka, are there alternative tools or technologies that people can use for being able to achieve similar, outcomes?
[00:20:00] Unknown:
Sure. So I guess there are two aspects to being tied to Kafka. The first is actually the storage layer. For those who aren't familiar, the way that the schema registry actually works is that it allocates a single-partition topic, and that's where we store all of the schemas. Basically, each schema is just a record in that topic. That makes it super simple, because if you have Kafka, then getting up and running with the schema registry is just one basically stateless app that you have to deploy, which, you know, people are used to doing. That's easy. So in terms of the storage layer, that is actually already abstracted out. In fact, there's a second implementation, an in-memory store, in addition to the Kafka store implementation, which, as you can imagine, just holds all of the data in memory. This is useful both for testing, and actually we effectively use it as a cache inside the schema registry. That store interface, though, while we haven't exposed any configuration in order to make it pluggable, basically has all the support you need. And the API surface area is really small; it kind of amounts to a hash map, because that's effectively what the schema registry boils down to. It's just a map of IDs and versions to the actual schemas themselves. So that's the storage side.
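As a purely hypothetical illustration of how small that storage abstraction is (these are not the project's actual interface or class names), the store really does amount to a map plus a way to iterate keys:

```java
// Hypothetical shape of the pluggable store described above -- not the project's
// real interface, just an illustration of how small the surface area is.
public interface SchemaStore<K, V> {
  V get(K key);
  void put(K key, V value);
  void delete(K key);
  Iterable<K> keys();
}

// An in-memory variant really is just a map; a Kafka-backed variant would write each
// put to a single-partition topic and replay that topic on startup to rebuild state.
class InMemorySchemaStore<K, V> implements SchemaStore<K, V> {
  private final java.util.concurrent.ConcurrentHashMap<K, V> map =
      new java.util.concurrent.ConcurrentHashMap<>();

  public V get(K key) { return map.get(key); }
  public void put(K key, V value) { map.put(key, value); }
  public void delete(K key) { map.remove(key); }
  public Iterable<K> keys() { return map.keySet(); }
}
```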
We basically just need to add a config in order to make it really pluggable. I've very occasionally heard requests for, can we just use MySQL or something, because we already have a MySQL database that would be handy to use for this; we can just reuse that piece of infrastructure and not have to deploy Kafka itself. That would be a reasonable thing to do. And in fact, the entire design of the schema registry was intended to support using it with other systems in addition to using it with Kafka. The storage is an implementation detail, not something fundamental that means you can only use this with Kafka. You could use it with, for example, basically any legacy messaging system. If you needed to still pass some data through there, it would be reasonable to use the exact same format and APIs that we use for Kafka. So that's one aspect of it, which is storage. The second aspect is the rest of the implementation of the distributed version of the schema registry.
And that would be a harder thing to replace in terms of implementation. So while I think you could plug in a different storage back end, you would still currently need ZooKeeper or Kafka for coordination. The reason for this is that the way it's designed is to make it as easy as possible for the client, and the way that we do that is we have a single master architecture where one of the schema registry nodes is elected the leader. In order to do that we need some sort of coordination system, so it would be a much bigger change to replace that. So right now it probably still works best if you're going to have Kafka in your environment anyway. But it is definitely feasible that you could do this some other way. So I think you also asked about other options if you don't have Kafka in your environment.
So I would actually say still use the schema registry. There are alternatives available in the sense that you can use other approaches. For example, a common alternative solution here is to just check everything into git, or whatever your version control system is. It doesn't give you the exact same benefits, but it does give you probably the most important one, which is some way to validate compatibility with old versions. And because of this, one of the best ways to do it, if you're gonna do it in git, is to make sure that you actually save a copy of all the old versions in the current source tree, and have a tool that goes through, during a pre-commit hook or something, and validates that any new versions are compatible with all the old ones in the way that you want them to be. It also gives you a sort of central, discoverable place for storing those formats, so if you need to find out what the format of some data is, you're able to go do that easily. If you're doing this in git and you're in a many-repo world, I would highly recommend putting all of those into a single separate repo that everybody depends on; or if you're in a single-repo world, then you wanna put them all in one central location in terms of directory structure, just so that, in terms of coordination and having to bug other people to find out the format of data, it's not like you're grepping through the entire source tree or anything. You know exactly where you have to go for it. Now, I would say, however, that this is a somewhat suboptimal solution. It's not that great.
It actually is pretty close to what you see in some other systems. For example, it's pretty common to have database schemas and migration scripts checked into your repository, frequently for every previous version. That's basically trying to accomplish the same thing in terms of compatibility, and in that case repeatability of deployments, and guaranteeing that if you do change the schema, you have a way to convert existing data, which is similar to the ability to convert data on the fly when your application is reading it from an old format to a new format. Other than that, I would say you're basically down to using the Avro compatibility checks that the Avro library provides and doing your own validation somewhere in your pipeline. I think there's at least one other open source schema registry that I'm aware of, but it comes with effectively the same operational overhead, so you're still going to have to run it. I guess the other thing I would say is, don't overestimate how difficult it would be to run these three things. If you're talking about a very lightweight solution and you have an approach to back things up, then you could potentially just deploy a ZooKeeper and a Kafka on the same node and then just deploy a schema registry against that. So it's not like you need a minimum of six nodes just to run the schema registry if that's the only thing you're currently using Kafka for. That's definitely an option. We don't normally recommend that people run ZooKeeper and Kafka on the same node, and, obviously, generally with Kafka you want to get all of the nice replication features, but for that particular use case, if it's just to run the schema registry, you might be able to get away with a very lightweight deployment, such that it's not that big a deal. And if it's that lightweight, generally the maintenance cost is not very high.
There's not a lot of things that can go wrong when your traffic is, you know, registering a schema every once in a while.
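For the check-it-into-git alternative described above, here is a sketch of what a pre-commit style check could look like using Avro's SchemaValidator. The directory layout, file naming, and sorting convention are assumptions; adapt them to your repository's conventions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

public class ValidateAgainstHistory {
  public static void main(String[] args) throws Exception {
    // Assumed layout: schemas/pageview/v01.avsc, v02.avsc, ... where the highest
    // version is the newly proposed schema. Adjust to your repo's conventions.
    File dir = new File("schemas/pageview");
    File[] files = dir.listFiles((d, name) -> name.endsWith(".avsc"));
    Arrays.sort(files); // relies on zero-padded, lexically sortable version names

    List<Schema> history = new ArrayList<>();
    for (File f : files) {
      history.add(new Schema.Parser().parse(f)); // fresh parser per file
    }
    Schema proposed = history.remove(history.size() - 1);

    // "Can a reader holding the new schema read data written with every old one?"
    SchemaValidator validator =
        new SchemaValidatorBuilder().canReadStrategy().validateAll();
    try {
      validator.validate(proposed, history);
      System.out.println("Proposed schema is backwards compatible");
    } catch (SchemaValidationException e) {
      System.err.println("Incompatible change: " + e.getMessage());
      System.exit(1); // fail the pre-commit hook / CI job
    }
  }
}
```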
[00:26:31] Unknown:
And one thing that I briefly wanna mention for the listeners, because I'm not sure if we stated it explicitly: the schema registry itself is open source and publicly available for anybody to download and use, if they did want to add some of the support for alternative storage mechanisms or any other particular modifications that would potentially be useful.
[00:26:54] Unknown:
Yeah. That's right. Confluent has a wide variety of open source software. This is infrastructure stuff, so, you know, we definitely know that you don't wanna get locked into a completely proprietary solution when this is the thing holding all of your data.
[00:27:08] Unknown:
And what are some of the biggest challenges that you faced while designing and building the schema registry?
[00:27:16] Unknown:
A few different things. So the first one is something that we already discussed, which is just deciding what format to use, because Avro has those compatibility checks well designed and built in. That was a pretty big one. Another item, actually, since we were just talking about the storage back end a little bit ago: the storage back end wasn't a tough decision. That's an example of one where it was easy. One of our goals with the way we build services is to aim for services that have as few moving parts as possible, and since we were working under the assumption that you're building around Kafka anyway, given obviously what our company does, not requiring an additional storage service was kind of a no-brainer. Other challenges, though, would be the high availability and multi-DC design.
And there are a couple of different ways that we could have gone here. For example, do you have one global schema registry that's completely global and spans DCs across the entire globe? Or do you have multiple, and then somehow, as you're moving data across DCs, you're able to rectify the two and convert between the formats or the schema IDs for different data centers? Our conclusion was that one global schema registry would be much simpler to reason about. You don't have to think about where your data was produced just to figure out how to decode it. This is particularly important because usually, when you're getting into multi-DC solutions, at some point you're aggregating the data into a single DC in order to do additional analysis that requires the global data set, and that would get a lot messier if we had gone with the multiple separate deployments approach. You obviously can still deploy it that way, but we designed around the fact that you would want to deploy a global schema registry. One of the important results of designing for the global schema registry mode is that we made sure that you could serve reads even if you get disconnected from the source of truth.
What I mean by that is that the schema registries across all the data centers will load up the entire data set, and the assumption is that it basically fits in memory so that we can hold on to it as a cache, which isn't crazy because your total schema data set is not gonna be very large. And then, even if those nodes get disconnected from the central Kafka cluster that stores the data that they're loading from, they're still able to serve requests to fetch schemas by ID. So you'll still be able to consume data even if that data center gets disconnected. The thing you won't be able to do is register new schemas, but this is a reasonable trade-off, because you can imagine that if you're having some sort of network outage between data centers, probably your biggest problem is not being able to deploy an updated version of your code that includes a change to the format of data.
So more than likely what you would just have to do is hold off on deploying any changes. But in truth, the likelihood of having those two things happen at the same time is pretty low anyway, and in business terms it's not really a significant problem if you have a decent deployment setup. So that's the high availability and multi-DC story. And then more recently, one of the things, and this isn't really a design decision, is that we were originally completely reliant on having access to ZooKeeper, and that's actually how different schema registry instances coordinated to elect that one leader that's allowed to actually write new schemas. From a simplicity perspective, that's something that we weren't necessarily super happy about, because, as I said, we're trying to have as few moving parts as possible, but also as few things that you have to understand and configure the connections between. So having ZooKeeper there wasn't necessarily great, but at the time it was a necessary evil. We've actually added support now, and this will be coming out in the next release,
for using Kafka's group protocol in order to coordinate and elect a leader. And this is actually functionality that I think a lot of people aren't aware we generalized in Kafka. It used to be that the consumer group functionality could only work for consumer groups; it was tied to this idea of subscribing to topics. We abstracted out one layer of that, so you can separate coordinating groups from what they are doing on top of being in that group. We actually did that when I was writing Kafka Connect, because that's how Kafka Connect workers coordinate with each other, but then we reused that functionality again in the schema registry so that you can basically now configure it so you just point it at Kafka, and the coordination all works for free. And so the configuration and setup actually got a lot simpler. So I think those are the high points. The format was the biggest core decision in terms of what we support first and then what we generally support, or could support, long term. And then the high availability and multi-DC design, and the trade-offs you make in terms of, you know, global availability and how that failover works.
Oh, actually, sorry, I should step back. I did forget one thing about the multi-DC design. One of the trade-offs with the global approach is that it does make a failure event a little bit harder to work with. There's a bit of a manual process here, where, when you have a global cluster, if you wanna ensure that the leader is always in a specific data center, you have to configure the other data centers to effectively be not master eligible. And this is commonly used if you have a multi-DC setup where you actually want to be able to fail over all of your systems dynamically in order to handle some outage.
But the process for doing that is a little bit manual, and part of the reason for that is that going with the global cross-DC approach makes it a little bit more complicated. You have to actually reconfigure the nodes in the backup cluster in order to make them eligible to be leaders again, in order to serve writes again. So there's a little bit of a trade-off there, design wise.
[00:33:43] Unknown:
And is there any particular tipping point, in your opinion, in terms of the scale or complexity of a system or organization at which it makes sense to invest in the schema registry? Or, conversely, if you're smaller than a given, you know, size and complexity of system and number of people, does having the schema registry not really provide any large benefits as compared to the effort of deploying and maintaining it?
[00:34:14] Unknown:
So I guess there are a couple of ways to approach this question. One is whether you have schemas at all. And if you don't have schemas, you're basically just continuously building up tech debt, and honestly it's just very risky. So I would discourage ever not having schemas in the first place. It's also pretty painful because then, when you decide you do need schemas, you've gotta transition from a schemaless format to a format with schemas, and that's a mess. The point about complexity, though, kind of ties back to what I said earlier, where if you wanna scale down the Kafka deployment, you can do that in a way that actually makes the schema registry pretty cheap and easy to maintain anyway.
But let's say that you aren't even comfortable pulling up the two or three nodes, and that's just still too much. I think the primary alternative, honestly, is just using version control to act as your central repository. And that can work; it'll work for small teams. I think the major drawback here, and this ties back to it being worth thinking upfront about your data formats, is that if you're not careful about building in some sort of versioning and ability to migrate to an updated format eventually, it can be very difficult to then transition over to using something like a schema registry as you grow. And that actually ties to what I mentioned earlier with the format of the data. One of the reasons we have the magic byte there is not only to validate that we're looking at the format of data that we expect, but also so that we can change that magic byte to something else eventually, if we need to, in order to update the format with some new features. For the schema registry, that's kind of unlikely, but it's a key thing when designing these serialization formats or protocols or anything like that to build in that versioning from the get-go.
And so if you do decide to use a central repository, the other thing that I would recommend is making sure you have that versioning, and basically just use the same format that we do, which is documented in the schema registry docs. That makes it at least possible to migrate over eventually. And then in terms of size, you know, I don't wanna give a concrete number, because it obviously depends on the organization, for example your familiarity with Avro and that ecosystem. But even for a small, say five to ten person organization, the discoverability that you get from having that centralization can be a pretty significant win just in terms of reducing coordination. So basically, once you're past the point where everybody looks at all the code, it's probably time to make sure you have some organization at least set up, even if that does end up being some place central in version control, and make sure it's a shared library of those schemas.
That helps not just in terms of being able to discover the format of data, but it also encourages reuse. So if I go into that directory and see that, oh, somebody already added a schema for this type of data and I just wasn't aware of it because I hadn't started on it yet, it gives you the opportunity to reuse that and standardize, and end up with fewer schemas overall, even if you're pretty small, you know, five, ten, fifteen people. So, in terms of the tipping point for a central repository, I would say it's very, very low before you start getting value out of it. In terms of the schema registry itself, maybe it's a little bit higher if you're, you know, not super comfortable running ZooKeeper and Kafka and then also adding on the schema registry.
[00:37:57] Unknown:
And also having that canonical source of what the different schemas and data formats are across your applications and organization helps with onboarding for any company that is small and is looking to grow in any way, shape, or form, because then somebody who's new to the application can come in and just very quickly see these are the types of data that we're dealing with, and these are some of the places that it might be used.
[00:38:22] Unknown:
Exactly. And so, like, even let's say you're not using it with Kafka. I mean, this is probably less likely, although you could do it, but let's say you're using it with HDFS or something like that. You can have an intern come in and very easily figure out how to get the data and what the format of the data is. And I think of this in terms of interns because it's one of those examples where it's sort of like, oh, you know, we might give them a project that we think is a good idea to do, but we haven't found time to do it, and it's leveraging this data set that we have from some existing application. That's a classic example of, rather than having them have to go to that other team, who maybe isn't the team managing the intern, and coordinate with them to figure out all of that stuff,
their manager can just direct them to the central registry, and then they just have to figure out, okay, well, there's this topic or directory or whatever, and it has these schemas in it. And, you know, with Avro schemas in particular, they can and should include documentation for all of the fields and everything. So it's basically a fully documented index of all of the data that's available in your company, and it makes it super easy to get up and running.
[00:39:38] Unknown:
And what are some of the upcoming features or enhancements that you have in mind for the project?
[00:39:44] Unknown:
Sure. So I mentioned one of them already, which is making it so that you don't have to use ZooKeeper for coordination between the different instances when you're running multiple instances. Let's see. Other ones would be more security features. This is, I mean, not super important when you're getting started with the schema registry, but if you're really trying to scale it up in an organization, having control over access control lists matters, basically if I want to restrict schema registration to only happen from my CI/CD pipeline. Or, and I think this is less common, you could even lock down read access if you have some sort of sensitive data in some of your schemas. So those are things that are definitely coming up soon. Another thing that would be on my roadmap would be the support we talked about earlier for other serialization formats.
I personally would really like to see protocol buffers, but I think JSON Schema is probably more likely in terms of popularity. But, yeah, protocol buffers, Thrift, XML, all of those could potentially be included. And then, outside the service itself, we kind of lump these together frequently, but there's the service itself, and then there are clients to that API and serializers built around those clients. So I'd like to see better or improved language support. We currently have support for Java, C and C++, and Python. We would like to fill out support for other languages; I think .NET would be a popular one, potentially Go, and there are probably a few others there as well. From my perspective, I'm thinking of it in terms of the Kafka clients that we also develop, so those are the languages we tend to be a little bit more focused on. I think those are probably the major features and the major directions.
To be honest, the schema registry in a lot of ways is actually a super simple project, because the core of it really is just registering and reading schemas. It's only a few different APIs, so there's not a huge amount of directions you can go. It's super important for your, I guess, data sanity, but the service itself doesn't have too much going on there. And I guess I should also mention, since you brought up the storage back end, that pluggability is something I'd like to see. I don't know that it makes sense from a Confluent product perspective, but it's definitely something that, you know, if we had a contribution from the open source community, we'd be more than happy to get that in there.
[00:42:13] Unknown:
Are there any other aspects of the project or related topics that you think we should talk about before we close out the show? I think we hit the major ones. I guess the only other thing I might bring up is
[00:42:26] Unknown:
that, obviously, this is part of the bigger Kafka ecosystem. When we're talking about the schema registry, we tend to get a little bit stuck in the low level details, like, oh, this is what the serialization format is and this is how a producer and consumer interact. We're actually trying to move the level of abstraction that people are operating at a little bit higher. And I guess one thing we haven't really talked about is how the schema registry helps you in terms of your larger data pipeline. I talked about it in the large, in the sense of how it helps you scale an organization and scale their development, but there's also a matter of how the different pieces ultimately tie together. I tend to think of it in terms of ETL pipelines currently, because that's how I break down, in my mind, the ways that you use Kafka.
So, you know, this ends up integrating with Kafka Connect, which is sort of the Kafka import/export framework, or what I refer to as the import/export framework. It's basically about connecting to other data storage systems, or other systems generally, so it allows you to get data into Kafka or out of Kafka. And obviously an important consideration there is the actual serialization format and keeping track of schemas, especially in systems where the schema may not be fixed. A good example of this is if you're loading data from a database into Kafka, where you basically wanna see every change as an event: literally every row that changes comes in as a separate event into Kafka. That is effectively a dynamic schema, because the schema is defined somewhere else and it may be updated over time. There's effectively automatic integration via the serialization layer that we have in Connect. But one of the nice things there is that it also means that the schema registry, in a sense, ends up documenting other systems' schemas for you when they're not doing it for you. So instead of having to go to your database to get the schema, you might just look at the schemas registered in the schema registry for that import process. And the same thing goes for taking data from Kafka into other data systems.
I think the thing that sometimes gets lost in the bigger picture is that it's actually documenting your entire pipeline, even beyond what you hold in Kafka itself, which can be very powerful because it gives you a single place to look, even if not everything is stored in Kafka. If it's passing through Kafka at some point, you get automatic documentation of your entire data pipeline's data formats, even when the thing you might be going to look up a schema for actually lives in, say, a database; you can still do it through the schema registry. And then, of course, it ties together with transformations and all of that, so it integrates seamlessly with all of your stream processing applications, for example. And I guess maybe the one last thing is, you know, I think it's really important to think upfront about data design, serialization, how you deal with compatibility, and how you're gonna evolve schemas over time. It sounds like it can be very daunting, but in practice, a lot of what deploying the schema registry means is: run it and then use the right serializer.
Then basically don't think about it again until you see something incompatible failing. So it sets you up to be in a good place, but it doesn't necessarily have to be a huge investment upfront.
[00:45:51] Unknown:
Alright. Well, this is a very fascinating topic and a very interesting tool and approach to a widespread problem that some people may not realize that they have yet. So for anybody who wants to follow the work that you're up to and keep in touch, I'll have you add your preferred contact info to the show notes. Sure. And as a parting question to give the listener something to think about, from your perspective, what is the biggest gap in the tooling or technology that's available for practitioners of data management today?
[00:46:24] Unknown:
First of all, I think it's a little bit hard to pin it down to one specific gap, because data management actually covers a wide variety of technologies, tools, and, I guess, types of systems as well. In the big data space, I think a lot of the tooling has at least been filled out recently. So, and this is, you know, clearly biased given the company I work for, but the biggest gaps tend to be around tooling for real time systems. And in particular, I think one of the big gaps is basically visibility and understanding.
We build a lot of tools, and this is not surprising because we're still early in the stream processing space, but we build a lot of tools that solve one very narrow, specific problem. And while I think that's the right place to start, what you end up with is a lot of different pieces, and in a large organization you'll also end up with just a lot of applications deployed in parallel, possibly depending on each other in ways that are actually pretty hard to understand. Basically, you end up with nobody in the company who can tell you: if I change this thing, what is the impact gonna be? And so I think that's the biggest gap. It's sort of open ended, and it's not a specific tool per se that's missing, but being able to understand all of those dependencies and how things interact is probably the biggest gap, because it's the biggest risk in making any change. I think the schema registry is one part of addressing that problem, but the schema registry is totally focused on serialization formats and compatibility of data, whereas I think there's a bigger tooling problem where even being able to understand how data flows through all of those systems is really challenging. I think there are probably some initial stabs at this, or very company specific solutions to it.
I think distributed tracing systems are an example where they actually do allow you to get a little bit better information about how, say, a particular request flows through various systems. But I think we're still taking the initial baby steps with tooling like that, and there's probably a lot more to do there that would help people understand. I mean, it sounds a little bit crazy just to help people understand what they've built, but it's easy to get lost in the complexity.
[00:48:50] Unknown:
Alright. Well, thank you for your time and the work that you've been doing on the schema registry. It's definitely a very interesting project, and I'm curious to see how it evolves and grows particularly if and when you have time for adding support for additional schema mechanisms. So thank you again for taking the time out of your day, and I hope you enjoy the rest of your evening. Thanks. It was a pleasure.
Introduction and Sponsor Messages
Guest Introduction: Ewen Cheslack-Postava
Ewen's Journey into Data Engineering
Overview of Confluent and Schema Registry
Schema Registry Functionality and Benefits
Choosing Avro and Other Serialization Formats
Storage Backends and Alternatives
Global Schema Registry and Multi-DC Design
When to Invest in Schema Registry
Upcoming Features and Enhancements
Importance of Data Design and Serialization
Biggest Gaps in Data Management Tooling
Closing Remarks and Future Prospects