Easier Stream Processing On Kafka With ksqlDB - Episode 122

Summary

Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Many data engineers say the most frustrating part of their job is spending too much time maintaining and monitoring their data pipeline. Snowplow works with data-informed businesses to set up a real-time event data pipeline, taking care of installation, upgrades, autoscaling, and ongoing maintenance so you can focus on the data.

Snowplow runs in your own cloud account giving you complete control and flexibility over how your data is collected and processed. Best of all, Snowplow is built on top of open source technology which means you have visibility into every stage of your pipeline, with zero vendor lock in.

At Snowplow, we know how important it is for data engineers to deliver high-quality data across the organization. That’s why the Snowplow pipeline is designed to deliver complete, rich and accurate data into your data warehouse of choice. Your data analysts define the data structure that works best for your teams, and we enforce it end-to-end so your data is ready to use.

Get in touch with our team to find out how Snowplow can accelerate your analytics. Go to dataengineeringpodcast.com/snowplow. Set up a demo and mention you’re a listener for a special offer!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what ksqlDB is?
  • What are some of the use cases that it is designed for?
  • How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
  • What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
  • How is ksqlDB architected?
    • If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
  • What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
    • What dialect of SQL is supported?
      • What kinds of extensions or built in functions have been added to aid in the creation of streaming queries?
  • How are table schemas defined and enforced?
    • How do you handle schema migrations on active streams?
  • Typically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
  • Can you talk through an example architecture that might incorporate ksqlDB, including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?
  • What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
  • What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
  • What is involved in deploying and maintaining an installation of ksqlDB?
    • What are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?
  • When is ksqlDB the wrong choice?
  • What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
  • What is in store for the future of the project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything, from installing your pipeline in a couple of hours to upgrading and autoscaling, so you can focus on your exciting data projects. Your team will get the most complete, accurate, and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake, and real-time data streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you're a listener for a special offer. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka. So Michael, can you start by introducing yourself?
Michael Drogalis
0:02:11
Yeah, thanks for having me, Tobias. My name is Michael Drogalis. I'm a product manager at Confluent, where I work on stream processing, kind of the direction and strategy of both ksqlDB and Kafka Streams.
Tobias Macey
0:02:24
And do you remember how you first got involved in the area of data management?
Michael Drogalis
0:02:27
Yeah, so it kind of goes back to college, actually. I think my first interest in distributed systems was in some of the papers around Erlang at the time, which weren't necessarily new, but were kind of cool because they were new to me. And I think kind of one thing led to another with different papers and areas of research. A couple of years after I finished college, I was excited enough about stream processing, which was pretty nascent at the time in 2012 or 2013, that I started to build an open source streaming platform in Clojure named Onyx. I worked on that for a couple of years, which was pretty cool because there wasn't a whole lot of motion going on in stream processing at the time; Storm was kind of new, but there wasn't much else. I ended up working on that for a number of years with a guy named Lucas Bradstreet, and eventually he and I ended up founding a business on top of it named Distributed Masonry, where we built what was effectively tiered storage for Kafka on top of Onyx. A couple of years later Confluent acquired us, and that's kind of how we got to now.
Tobias Macey
0:03:27
Now you're overseeing the development and product direction of ksqlDB. So can you start by just giving a bit of a description of what it is?
Michael Drogalis
0:03:36
Yeah, I think the best way to understand what it is is to take a look at the problem that we saw. We look at a lot of applications of streaming, and sort of the overall trend in the world is that everyone wants things to be more immediate, more real time, not just on a personal level, but you also see this in a more pronounced way in businesses. The way that you shop and you bank and you hail rides is actually very different than it was, like, 10 years ago. And it kind of gives you this illusion that a company is, like, all hands on deck just for you; it's very, very special. Lots of businesses are trying to pivot to this model, which is effectively businesses trying to scale in a near unlimited way with their market by taking humans out of the loop. And we think that the answer to doing this is events and stream processing. That's kind of what's powering this whole evolution of businesses being powered by software. But the problem is, actually pulling this off is really hard. If you look at what a typical event streaming architecture looks like, it's many distributed systems. Almost every time, you have separate distributed systems for event capture, for event storage, for stream processing, and then you have, like, a database for taking your processed events and aggregates and storing them, so you can issue queries for applications. And the problem is, when you're trying to build one of these systems, you have these three to five separate distributed systems, and they all have separate mental models for doing everything. They have different ways of securing and scaling and monitoring, and it's up to you, as the developer, to operate them as if they're one. But this is actually really hard in practice, because, you know, as a developer, every time you want to make a change, you need to think in these three to five separate ways all at the same time. And it's just completely exhausting. I like to say that it's like you're trying to build a car out of parts, but none of the manufacturers have talked to each other. So we wanted this to be a lot simpler; we actually wanted everyone to be able to do event streaming, much in the way that everyone is able to build a web application with Rails or microservices with Spring. And we understand that people sort of wanted to get back to something a bit simpler, like they had with, you know, just the database and these three tier applications. That's kind of what led to us building ksqlDB. So what it is, is it's an event streaming database for building stream processing applications, and it has sort of very few moving parts and has everything that you need to be able to build these applications. It has the ability to connect to different data sources on the outside, it uses Kafka for storage, it has its own stream processing runtime, and it's able to serve queries to applications. And so it's very lightweight, and it sort of acts as a database for stream processing, to get you something a lot simpler.
Tobias Macey
0:06:16
And just to give some context, about how long ago did this project get started, and what was the landscape like at the time as far as the availability of querying streaming data using SQL interfaces?
Michael Drogalis
0:06:30
I think the project itself is maybe two or two and a half years old; it actually predates my time at Confluent. Early on, KSQL, the predecessor to ksqlDB, was sort of this SQL dialect that just compiled to Kafka Streams. And that was actually a pretty good time, because people, like in 2016 and 2017, were still figuring out how do you do this stream processing thing. The APIs were pretty complicated. There were different sorts of APIs being fostered in parallel, trying to figure out, like, what's the best way to work with streaming and batch data at the same time? So SQL was a nice improvement, because it sort of got people away from the complications that were starting to evolve at the code layer. And so there wasn't a whole lot of innovation going on there; we saw this as a good incremental step in the right direction. But ksqlDB, we feel, is actually the right, you know, major step in the next direction, where it's kind of something categorically new that builds on what we already had.
Tobias Macey
0:07:22
There are a few different products that are available now that, at least at surface level, are similar in terms of the capabilities to what ksqlDB is offering as far as being able to use a SQL interface on different streams of data. Notably, I'm thinking about the Pulsar SQL layer, or there's another product called PipelineDB, which I know got acquired, so I'm not sure what the current status of that is, or most recently there's a product that got launched called Materialize that offers the ability to generate streaming materialized views of information as it's being piped from relational databases using Debezium. So I'm wondering if you can just give a bit of a comparison in terms of the capabilities and primary focus of ksqlDB relative to some of those other projects?
Michael Drogalis
0:08:13
Yeah, I think it's good that everyone's realizing this is a problem. I think all these products actually signal that people realize how important stream processing is and how complicated it tends to be. One of the things that I really like about KSQL — ksqlDB, rather — is that it's built on Kafka Streams, which is a very battle tested stream processing runtime. And so we're able to take advantage of years of effort that's gone into reliability and fault tolerance and scalability. One of the things that we get for free with Kafka Streams is that if, you know, you have a stream processing topology and you terminate your query, or whether you pause it or you stop it, and then you bring it back up on another node, it's actually able to recover all of its intermediate state. And we didn't really have to do very much work at all on ksqlDB to make that work; that's actually just something that we inherit from Kafka Streams. So there's a lot of history that we're building on. But I think the bigger thing is that we actually want to unify, you know, our entire platform as one. We've actually been working for quite a long time at different pieces of the puzzle to be able to do event streaming. And I think a good example of that is, in the Kafka ecosystem there's a component called Kafka Connect, which allows you to move data into and out of Kafka from a range of systems; it sort of gives you the framework to do this exactly once and in a fault tolerant manner. And so over the last year Confluent has been working really, really hard to build up the range of connectors. It's super important that if you're going to use a system like this, you've got to have easy ways to get the data into and out of the system. And so we've built up now more than 100 connectors out of the box; I think we're almost at, like, 110 at this point. When you use ksqlDB, we actually have support for connectors built right in, and not just two or three connectors, actually all the connectors. And so you can talk to a variety of databases or SaaS applications or cloud providers, kind of anything you can think of. We're just trying to make it really easy, and so we put a lot of work into really leveraging the whole ecosystem. It's not as if we're just, like, laser focused on stream processing; we're actually very focused on making sure that we're taking advantage of all the other pieces that we're using. And that's kind of one of the ways that we're trying to make stream processing easier for everyone.
Tobias Macey
0:10:15
In terms of the motivation for building this unified layer on top of the overall Kafka ecosystem, I'm curious how that has driven the priorities of building ksqlDB and where the areas of focus are, some of the edge cases that existed in trying to tie all these underlying systems together by yourself that ksqlDB is addressing, and some of the challenges that you faced in terms of trying to provide an easy to use interface on top of all those different systems and tie them together into a cohesive whole.
Michael Drogalis
0:10:55
Yeah, I think the driving pain point is asking the simple question: can a developer do it? Can a developer build a stream processing application on a Friday night with a pizza, where you just want to, like, make something fun in a couple of hours? And the answer for the most part is no. You spend a couple of hours trying to, like, change the configuration that's supposed to affect several different systems, and you just tear your hair out trying to make it work. And I've had plenty of nights like this, and I'm actually, like, rather experienced with stream processing, and yet it's super hard for me. So for someone who's a newcomer, how can this possibly be tractable? Our motivation is something simple enough that you could trust it in a hackathon, and you know what the bar for that is: you would never dare use something that is not completely understandable or reliable in a hackathon, because your life will be really hard. So we wanted something that was easy to understand, something that had far fewer moving parts. Another piece of this is, when you start to integrate all these systems together, what you often want to do is actually have two kinds of queries. You want to have the database-like point-in-time lookup queries, where you just ask a question and get an answer, and then you want these streaming type queries that are powering these almost subscription-like behaviors in the real time apps that you use. And so when we put all these systems together, that's often the goal: to actually unify these two pieces. With ksqlDB, we wanted one query language that can do both of them. So you're not actually trying to stitch these two different systems together and sort of, like, make this piecemeal architecture; we actually thought we could make an event streaming database that's able to do both of them natively. And that's kind of the rationale behind it.
Tobias Macey
0:12:27
So this might be a good time to just talk about the overall architecture of ksqlDB and how it interacts with and interfaces with Kafka and some of the other components in the Confluent ecosystem.
Michael Drogalis
0:12:39
Yeah, so as I was saying, it's built directly on top of Kafka Streams. And so what basically happens is, we have the notion of persistent queries. You can ask a query, and every time some data comes in, we keep these views or tables sort of incrementally updated. We take one of these queries when you submit it to ksqlDB, we compile that directly into a Kafka Streams topology, and we run it on one of the nodes in the cluster of servers. And so that's actually just, like, running Kafka Streams natively. And then all the communication is actually happening through the Kafka brokers. So you have the cluster of ksqlDB servers doing all the processing and the cluster of Kafka brokers doing the storage; that's sort of how they line up. The big question that we get a lot is, how do you handle state? We use an approach where we actually use embedded RocksDB on each of the ksqlDB instances, so as we're creating state, we're doing sort of a two phase thing where we're keeping your state in storage locally on disk in RocksDB for fast access, but we also store the entire changelog that backs the table in the Kafka brokers, which are replicated and fault tolerant and durable. So if you lose a ksqlDB server, you can fail over to a different one that's able to replay the whole changelog into the new server. We do some optimizations where sometimes we can actually skip that changelog replay process, but that's sort of how it works at a fundamental level if you take all bets off the table.
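For context, here is a minimal sketch of the kind of persistent query being described, using hypothetical stream and column names (it assumes a pageviews stream already exists). The resulting table is materialized in RocksDB on the ksqlDB servers and backed by a changelog topic in Kafka:

```sql
-- Hypothetical example: a persistent query that keeps a view incrementally updated.
-- The aggregate state lives in embedded RocksDB on the ksqlDB server, and the
-- changelog that backs the table is stored in the Kafka brokers for fault tolerance.
CREATE TABLE pageviews_per_user AS
  SELECT user_id,
         COUNT(*) AS view_count
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;
```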
Tobias Macey
0:14:00
You mentioned that the Kafka Connect libraries are built right into ksqlDB. So is that just part of the deployed binary once you get a ksqlDB instance running, so that you don't have to manage those separately, or is that still something that you would configure at the Kafka layer, with ksqlDB just taking advantage of those inputs and outputs from the system?
Michael Drogalis
0:14:22
That's the idea. So when you download ksqlDB, what you're able to do is actually add connectors to the classpath. Connectors are JARs that you add to your classpath, and once you have them on the classpath, you can use some syntax in ksqlDB to say create source connector or create sink connector. And we offer that up in two ways. We think, like, one of the easier ways to use it out of the box is actually to configure nothing else: you put the connector on your classpath, and then you're just able to go. You're able to say create this connector, and internally ksqlDB will actually run the connector for you on its servers. So there's no separate cluster; there's actually just one cluster doing both of these things for you. We do have lots of people who do high volume workloads, and once you get into sort of higher volume territory, you are very mindful of resource isolation. And so the same syntax, being able to say create source connector, actually works not only in this embedded mode, it also works against an externalized Connect cluster. And so we actually let you choose, and it's the same program for both. We think this is a pretty powerful way to let people go from beginner stages to sort of intermediate, and then all the way to high volume, mission critical use cases.

Tobias Macey

When you're dealing with multiple different systems that are all working together, even when they're designed to be consumed as an entire unit, there are some cases where you have some weird edge cases or some design considerations that you might not have made if you were to redo the entire system from scratch. So if you were to just rebuild the entire platform today, primarily focused on the ksqlDB use case, what are some of the things that you think you would do differently?

Michael Drogalis

That's a great question. I think one thing we would have done is taken a much harder line about what things look like at the top level. I think if you read through our documentation, a very fair criticism is that we don't really make a call around how much Kafka you need to know. And so sometimes you see these low level implementation details about Kafka sort of bubbling up all the way through. A good example is configuring auto.offset.reset, which is, like, the lowest layer in the Kafka client libraries for whether you consume from the beginning or the end of the stream. It's a super important concept, and it actually does need to surface in some form at the very top in ksqlDB, so you can, you know, really know which end of the log to consume from, but we actually do it in the most low level way. And so I think trying to make a call about how to make that much more unified would have been better. You can probably make similar arguments about product seams around partitions. Partitions are very, very important to every layer of the stack, but sort of the way that we handle each layer can be a bit different and a bit confusing, and those are the challenges that we need to work through. I think we've done a pretty good job so far in making sure that you don't get too much into the weeds when you use ksqlDB, but there are certainly cases here and there where, if we were to reconstruct it from the start, that would be something we keep a very careful eye towards.
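As an illustration of the syntax being described, here is a hedged sketch of running a connector from inside ksqlDB, using a hypothetical JDBC source with made-up connection details; the same statement can also target an external Connect cluster when resource isolation matters:

```sql
-- Surface the low-level Kafka setting for where to start consuming (CLI session property).
SET 'auto.offset.reset' = 'earliest';

-- Hypothetical embedded connector: ksqlDB runs it on its own servers, no separate
-- Connect cluster required (the JDBC connector JAR must be on the classpath).
CREATE SOURCE CONNECTOR orders_source WITH (
  'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'           = 'jdbc:postgresql://db.example.internal:5432/shop',
  'mode'                     = 'incrementing',
  'incrementing.column.name' = 'id',
  'topic.prefix'             = 'jdbc_'
);
```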
Tobias Macey
0:17:03
And with having everything vertically integrated and designed to interface together, it's definitely easier to have a system that is pleasant to use and easy to consume. But are there any common abstractions that you think we, as an industry, can orient around, where it would be possible to use something like ksqlDB as an interchangeable layer on top of other streaming systems?
Michael Drogalis
0:17:28
Yeah, that's an interesting question. The way that you communicate with ksqlDB, in terms of its abstractions, is streams and tables, and we think that that's probably the right way to model event streams in the long run. I think the jury is still out in academia on what the right abstraction for this is and how we get towards a unified model for doing all this. In the end, we would actually be very happy if there was a standard for how you express these things, and it would just be a lot easier to sort of interchange these different components and systems. And so I think, because streaming SQL is still so early, there's not really consensus on what that should look like, but it's something we're very much looking forward to.
Tobias Macey
0:18:06
In terms of the workflow of actually using ksqlDB and building something on top of it, what is involved in actually designing and building and deploying an application that's going to use ksqlDB as the execution context?
Michael Drogalis
0:18:20
There are three kinds of queries in ksqlDB. There are persistent queries, which are the continuous queries that people sort of know, where you're installing a query as a stream processor; that's, like, a server side query. And then there are two other kinds of queries, which are client side queries: there are push queries and there are pull queries, and they kind of take the perspective of an application. Sometimes you want to pull the current answer to you on demand, once; that's a pull query for us, like a database. And then there are push queries, when you want to have a subscription to changes as events come in; that's a push query for us. So when you're designing a program, you're probably starting out with these persistent queries. You boot up a ksqlDB server locally, and then you connect to it through the CLI. And this CLI is actually made to feel very similar to MySQL or Postgres, where you're sort of having this interactive, REPL-like experience where you're creating streams, tweaking them, modifying them, tearing them down, really just trying to get it right and, like, banging on it until it's good, kind of similar to Python in some sense. And as you get these things right, you'll actually just deploy them permanently. You'll send them off to a ksqlDB server, and it will take care of running these things reliably, even in the presence of faults. So that's kind of the first piece, where you're figuring out, what are my core streams and tables, and how do they connect together? And then the second piece is, how do I actually involve my application to leverage some of this data? Today, that is where ksqlDB presents a REST API, and you would sort of design your system to communicate with that by issuing these queries, where, you know, select from table where key is x — maybe give it to me in the streaming form, maybe give it to me just in a regular form — and you just build your app. You sort of get the answers back and you live in your own application framework. And so we're trying to be very conscientious of not making ksqlDB an island; we actually want to make it very easy for you to use the tools that you use today to build applications. So that's sort of what it looks like in terms of how you put together a full application flow. We are doing some work to make it really easy to integrate at the programming language level. In a normal database like Postgres, you actually just expect to be able to have, like, a driver or a library where you, you know, spin it up and give it your connection details, and then you have a nice, pleasant, you know, application level interface in, like, Java or Node or Ruby or whatever. And that's an area of work for us, where maybe this REST API continues to exist, but we actually have something a little bit more native from the programming language, where you feel even more at home in your favorite programming language.
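To make the two client-side query types concrete, here is a small sketch against the hypothetical pageviews_per_user table from earlier; exact pull query restrictions (such as requiring a lookup on the table's key) vary by ksqlDB version:

```sql
-- Pull query: ask a question once and get the current answer back, like a database lookup.
SELECT view_count
FROM pageviews_per_user
WHERE user_id = 'alice';

-- Push query: subscribe to the same data and receive every change as new events arrive.
SELECT user_id, view_count
FROM pageviews_per_user
EMIT CHANGES;
```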
Tobias Macey
0:20:46
And another element of that interface is the dialect of SQL. And so I'm curious what you're targeting in ksqlDB as far as what constructs are supported, what are some of the types of extensions or built in functions that you've added to simplify the work of operating in this streaming space, and what are some of the constructs that you've had to leave out because they make too many assumptions about the relational nature of the database?
Michael Drogalis
0:21:16
We do have something a little non standard. I could be wrong on this, but I think the initial thought was, let's take the grammar for Presto and run with that, because back when it was started, in 2016, maybe even late 2015 rather, there wasn't a whole lot of work into what the standard for this could look like. And so we do have something where it's similar to ANSI SQL, but with a few extensions. We do have some syntax around creating streams and tables, where they sort of wrap normal SQL; it ends up being pretty easy to parse if you just sort of, like, look at the nesting of these and the query bits. The query language itself is actually made to feel as much like ANSI SQL as possible, and that's something that we're trying to get to. I do think, as I said earlier, if we can actually standardize on this, it's better for everyone. The trouble is, though, there's really no notion of what streaming SQL looks like. There has been some work in academia around a model where we kind of take SQL as we know it today and then we augment it with a clause like EMIT CHANGES, which is essentially what we do; we've actually taken a step in that direction as well. And so we actually just want to make it as unmodified as possible. And if you think about why that's good, you actually want to take advantage of all the pre existing work that's gone into SQL. You have lots of BI tools, lots of graphing tools that already speak SQL through JDBC. And so it's better for everyone, actually, if everyone's speaking the same language, because you can latch on to this ecosystem of the language that everyone is already speaking. So, yeah, that's kind of where we're at with that.
Tobias Macey
0:22:44
Another aspect of the SQL interface is the definition of tables and schemas and the migration of those schemas as they evolve and add new columns or rename things or change types. And so I'm wondering how that is defined in ksqlDB and the different layers in the Kafka ecosystem that operate together to enforce those constraints on the data that's flowing through.
Michael Drogalis
0:23:11
So the way that you sort of express types in ksqlDB is through, yeah, like a schema language, where you say, like, create stream or create table, and then you say something that actually looks very similar to MySQL or Postgres, where you list off your columns and their types. And so that's the top level layer of enforcement. You asked about how you evolve over time; that's something that we're working on right now. It's a really interesting area of research, where it's by no means a solved problem. And there are different layers to this. When you think about what goes into actually modifying not just a streaming SQL statement, but really, under the covers, a stream processing topology, it kind of depends on what the compatibility is between the old statement and the new statement, and whether there is state. And so what we're working on right now, and we're likely to publish pretty soon, is a series of design changes sort of across the ecosystem to support this, not just in ksqlDB; it will require changes in Kafka Streams and actually probably a few changes to Kafka itself, which are good for everyone. We're thinking about this idea of, like, a user controlled event, where you can sort of insert markers into the stream. And this is very early, but the way that we envision this working is that it's actually very invisible to you at the top, if we get it right, where you can do some sort of schema alteration or evolution, say, you know, ALTER TABLE ADD COLUMN. And what we could do under the covers is actually spin up a parallel processing topology, have it churn through your data if necessary and, you know, run it with the new code, and then, at just the right time, cut over to the new one. Some people would probably call this, like, shadow processing, where you're running the new topology in the background and you're waiting for just the right moment to transactionally cut over in a safe manner. This is how people tend to do it today: they actually do this by hand, or they do this with some amount of programmatic guardrails, and it's complicated. It's sort of the best way people can figure out to do replay, which is actually really important to getting that story right. And so we think we're onto a way to do this in a much more elegant manner, and one that would save a ton of incidental complexity.
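For reference, a minimal sketch of the schema-declaration syntax being described, with hypothetical column and topic names (it assumes the underlying Kafka topic already exists):

```sql
-- Hypothetical example: declare a stream's schema over an existing Kafka topic.
-- Columns and their types are listed much like in MySQL or Postgres.
CREATE STREAM pageviews (
  user_id  VARCHAR,
  page_id  VARCHAR,
  viewtime BIGINT
) WITH (
  KAFKA_TOPIC  = 'pageviews',
  VALUE_FORMAT = 'JSON'
);
```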
Tobias Macey
0:25:07
That sounds somewhat similar to a conversation that I had with one of the developers on the Pravega platform as far as how they handle things like updates to the schema, where they just treat that as a separate stream with different checkpoints, and then you can rely on those checkpoints, and their coincidence with the checkpoints in the other streams, for being able to enforce those changes in schema as you evolve the code and being able to process older events based on the schema defined in the other adjacent streams.
Michael Drogalis
0:25:40
Yeah, that is a similar approach. And it becomes a really interesting exercise to think of how you do this when you are in the presence of incompatible state. So maybe I have an aggregation that looks one way today, and then tomorrow I want it to actually take a different shape. You know, you sort of get into this problem of, are these two states even compatible? Can I actually use any of the prior work that it did in the last stream processing topology in the new one? And that is actually really unclear to us, and I think it's unclear to most people who are researching that today. But I think that'll be a really cool area that comes out of the work we're doing.
Tobias Macey
0:26:11
Another expectation that people have when they hear anything that has DB at the end of it is the ability to do things like joins or have transaction support. And so I'm curious what level of capability exists in ksqlDB at the moment for those types of operations.
Michael Drogalis
0:26:29
Right now, ksqlDB is an eventually consistent database, which I don't think it's alone in. Databases are often transactional or have some level of consistency guarantees, even weak ones, but eventually consistent databases actually are out there, and many people rely on one every day in the form of the key value store in S3. People sometimes forget this, but Amazon S3 is eventually consistent, despite being, like, one of the most used data stores on the planet. It is something that we're actually looking at doing. I think a lot of people actually are able to get away with eventual consistency for streaming because their workloads do tend to be sharded anyway, but it's something that we're looking forward to improving over time.
Tobias Macey
0:27:04
And then another question is the aspect of the lifecycle of data as it exists within ksqlDB, how the streaming nature of it factors into the duration of storage that you might be able to rely on, and being able to lifecycle data out into other systems or have tiered storage or anything like that. And so I'm curious how that's handled in this ecosystem of the ksqlDB platform.
Michael Drogalis
0:27:31
This question comes up a lot with Kafka itself. And really, the root of the question is, how much data can you put in Kafka? Because it's providing the log primitive for you, to essentially provide the transaction log that you'd have in your normal database. And the answer is, kind of, it depends on what you're comfortable with. The classic example people tend to refer to is the New York Times, who store all of their data for all their publications in a single topic. And that can work; that can work if you have the sizing right and you actually have enough capacity and you have the right volume to be able to do that. And so we sort of inherit this in ksqlDB: it's sort of, what are you comfortable with? One of the upshots is that stream processing applications tend to have a bias to operate over a particular window of time. They actually care more about the here and now; they may care about things that are a week or a month old. But the models that have been developed to handle this with windowing have this idea of, like, shedding their previous windows out to permanent storage, and Kafka Streams actually can do this as well, where you're sort of, like, TTL-ing the window and you no longer keep it in storage. And so, yeah, I think ksqlDB sort of inherits this property. One of the more interesting things that we're working on, and I alluded to this at the beginning of the interview — my company had previously been working on tiered storage for Apache Kafka, and now we're here — is that one of the things that we rolled out in Confluent Platform, our enterprise product around Apache Kafka, is tiered storage for Kafka itself. And so we're really keen to be able to take advantage of that, not just at the user topic level, but actually also for intermediate topics and other places in our system. I think that's going to unlock a huge number of use cases for Kafka over the long run, and it's a really exciting development in the ecosystem.
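As a sketch of the windowed, time-bounded style of query being described, here is a hypothetical aggregation that only keeps recent windows around; the RETENTION clause is how newer ksqlDB releases express how long old windows are kept before being shed:

```sql
-- Hypothetical example: a windowed aggregation that cares about the here and now.
-- Old windows age out of local storage once they pass the retention period.
CREATE TABLE orders_per_hour AS
  SELECT item_id,
         COUNT(*) AS order_count
  FROM orders
  WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 2 DAYS)
  GROUP BY item_id
  EMIT CHANGES;
```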
Tobias Macey
0:29:10
Another aspect of the use cases for ksqlDB is where it sits in the overall ecosystem of someone's infrastructure or the overall data landscape, where I know that it has capabilities built in from the Connect platform for being able to consume data from other systems and deliver it to destination systems. But I'm wondering if you can just talk through an example architecture of somebody who has incorporated ksqlDB into their application lifecycle and their data lifecycle, and just some of the ways that it's being used and leveraged.
Michael Drogalis
0:29:44
Yeah, so I guess I'll answer the last part first. There are kind of two main use cases for ksqlDB: we see people building a lot of data pipelines with it, and then we see people doing this pattern of asynchronously materializing views. And when you put both of them together, it's actually a great design to be able to build an application on top of for stream processing. So that's, at a technical level, really the use cases that we aim at. And in the field, we've seen some really cool applications of this. We've seen people build transactional caches in core banking, we've seen people do event driven microservices for, you know, human resources SaaS software, we've seen streaming ETL for ride hailing and digital music providers and travel and tourism — a whole bunch of different things. So if you take one of these, you can sort of think of maybe, like, an online marketing company who's using data in real time to, like, be able to sell things. So maybe there are some source systems: you may be picking off messages from customers, or calls, or events from some, like, legacy platform. We often see people putting their data initially into, like, SQL Server, capturing it with Debezium, and then putting it into Kafka. And so where ksqlDB fits in is that you may actually want to build an application on top of all this incoming data. So you can create a Debezium connector that will talk to SQL Server and then move your data into Kafka, and then do some processing in ksqlDB, perhaps some aggregations. And then on your front end, you can actually use this unified query model for pushing and pulling data. So you can ask a question about your data and get an answer back, or you can do the streaming variant of that and subscribe to all the answers. And then the final piece that we see a lot of: it's not just source connectors, it's also sink connectors. ksqlDB doesn't want to be the only database; we actually want to make it easier to use other databases too, and that's another reason that connectors are so important. Where ksqlDB's query model isn't quite right, you can spill your data out to something like Elasticsearch, for example, and do more indexing on it and keep it in there for the long term. So we try to make it really easy to play with a variety of technologies. But really, in the end, you're spanning source systems and moving data into Kafka, doing some processing and then serving applications, and then pushing data out into other data stores for other kinds of indexing and processing.
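Here is a hedged, end-to-end sketch of the pipeline just described, with hypothetical table, topic, and host names, and with connector configurations trimmed to the essentials (a real Debezium or Elasticsearch connector needs additional connection and history settings):

```sql
-- 1. Capture changes from SQL Server with Debezium and land them in Kafka
--    (hypothetical connection details; other required settings omitted for brevity).
CREATE SOURCE CONNECTOR customers_cdc WITH (
  'connector.class'      = 'io.debezium.connector.sqlserver.SqlServerConnector',
  'database.hostname'    = 'sqlserver.example.internal',
  'database.dbname'      = 'sales',
  'database.server.name' = 'sales'
);

-- 2. Do some processing in ksqlDB: keep an aggregation continuously up to date.
CREATE TABLE orders_by_region AS
  SELECT region,
         SUM(amount) AS total_sales
  FROM orders
  GROUP BY region
  EMIT CHANGES;

-- 3. Spill the results out to Elasticsearch for richer indexing and long term serving.
CREATE SINK CONNECTOR orders_to_es WITH (
  'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
  'topics'          = 'ORDERS_BY_REGION',
  'connection.url'  = 'http://elasticsearch.example.internal:9200'
);
```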
Tobias Macey
0:31:56
And in terms of the capabilities and potential use cases for ksqlDB, what are some of the less obvious ones, or some of the less publicized capabilities or features that you think should be more widely publicized and that people should be more aware of?
Michael Drogalis
0:32:13
I don't think people quite understand the power that this unified query model offers you. The whole idea that we want to get to is that you can take the same query, ask a question, get an answer, but also just tack a clause on, EMIT CHANGES, to get the streaming variation of that. I think it's not going to be, like, a panacea; you're not going to be able to ask arbitrary queries like that, but we are sort of working towards making this more and more expressive over time. And then I think the other piece — and we've been talking about it the whole time because I'm so excited about it, but I think it still has yet to reach its potential — is the ability to run connectors right inside of ksqlDB. This is huge. Every time I've worked on a system, just getting the data in, step one, can be so painful. Now we make it quite easy: there's a tool, Confluent Hub, to be able to grab connectors off of the internet, put them on your classpath, and then there's simple syntax to run them. This alleviates a huge amount of pain that I've had working with stream processing over the years, and so I think that this is one of the most important things that we're doing.
Tobias Macey
0:33:09
In terms of the design and implementation of the streaming applications. What are some of the potential edge cases or pitfalls that users should be aware of as they're designing and implementing their applications?
Michael Drogalis
0:33:22
I think there are two things. So the first is that ksqlDB is sort of an asynchronous system, and so it's good at asynchronously computing views, and it is eventually consistent. The thing that may surprise people is that we don't do read-after-write consistency yet. And so you wouldn't really want to use this as, like, an OLTP store for your primary data; it's just not cut out for that. You actually use this as a replica of your primary data to be able to serve your applications. And then the second piece is something that we're working on right now — we've just cut some improvements for this in our last release, and we're continuing to make it even better — which is the availability of the serving layer. We're bounded by our failover time. This goes down to the architecture of Kafka Streams: you have several workers running tasks, and they are asynchronously materializing topics themselves. And so when you do a failover, the secondary worker may not be all the way caught up, and so you actually may have reduced availability because of that. This is something, again, that we're working on, and we actually think we can improve this quite a bit over time.
Tobias Macey
0:34:19
For somebody who is interested in deploying and maintaining an installation of ksqlDB, what are some of the operational characteristics and considerations of the system that they should be thinking about while designing the deployment strategy, and what is actually involved in getting it up and running if they don't already have a Kafka platform deployed?
Michael Drogalis
0:34:41
To probably answer the last piece first, you need Kafka and you need ksqlDB to rock and roll with this. And so with Kafka, that's sort of freely available on the internet; you can get that in a variety of forms. Docker is a pretty common way to do that, and Confluent Cloud is actually, like, another great way to get Kafka in a fully managed setting. So that is the first piece you need to do this, for storage. And then the second piece is ksqlDB, which today we just ship in Docker. You can run that in a standalone setting, you can run that in a clustered mode, but the idea is it sort of presents itself as a container, and you can link these things together. And then you have a series of remote processes that are available for taking queries over the REST API, or just over, like, a local command line. And I forget the other parts of your question.
Tobias Macey
0:35:25
Yeah. Also just curious about things like potential scaling limitations and issues with high availability and any potential bottlenecks that might exist in the deployment infrastructure for things like latency or just overall compute capacity.
Michael Drogalis
0:35:40
I think the biggest thing that I would advise people to think about is, like, what is the tenancy of your queries. This is a perennial problem with databases and data warehouses, where actually doing multi-tenancy is extremely hard, and you may have one query that just, like, nukes the performance of others. And this is actually something we've chosen not to work on a lot, because we don't think we're going to improve on the status quo very much when there's actually a lot of other interesting things to go after. So you're going to want to think about, like, you know, what is the resource utilization of this set of queries, and try to group together the queries for a single use case on a single ksqlDB cluster. You don't really want to load up all of your queries on one database. People have done that in the past with MySQL, where they use a single table for, you know, their online website and also for the analytics, and then their availability sucks, because they're actually just, like, crushing their server all the time. And so it's kind of a similar metaphor, where you actually don't want to put all your eggs in one basket, and you want to localize the queries that are a little bit more performance sensitive.
Tobias Macey
0:36:35
When is ksqlDB the wrong choice, and when would somebody be better suited going with just a traditional relational database or a data warehouse or some other type of streaming platform?
Michael Drogalis
0:36:46
I think there are probably two pieces there. So first of all, if you need strong consistency, today ksqlDB doesn't do that. With relational databases, people often need high degrees of consistency, and so turning towards a traditional database like MySQL or Postgres is great a lot of the time. And another thing, which is probably an unusual answer to your question, is if you want to do batch processing. We don't think that batch processing is completely dead. We actually think, maybe in terms of implementation, streaming is probably the right way to power batch processing, because you actually end up solving a lot of similar problems, but batch processing has continued to be useful. So one thing we try to advise people is, like, don't force fit your problem into streaming if it's not a streaming problem. Many problems are streaming problems, but there's actually a continual use for batch processing out there. So don't try to fit a square peg in a round hole; it just doesn't always work.
Tobias Macey
0:37:38
In terms of the use cases and usages of ksqlDB, what are some of the most interesting or unexpected or innovative projects that you've seen people build with it?
Michael Drogalis
0:37:47
One of the most interesting experiments that I've seen recently is someone doing anomaly detection on mainframes, and one of the reasons that I like the story so much is that they're taking really, really old technology and they're using really, really new technology to examine it. And so it's this interesting juxtaposition of taking things that I have never worked with in my life together with things that we're just starting to come out with. It's this really curious combination of the new and the old.
Tobias Macey
0:38:14
And what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of working with and on ksqlDB?
Michael Drogalis
0:38:23
I would say that new categories are hard, and this is a lesson that I've learned several times over. When I was working on my stream processor, very early in my career, stream processing maybe academically wasn't new, but in terms of industry, no one was really doing it, and so that was just confusing to explain. And then I learned this again with the company Distributed Masonry, building a streaming native data warehouse, where tiered storage for Kafka, again, wasn't something that people were thinking about. So it was, like, a new category, a new kind of technology — how do you explain it? And then here again, an event streaming database: it's, like, a new take on databases for the stream processing world. Trying to explain these categorically new things is extremely complicated. And so a lesson that I continually relearn is how to explain complicated things in a really simple way, and it is probably one that I will keep relearning for the rest of my life.
Tobias Macey
0:39:14
In terms of the product roadmap for ksqlDB and the future of the overall Kafka ecosystem where it resides, what do you have planned?
Michael Drogalis
0:39:26
I think the main thing that we want to do is just making it easier for engineers to be able to build stream processing applications, and we're laser focused on this. Our internal mantra is that we want to give people one mental model for doing everything they need to build a stream processing app. So most immediately, we're focused on reworking this external API layer. I mentioned today we have a REST API; we want something a bit more suitable for the long term, so we're working on what a better, more robust interface would look like. And along with that, we're working on first class client support. We're actually rolling out clients for Java and JavaScript, so you can just plug and play really, really fast, and we think we can refine this even further. So clients are one way that you can expand your footprint. We think another thing that's super popular is GraphQL. There's this front end world where everyone's doing event driven programming in, you know, React, and then in the backend world we have Kafka, where everyone's doing event driven programming, and these two communities don't really talk to one another. Anyone who's, you know, in the tough spot of having to make these two things communicate has to build this awful middle layer. And GraphQL has, interestingly, come onto the scene, where it's providing this, you know, data backend agnostic layer for doing queries, not only for regular queries, but also for subscriptions. And we looked at this and we said, that actually sounds very similar to what we're trying to do. So we're looking, in sort of an experimental way, at what it would look like if you married these two things up. What would it look like if you had GraphQL support over your ksqlDB data, and hence your Kafka data? We're really excited about that; we're going to sort of put out some work there really soon and see what people think. And then the last major piece that we're working on, as I mentioned, is schema evolution. I'm really excited about that; it's challenging work that I think is going to be very much rewarding. But we hope all these things make it easier for people to build stream processing applications in general.
Tobias Macey
0:41:19
Are there any other aspects of the ksqlDB project, or the ways that it's being used, or the overall efforts going into streaming applications that we didn't discuss yet that you'd like to cover before we close out the show?
Michael Drogalis
0:41:33
Oh, I can't think of any. That's a good set of questions.
Tobias Macey
0:41:36
All right. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Michael Drogalis
0:41:52
I think the biggest thing is increased retention for streaming data. I firmly believe that tiered storage is going to change the landscape of how people use Kafka. We really, really think that Kafka is a similar technology, in terms of importance, to the relational database, and so it is of utmost importance that people can store more data in it for longer and cheaper. And I think that this is one of the bigger gaps that's about to be solved, and it's going to be quite important over the next few years.
Tobias Macey
0:42:20
All right, well, thank you very much for taking the time today to join me and share your work and expertise on ksqlDB and the landscape of streaming applications. As you said, it's definitely a very important area for people to be exploring right now, so it's great to see that there are people out there trying to make that easier to achieve. So thank you for all of your time and effort on that front, and I hope you enjoy the rest of your day.
Michael Drogalis
0:42:43
Thanks for having me.
Tobias Macey
0:42:49
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!