Building The Materialize Engine For Interactive Streaming Analytics In SQL - Episode 112

Summary

Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
    • What was your motivation for creating it?
  • What use cases does Materialize enable?
    • What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
    • How does it fit into the broader ecosystem of data tools and platforms?
  • What are some of the use cases that Materialize is uniquely able to support?
  • How is Materialize architected and how has the design evolved since you first began working on it?
  • Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
    • What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
  • In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
    • A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
  • Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
    • What are the considerations and constraints on maintaining the full history of the source data within Materialize?
  • For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
  • What is the workflow for defining and maintaining a set of views?
    • What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
    • For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
  • The Materialize product is currently pre-release. What are the remaining steps before launching it?
    • What do you have planned for the future of the product and company?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw Transcript
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data capture. So Frank, can you start by introducing yourself?
Frank McSherry
0:01:51
Sure. I'm the chief scientist at Materialize, Incorporated. I used to be, I would say, much more of an academic: I did a lot of research on systems, computer systems, data processing things, going back 15 or 20 years, so a bit all over the place. I did a bit of work with differential privacy back when that was just beginning, and a lot of things related to data processing, big data processing, mostly understanding a bit about how computations move data around at large scale, and what they need to do this efficiently, what sort of support they need.
Tobias Macey
0:02:22
And do you remember how you first got involved in the area of data management?
Frank McSherry
0:02:26
Yeah, I mean, it goes back a ways. As a grad student doing computer science and whatnot, I was a theoretician originally, so proving theorems and doing math and stuff like that, and the work that I did related to large graph analysis, basically. We were trying to understand, back at the time, around 2000 or so, when PageRank was just barely a thing, and people were pulling up other sorts of analyses about what you do with really large graphs: the web at the time, but other sorts of social graphs too, things that emerged out of people's interactions rather than being planned ahead of time like a network topology. This led to a bunch of practical questions, like, someone gives you a billion edges, what would you actually do with that? How do you even make that work? At the time I was at Microsoft Research, and at the time it was places like Microsoft and Google, and potentially a few others, Alta Vista-type places, that were working at this scale, with very bespoke technology. As part of wanting to work on these sorts of problems, I got involved in projects, in particular at Microsoft with the Dryad project and the DryadLINQ project, which were these cool DSLs in C# that tricked people into writing declarative programs, in a way that you see now in Spark and stuff like that. That work drew, I think, most people, and me for sure, to realize that a lot of the programming idioms that we use to work with large data really render down pretty nicely to data-parallel compute patterns that you might ask a computer to perform. You could manually control where all the bits and bytes go, but instead you work through an API that is much more pleasant and more declarative, lifting the abstractions up to something that looks like SQL, but a bit more of a "let me explain what I want to be true about the output" rather than exactly how to get to it. That project burbled around for a little while and did a lot of really cool things, and from my point of view it led to the Naiad project, which was the next thing that I did, where I got a bit more directly involved in building systems for data manipulation and data management. A lot of what I do, I would say, is more about computing on data rather than managing it per se; other people work a lot harder on data management than I do. But hopefully that gives some context: the past fifteen years, up to about five years ago, was really about getting a bit of experience with what people might need when they want to shuffle around gigabytes or terabytes of data.
Tobias Macey
0:04:58
Yeah, that definitely gives a good amount of context for where you're ending up now with the Materialize project. So I'm wondering if you can describe a bit about what that is and some of the problems that you're aiming to solve by building it.
Frank McSherry
0:05:09
Yeah, that's perfect. So Materialize is, in some ways, a natural extension of some of the research that we did at Microsoft, and also some work in between at various academic institutions. We took what we learned, when we were putting on our science hats, about how to incrementally maintain computations that people specified, which was great; we felt very smart when we did it. But it wasn't super accessible to your average person who, in principle, just wants to be able to type some SQL, press enter, and see the result. And in principle, if they press up and enter over and over again, asking the same question repeatedly, maybe that should be really fast. A lot of what Materialize is going after is a refinement of the experience on top of this underlying technology. We're borrowing very heavily from this work on incrementally maintaining computations, SQL-style computations but others more broadly, and trying to refine it to the point that you can grab a Materialize binary and turn it on, and it looks like a standard database experience: you use something like psql to log into it, you get to look at all sorts of relations, you get to ask queries just like you normally would, using standard SQL constructs. And it has some really nice performance properties: if you ask the same questions over and over again, if you create views and create indexes, you can start seeing results for what appear to be very complicated questions very quickly, if we happen to be sitting on the appropriate data assets already. So the deliberate experience is meant to be a lot like a database that just happens to be fast when you re-ask questions you've asked before. One of the problems we're trying to solve is that a lot of the workloads for people using databases at the moment ask similar questions, or similar classes of questions, over and over. A classic example, though not the only one, is dashboarding, where dashboards are repeatedly hitting your database asking the same classes of questions: show me stats on how many engagements we had in the past however many minutes or hours. The only thing limiting the freshness of that data is the database's ability to handle however many tens or hundreds of these queries that are repeatedly being asked, because everyone's got a dashboard open. If we can make that incredibly fast, then all these things can just be updated in real time, real time in milliseconds, not real time like minute by minute. So it's about making these experiences more pleasant: ones where data are fresh down to the second, and people's brains are thinking up new questions to ask second by second, removing the speed bumps or roadblocks in either of those two dimensions. Either it takes a while for data to show up and actually get into the system, or, possibly with some of the other frameworks out there, it takes a while to turn a new query into an actual running computation. We can remove both of those and make some workloads immediately much, much better, and my guess is that it opens the door for a lot of new classes of workloads that people wouldn't really have attempted before because they were just so painful. But that's a little bit speculative.
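To make the workflow described here concrete, the following is a hypothetical sketch of the kind of session Frank is describing: connect with an ordinary Postgres client, define a view over continually changing data, and re-ask the same question cheaply. The relation and column names are illustrative, and the exact syntax may differ from what Materialize ships.

```sql
-- Connect with a standard client, e.g.:  psql -h localhost -p 6875 materialize
-- (host, port, and database name are illustrative)

-- Define the question once; Materialize keeps the view up to date as the
-- underlying data change.
CREATE MATERIALIZED VIEW engagements_by_page AS
SELECT page_id, COUNT(*) AS views
FROM page_views
GROUP BY page_id;

-- A dashboard can re-ask this question over and over; each read returns the
-- already-maintained answer instead of re-running the full query.
SELECT * FROM engagements_by_page ORDER BY views DESC LIMIT 10;
```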
Tobias Macey
0:08:06
And in terms of how it fits into the overall lifecycle and workflow of data, I'm wondering if you can give an overview of a typical architecture: where the data is coming from, how it gets loaded into Materialize, and where it sits on the axis from the transactional workload, where data is going into the database, to the analytical workload, where it may be in a data lake or a data warehouse, or any of the other surrounding ecosystem that might tie into or feed the Materialize platform.
Frank McSherry
0:08:40
That's a great question. There are definitely several different stories you can go with here, and I'll start with probably the simplest one, the one that's most natural and least invasive or disruptive. One of the ways to think about Materialize is that it's very much an analytic tool; Materialize itself is not, in the first instance, acting as a source of truth. You have your transactional database that's holding on to all the information about your sales or customers or products or whatever events are going on in your system, and Materialize attaches to the output of that database. Your database is typically going to be producing, if it's MySQL, something like a binlog, some change data capture coming out of it. There are other tools, not ours, that will transform this exhaust from the database into a more standard format and drop it into Kafka; specifically, Debezium does a lot of this. It attaches to various databases and produces consistently shaped change logs that land in Kafka, and then we just pick things up out of Kafka. So we're decoupled from the transactional processor. You can think of us in some ways as a bolt-on analytics accelerator: we see what happens in your transactional database, and we give you fast read-only access to views over this data. In some ways we're like a read replica, or maybe a bit like one of those; we don't speak the actual replication protocol at the moment. But you install Materialize just downstream from your source of truth, and we don't interfere with the source of truth in any way. That's roughly where it fits into the ecosystem. If you want to feed it with other sources of data, say you've got Kafka streams that are being populated by something that isn't a database, maybe some IoT streams of random measurements and observations, then probably with a little bit of tap dancing it's easy to adjust those into something we can consume. But we're very much, at the moment, looking at relational data models and the sorts of properties they have, things like primary keys and foreign keys, and the sorts of dataflows that come out of SQL-style queries. Does that roughly describe where it fits into the sorts of tools you were thinking of?
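As a rough illustration of the pipeline described here, database change log to Debezium to Kafka to Materialize, a source definition might look something like the sketch below. The broker address, topic name, and schema registry URL are made up, and the exact CREATE SOURCE options may differ by version.

```sql
-- Ingest a Debezium-formatted change log from a Kafka topic.
-- 'dbserver1.inventory.orders' is a hypothetical Debezium topic name.
CREATE SOURCE orders
FROM KAFKA BROKER 'kafka:9092' TOPIC 'dbserver1.inventory.orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

-- The topic now appears as a relation that can be queried and joined like
-- any other table.
CREATE MATERIALIZED VIEW order_counts AS
SELECT customer_id, COUNT(*) AS orders
FROM orders
GROUP BY customer_id;
```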
Tobias Macey
0:10:47
Yeah, so my understanding of it, from your description here and also from the presentation that you gave at Data Council, which I'll link in the show notes, is that in some ways it can be thought of as a better take on the read replica use case, where you can get some actual improvements in the overall capabilities and characteristics of your workflow, as well as being able to incorporate additional data sources, as you mentioned, rather than it just being an exact copy of what's in your transactional system, and it's accessed in a way that's not going to impact that system directly. So it seems like there are a lot of performance and capability gains, while at the same time being able to access the same transactional records without impacting the application that's actually feeding that data.
Frank McSherry
0:11:33
Right, that seems right. The way we think about it is that if you were going to build an analytic data processor, you wouldn't need to architect it the same way you'd build a transactional data processor. Of course, people have been building transactional data processors for a long time and there's a lot of expertise there, but if the only thing you need to do is support analytic queries, or streaming updates to analytic queries, you have the ability to use a totally different design. Indeed, this is what data warehousing tools do: totally different designs from relational databases. What we're doing, think of it as streaming data warehousing or something like that, analytic queries over continually changing data, uses a different architecture for this than a traditional database would, and we just drop it downstream of the traditional database. It's a totally different architecture, and it decouples the resources you might invest in the transactional side versus the analytic side. Materialize is a scale-up, scale-out sort of solution, where you can throw a bunch more resources at it if you have a large volume of queries, without needing to go and get a bigger database at the same time. And it does some cool things that traditional data warehouses don't do with respect to the frequency of updates, this continuous data-monitoring stuff.
Tobias Macey
0:12:47
And so this is a good opportunity for us to dig a bit deeper into the overall system architecture and maybe talk about how you're able to handle these analytical workloads and ingest the data stream in a near real time fashion for being able to continually update these queries.
Frank McSherry
0:13:04
Yeah. So Materialize has several components. At the high level of the architecture, there are several modules that you might not think of as part of the data processing plane, so I want to call those out ahead of time. There are components that mediate client access to the system, things that handle SQL sessions coming in and make sure that everyone's got the right state stashed in the right places. This is control plane stuff, but we'll get to the juicy data plane stuff in a moment. On the control plane, when people or tools connect in through SQL sessions, they have a little chat with the central coordinator for Materialize. This is where we figure out which relations exist, which schemas you want to see, what the columns are on a particular relation; all this metadata is managed outside of the scalable data plane so that we can do things like query planning, and the more interactive "tell me about the structure of the data" work, interactively. At that point the coordinator, with the help of the users who have actual questions, assembles queries, these big SQL queries that we expect to work on lots of data and to be subjected to continual changes in their inputs, and plans them out as a dataflow. If you're familiar with tools like Flink or Spark, these big data tools reimagine a lot of SQL-style queries as directed dataflow graphs, where nodes represent computation and edges represent the movement of data. This is exactly the same sort of take we have: we reimagine these queries as a dataflow, which is great for streaming, because what we really want to do in these streaming settings is just chill out and do nothing if no inputs have changed, and, as new changes arrive at various input collections, prompt computation only where we actually need to do work. We'd love to be told, essentially by the movement of the data itself, which operators, which bits of the query, actually need to be updated. If we've got a query that depends on five different relations, it's great to be told, well, this is the only one that's changed, your customer relation is the only one that's changed, so push the changes to that customer relation forward and stop as soon as you can. If a customer changed their address and that's not actually a field picked up by the query, we go quiet at that point and tell people the answers are correctly updated: nothing changed. So the coordinator thinks through how to plan your query as a dataflow computation, and it gets deployed to the data-parallel back end, which is where all the work happens, where you have, say, 32 concurrent workers, or hundreds of concurrent workers. It turns out the numbers get to be a fair bit smaller with Materialize and timely dataflow, because we've made some different architectural decisions from products like Spark and Flink that allow us to move a fair bit faster, so we don't normally need thousands and thousands of cores; we can do the same sorts of work with far fewer cores.

The dataflow is partitioned across each of these data-parallel workers. Think of the dataflow as the map for the computation: what needs to happen when changes show up, and where we need to direct them next. All the workers share the same map, and they all collaboratively do work for each of the operators. So if we've got a join somewhere in here, two streams of data coming in that we're supposed to join together, the work associated with that join gets sliced up and shared across all of our workers. If we have 32 workers, each worker performs roughly one thirty-second of the join; it's responsible for one thirty-second of the keys in that join. And I guess a really important architectural difference from prior work is that each worker in this system is running its own little scheduler. Each worker is able to handle each of the operators in the dataflow graph, and rather than spinning up threads for each operator and letting the operating system figure out what gets done, each worker bounces around between operators on sub-microsecond timescales to perform this work. This gives us a really big edge over other bits of technology in this space, other scale-out, data-parallel processing platforms, things like Spark and Flink. The main exciting thing going on in Materialize is the sharing of state between queries. I'm happy to dive into that, but I also just want to pause for a moment and make sure that I'm still on target for the question you had.
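As a small illustration of the kind of multi-input view being described (the relation and column names are hypothetical), consider a join maintained as a dataflow. The point is not the SQL itself but how it behaves: an update to one input pushes changes through only the operators that depend on it, and the state backing the join is partitioned by key across the timely dataflow workers.

```sql
-- A view over two inputs. Internally this becomes a dataflow graph: each
-- input feeds a join operator whose state is sharded by customer_id across
-- the workers.
CREATE MATERIALIZED VIEW customer_totals AS
SELECT c.customer_id, c.region, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region;

-- If only 'orders' receives new changes, only those changes flow through the
-- join and the aggregation; if a customer changes a column this view never
-- reads, nothing downstream needs to be recomputed.
```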
Tobias Macey
0:17:37
Yeah, that's definitely all useful. And as I was looking through, I noticed the timely dataflow repository that you have, and the documentation mentions that it was based on your prior work on Naiad, which you mentioned. I was also interested in the fact that you decided to implement it on top of Rust, where a lot of the existing and prior work for distributed computation, particularly in the data space, is based on either C/C++ or JVM languages. So I'm curious what you found as far as the existing ecosystem in Rust for being able to implement things like timely dataflow and some of the surrounding tools and libraries for building something like Materialize, and the overall benefits that you've been able to realize from using Rust as your implementation language.
Frank McSherry
0:18:27
Yeah, that's a great question, actually. The historical explanation for using Rust isn't nearly as compelling a story as you might think. About five years ago we were all at Microsoft, and Microsoft shut down the lab we were working in. At the time everything we'd written was in C#, and we were collectively thinking, maybe we should learn a new language and grow a little bit. I was probably going to go learn Go, and because my co-workers are very supportive people, they talked me out of it. In particular, there's a blog post called "Go is not good" that I think is still relevant now, where the author basically points out that it's not that Go is bad, but it lacks several good things that languages have been introducing, and they called out two languages they thought were good: Haskell and Rust. That led me to pick up Rust instead. These were early days, before Rust was 1.0, and things were still changing, but basically I think I got pretty lucky: Rust worked out pretty well. In particular, one of the main pain points people have with Rust is the lifetime and borrowing system, which I think is great, but it's really not that stressful in data processing frameworks. All these dataflow systems move responsibility for data around between workers, at which point there's just not very much borrowing going on, so the pain point a lot of people hit with Rust doesn't manifest nearly as much as it might in a different domain. And the rails that Rust puts on everything to make sure you don't screw things up have just been wonderful. One thing I tell people is that timely dataflow and differential dataflow all seem to work pretty well, they're actually used in production by some people, and I don't actually know how to use gdb or the LLDB debugger; it's just never come up. Rust has done a great job, in my experience, of removing the class of error where you pull your hair out not understanding why it's even doing that thing you didn't ask it to do. In my experience so far, with Rust, 100% of the errors are just sitting there right in front of you in the code. They're bugs, for sure, but they're delightfully debuggable, as opposed to someone somewhere else in the code doing a crazy thing that overwrote some of your memory, where you just can't figure anything out. The flip side of the question, which I think you were actually asking, is more about, given that Rust is such a new language, can you just pull supporting things off the shelf: adapters to different databases, or support for Protobuf and JSON, and stuff like that. What's there, what's missing? There wasn't nearly as much when we were just starting up; you look at whether there's a nice MySQL adapter, and there were a bunch of issues like that, but they're starting to sort themselves out. I would say there are still some gaps, though we haven't run hard into any of them; we've found stuff we can use for each of these problems, even if in some cases people would love to just grab a thing off the shelf.

So, as an example, we wrote some of our own query planning stuff, because the query planners we were familiar with, stuff like Calcite, just aren't available in Rust. If we were a Java shop, we would possibly have just borrowed that and skipped thinking about some of these things, and we would have been doing different sorts of work; we wouldn't have spent as much time thinking about query planning. The anecdote I tell people from Naiad is that getting the system up and running took just a few months, and from that point on it was a year and a half of performance engineering, working around the various vagaries of memory management systems that don't do exactly what you've asked of them. So you end up doing different sorts of things, and I think it was great. Locally, people here also think Rust is great, but there's some selection bias going on: several of the people who applied to work here did so because they're interested in Rust and wanted to take a swing at it.
Tobias Macey
0:22:31
And then in terms of the system architecture as it stands now, I'm curious how that compares to your initial design, and how it's been updated as you've dug deeper into the problem space and started working with other people to vet your ideas and do some alpha or beta testing, as the case may be.
Frank McSherry
0:22:52
It's a good question. Let's see, there have been a few evolutions, so let me mention a few of them; not all of them are really Materialize's. Pre-Materialize, you have things like timely and differential dataflow, which are essentially libraries you can link against: you write Rust code, link against them, and you get a fairly performant outcome. I thought this was great; it worked wonderfully for me, and I was totally fine with it. So I would say the main architectural change came in starting up Materialize and working with a bunch of people who came from real database backgrounds: understanding a different architecture that requires, for example, not having compilation on the critical path, and figuring out how to design things so that when a person types a query and presses enter, the result flows as quickly as possible into execution, without wandering through thirty seconds of figuring out the right pile of code to compile down into a binary and run. Architecturally, from that point on, we've had minor revisions where we've moved responsibilities for certain bits of thinking between different modules and broken things up to be slightly more modular, but I don't feel that we've had central architectural revisions. We've had a lot of features that we weren't initially thinking would be crucial: more and more people said, look, it's going to be important for you to not just pull data out of relational databases; people are always going to show up with plain text log files, so make sure you understand exactly what to do when someone points you at one of those. Right, so we've got to go figure out how to do column extractors from plain text. It's not fundamentally complicated, and lots of people have done these sorts of things before; we didn't need to change the architecture of the system, but we needed to be careful about exactly how we wanted to position it and what features we wanted the system to have. So I don't think there have been substantial architectural revisions. It's still very much what we think of as a fairly performant data processing plane, and the coordinator control plane that drives it. The exact words the coordinator uses to drive it have wobbled a little bit, and exactly what the coordinator does in terms of showing people SQL has changed at the same time, but there hasn't been any fundamental architectural change yet. And my guess, based on the customers we're interacting with, is that there are a lot of feature requests rather than fundamental redesigns in the near future; on the six-month timeframe we have a list of things to go and implement, as opposed to things to fundamentally rethink. At the same time, it's early days for us, and it's not the right time to do a redesign anyhow. Even if we were worried that something wasn't set up the way it should be, the right thing to do at the moment is to try it out on the domains where it works and see if we get traction there, as opposed to trying to be all things to all people.
Tobias Macey
0:25:36
And as you've mentioned a few times, the primary interface for Materialize is SQL, and I know that you have taken pains to ensure that you are fully compliant with the ANSI SQL standard for the SQL-92 variant. I'm curious what sorts of edge cases or pain points you've had to deal with in supporting all of that syntax, and possibly any extensions or built-in functions that you've added to fill the specific use case that Materialize is targeting.
Frank McSherry
0:26:08
Yeah, so there are a few, and they have different sorts of answers. Some things just require a bit more work. SQL-92 has correlated subqueries as a thing you can do, and of course we have to decorrelate those queries. This is especially painful in a dataflow setting, because you need to turn all of these queries into actual bits of dataflow. A lot of databases, when they think about building a dataflow to describe the course of execution, or exactly where they should pull data from, can always, at some level, bail out if a query gets complicated and just do a nested loop join that iterates through whatever tables you've put together. So there are nice escape hatches in standard relational databases that we don't really have access to, because we actually do need to turn every single query into something we can run as a dataflow. The standard papers you would read about decorrelating subqueries, some from the early 2000s, don't cover all the cases, and you have to track down some more recent work that came out of Munich about how to decorrelate everything, not just the most appealing sorts of queries. So that's somewhere we just have to do more work, and that's fine; we learned a lot by doing it, and it's good to have learned it, because now I understand better what's easy, what's hard, and what we can discourage people from doing. There are other cases where things are just weird. SQL has runtime exceptions, right? If you use a subquery in a scalar position, a subquery that's supposed to have only one row in its result, you can ask whether your row is equal to that row, and if that subquery actually has zero or multiple rows, you're supposed to throw a runtime exception, suspend query processing, and return control back to the user. That's not really a thing we can do in this incremental dataflow setting: we can't really suspend execution, because the dataflow is supposed to keep running. If we suspend execution, we're killing the dataflow, and that's disruptive. So we have to think about the right way to communicate back to the person that they did something like divide by zero or access an array out of bounds. It's quite possible that we might bend the spec a little bit, so that an out-of-bounds array access is not an exception but is NULL instead, which I think is what Postgres does, or if you divide by zero, maybe that should be a NULL rather than a runtime exception, and we keep the query processing going. We can do a great deal of analysis ahead of time to try to confirm that your denominators will be nonzero, in which case, if we can confirm that, this will never be NULL. But if you might just be dividing two arbitrary things you've produced, one by the other, we need to do something that won't take down the dataflow when you give us weird input, while hopefully not totally confusing people who might have expected to see a runtime exception, people who have been using their database to do data validation. If you have a misparsed integer, what should we do? Good question. So those are some examples where we might make a bit of a departure from the spec, just because the query processing idioms of SQL back when it was designed aren't the same as the view maintenance requirements that we have nowadays.
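The division-by-zero case Frank describes can also be handled defensively in the query itself using standard SQL, which sidesteps the question of what a long-running dataflow should do with a runtime exception. A small sketch, with hypothetical relation and column names:

```sql
-- In a one-shot query, a zero denominator raises a runtime error and the
-- database hands control back to the user. In a continuously maintained view
-- there is no session to hand control back to, so one option is to make the
-- degenerate case explicit: NULLIF returns NULL when quantity is 0, and the
-- division then yields NULL instead of an exception.
CREATE MATERIALIZED VIEW unit_prices AS
SELECT order_id,
       total_amount / NULLIF(quantity, 0) AS unit_price
FROM order_lines;
```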
Tobias Macey
0:28:55
Yeah, it's definitely interesting to see all of the different ways that people deal with SQL, because it's used so widely and so broadly, and there are so many different customizations that are necessary to fit a particular use case, or edge cases that don't make sense for the original context SQL was designed for, because of the ways that we have expanded the space of data processing and data analytics.
Frank McSherry
0:29:27
So I think, for the moment, we're definitely trying to do some things well and find the customers for that. SQL-92, you know, was quite a while ago; we're trying to do those things well and see how far we get with that, as opposed to racing ahead and bolting on all of the newest extensions and features. We're looking, for example, at adding JSON support at the moment, which we'll probably do because enough people are asking for it, but I personally very much want us to be very good at a specific set of things that we understand, where we can tell people: if you want this, we do this very well. That's as opposed to having 27 different features that are all sort of half implemented and may or may not work in some corner cases. We want the README page explaining what Materialize can do for you to have no asterisks on it that say, yeah, except this doesn't really work. A lot of other systems out there say, oh yeah, absolutely, you can join your data, but with asterisks: you can only use append-only streams, or you need to make sure the partitioning is correct, or you have to understand what a table is versus a stream. We don't want any of those. We want it to be dead simple to use for people whose problems fit into an admittedly restricted class of problems, and then grow out from there as we gain confidence that we are providing an experience that is predictable, unsurprising, and performant.
Tobias Macey
0:30:49
And then for people who are using Materialize, can you talk through what's involved in getting it set up, and talk through some of the lifecycle of the data as it flows from the source database or source data stream into Materialize, and how you manage the overall lifecycle of the data there to ensure that you don't just expand the storage to essentially duplicate what's already in the source, and that you keep only what's necessary to perform the queries that you care about?
Frank McSherry
0:31:20
Right, great question. So what happens at the moment, and I should say the answers are subject to change as we learn that people hate it or love it or want something slightly different, but the way the world works at the moment in Materialize is that we presume you have, let's say, MySQL or something, a well-accepted source-of-truth database. And at the same time, somewhere nearby, you have, let's say, Kafka: a place to put stream-ish data that is persisted and performant. There's a tool out there that we use and recommend people use at the moment called Debezium, which attaches to various databases, essentially like a read replica, or by reading the binlog; it has a few different strategies depending on the database. You turn this thing on, you point it at your database, and it starts emitting a topic for each of the relations that you named when you turned Debezium on. The topics that Debezium produces basically contain before-and-after statements about rows in those relations: this row used to be this before, now it is this afterwards, with a timestamp. Then you turn on Materialize, and in Materialize you start typing things like CREATE SOURCE ... FROM, and you name the topic there, or a pattern for a class of topics. Materialize opens up all these topics, starts reading through each of them, starts pulling in these changes, and presents each topic as a relation that you can query, and you can start mixing and matching all of these relations together in queries. Sorry, this only half answers the question, and I'm going to continue, but this is roughly what you need to do to get started with Materialize: you have a relational database holding on to your data, you transform it using a tool like Debezium into a change log with a particular structure in Kafka, and then when you're using Materialize you point it at the Kafka topics and it starts slurping them in for you.

In terms of how you avoid being an entire replica of the whole database, and in particular the full history of all the changes to the database: in Materialize you obviously have the ability to select subsets of the relations that you want to bring in, and you've also got the ability, as you bring them in, to filter those relations down, based either on predicates or projections, to just the data that you need. If it turns out you're only interested in analyzing customer data and sales data, and only five of the columns from them, you're more than welcome to slice those things down. And Materialize, through differential dataflow, has some compaction technology internally that makes sure we're not using any more footprint than the resident size of whatever relations you're sitting on. So although the history might go back days or weeks, we don't need to keep that days- or weeks-long history around unless you ask for it; we just give you the answers from now going forward. It's flexible, though: you could in principle say, please load the whole history in, at which point we look a bit more like a temporal data processor and we'll show you the full history of your query, going back as far as we have history. But it is definitely the case that if you have, say, a few gigs of data that you're planning on analyzing interactively, we will have a few gigs of data live in memory. If your data is ten terabytes and you want to do random access to it and play around with it, we're going to try to pull in ten terabytes of data, and we might need to tell you about cluster mode at that point, or try to give you some advice on paring down the records a little bit so that you don't have quite so much of a footprint. But Materialize is going to manage all of its own data: it's not going to return to the source database and dump any of the analytical workload back on it. We're mirroring the data so that we can handle all of our workloads without either interfering with things upstream or finding ourselves caught off guard, not actually having the data we need indexed correctly. We'll see how that works out; at the moment it has been fine. A lot of people who talk to us, when they actually tell us what they need interactive access to, it's a surprisingly small volume of data compared to everything they've kept in their source of truth, or everything they've ever dumped into their data lake; it's a much smaller subset that they're actually working over.
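The footprint-trimming described here, projecting and filtering the incoming relations down to just what the analysis needs, looks roughly like the following in SQL. The columns and predicate are hypothetical.

```sql
-- Keep only the columns and rows the analysis actually needs, so the
-- maintained state is bounded by the resident size of this slice rather than
-- by the full upstream relation or its entire change history.
CREATE MATERIALIZED VIEW recent_us_sales AS
SELECT sale_id, customer_id, amount, sale_date
FROM sales
WHERE region = 'US';
```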
Tobias Macey
0:35:23
And as far as the scaling and storage of data within Materialize, you mentioned needing to keep the data in memory for being able to run these analyses, and I'm curious what the strategy is as far as spilling to disk when you exceed the bounds of memory, and what the scaling strategies and scaling axes are for Materialize. And then in the process of describing that, maybe talk about where it falls in terms of the CAP theorem.
Frank McSherry
0:35:58
Yeah, good question. With respect to the CAP theorem, it sacrifices availability; that's easy. All the timely dataflow stuff, going back a ways, sacrifices availability; it's all fail-stop. If something goes wrong, you lose access to some of the workers or something like that, we stop giving answers, basically, because we can no longer confirm they're correct. Materialize is definitely not aimed where CAP is most often applied, at geo-distributed systems that are likely to suffer partitions, where you have to decide what to do in those cases. Naiad and timely dataflow and Materialize are definitely thinking of deployments that are much more tightly coupled, ideally a single large multi-core system, or a rack, or something like that, where if you experience a network partition, that's very surprising. It can happen, but this is why we choose to sacrifice availability rather than consistency. In terms of memory and spilling: all the internal data structures, although they're in memory, are log-structured merge-tree type things, a bunch of big slabs of allocations. So quite naturally, if you run it on an operating system that doesn't have an OOM killer, it just pages out to virtual memory and it's totally fine. One can also manually drop each of these slabs of the log-structured merge tree onto disk and remember to bring them back in; that's super easy too, if you'd rather they literally be on disk and memory-mapped than be in memory and paged out. There's a bit of work we still have to do, and are planning on doing, on Materialize's own persistence story, and it's almost certainly going to involve taking these slabs of log-structured merge-tree data and shipping them off to S3 and bringing them back when appropriate. But yeah, unlike a lot of the JVM-style systems that have big hashmaps and shoot themselves in the head when they hit 64 gigs, these systems are super happy to just use the native mechanisms provided by the operating system, and performance doesn't seem to be particularly impacted.
Tobias Macey
0:37:59
And Materialize itself is a business, with the Materialize product as your initial offering. So I'm curious if you can talk through your motivations for forming a company around this and some of your overall business strategy.
Frank McSherry
0:38:14
Yeah, sure. So the motivation is super simple. I'm not going to source this correctly, but there's an expression: if you want to go quickly, you go by yourself, and if you want to go far, you go with your friends. Arjun, the co-founder, didn't use those exact words, but he pointed out that, well, me hacking on timely dataflow was super interesting, but if we actually wanted to see whether this had legs and could go anywhere, it was going to need to involve other people: people to write adapters, write documentation, exercise parts of the system that I had no idea how to work on, the SQL components and stuff like that. And a company is the right mechanism to bring together a bunch of people and make sure that they get paid and are looked after, so that they can actually build something larger than a research platform that's fun for writing blog posts. So that's part of the motivation for the company. There are a few different dimensions here, but part of the motivation was: if you want to do something interesting with all of this work on timely dataflow and whatnot, you need to bring in some people and make sure they get paid, and that's going to involve building a thing that other people might want and making sure we get paid for doing it. There are other motivations. Arjun's motivation, I think, is much more that this is absolutely an unmined area: the work on Naiad and differential dataflow is great, from his point of view, and under-capitalized upon. So why not try to take this and show people what it can do, and at the same time make a lot of people in the enterprise infrastructure space really happy, collect paychecks from them, and just build something really cool and impressive? That's the motivation for the formation of the company, which isn't quite the same thing as the business model. My understanding is that enterprise tech has a very honest, healthy business model, where the companies you're building technology for are relatively well funded, and if you actually do something valuable for them, they will pay you; and if you haven't done something particularly valuable for them, you probably won't get paid. This is not nearly as bad as consumer-facing technology, where you have to hope to ride some zeitgeist of excitement. If you show up and you make a large Fortune 2000 company much bigger and much better, they're delighted, and we just need to make sure that we actually do that for enough people.
Tobias Macey
0:40:09
And the state of the product is currently pre-release, and you have a signup option on your landing page for being able to receive news. I'm curious, what are some of the remaining steps before you're ready for a general launch?
Frank McSherry
0:40:42
Yeah, so we're basically at the moment spending another month or two sanding off some rough edges. The plan is absolutely, in a few months' time, to throw this out there and make it publicly available, so people should be able to grab the source code or a binary and use it on their laptop, just to try it out and see: do I like the look and feel of this sort of thing, might I want to tell my boss that this is really cool and we should get some of it? The steps, and we have a roadmap for this in roughly the two-month timeframe, are mostly making sure various known issues get sorted out: what is the look and feel of loading up CSV files from your file system, as opposed to change logs from Kafka? Does that actually work properly, and are people delighted by how it works? There's also some integration with existing BI tools. If everyone had to just type raw SQL, that's fine, a certain class of person likes that, but other classes of people like the look and feel of Tableau and Looker a bit more, and there are some open source BI tools that we're making sure we're compatible with, so that people can point them at Materialize and get a bit more of an interactive query-building experience, rather than having to build their queries in some large text buffer and swap them in. So there are some finishing touches, issues like that. It's currently in a state where, if we handed it to people, they could try it out, as long as they understood how it works; there's just a little bit of expectation management. If we hand it out to people and they go and try it and realize it doesn't exactly do what they need, well, we probably should have fixed that beforehand. So we're doing a bit of that over the next month or two.
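For the CSV-from-the-filesystem case mentioned above, a source definition might look something like this sketch; the path, column count, and exact options are illustrative, and the syntax may have changed since this pre-release period.

```sql
-- Read a local CSV file and expose it as a relation with three columns.
CREATE SOURCE measurements
FROM FILE '/tmp/measurements.csv'
FORMAT CSV WITH 3 COLUMNS;

SELECT * FROM measurements LIMIT 10;
```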
Tobias Macey
0:42:17
Are there any other aspects of your work on Materialize, or the underlying libraries, or the overall space of building a real-time analytics engine on streaming data, that we didn't discuss yet that you'd like to cover before we close out the show?
Frank McSherry
0:42:32
Oh, there's definitely other really cool stuff, for sure. I don't want to force anything on people, but there's a lot of thinking one needs to go through in understanding the semantics of streaming data. What Materialize does, for example, is incremental view maintenance: it takes a SQL query and maintains it as your data change. That doesn't cover the full space of everything a person might possibly want to do with streaming data. So there's a hurdle that I'm constantly anxious about: how many of the things that people actually want to do, of which there are unbounded numbers, actually look like incremental view maintenance of SQL? In many cases, people have something totally different in mind, and you tell them, oh, if only you had said that slightly differently, we could totally be a perfect fit for you; is there any negotiation available here? So there's occasionally a bit of getting people to think through what they really need to see out of their streaming data, which has been pretty interesting. The big shift is getting people to move from thinking about point-in-time queries to something they're actually going to want to see change as time moves on; that's probably the hardest part. Getting other people to change their brains is really hard, possibly not going to happen; we'll have to see. But that's one of the main things I'm most worried about, in terms of whether we're going to click with a lot of people: can we get them to understand the sorts of things that we're good at doing? Do their needs line up with what we're capable of doing? We think so; we think there are enough people there. But this is a conversation we have a lot, where the focus is, well, let me think for a second and get my head around what it is that you actually do. I think that's probably going to continue to be the case for streaming SQL for a while. What computers are actually able to do here is pretty interesting; how do we wrap that up in a language as pleasant as SQL, an easy way to think about what I want to have happen to my streaming data as my data change? There's an unbounded amount of other stuff to talk about too, but let's throw that out as the main one that I think about.
Tobias Macey
0:44:29
Well, for anybody who does want to get in touch with you and follow along with the work that you're doing, and maybe have some follow-on conversations to the topics we discussed today, I'll have you add your preferred contact information to the show notes. And as a final question, I'm interested in getting your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Frank McSherry
0:44:49
Oh, yeah, so that's tricky. I for sure have a biased take on this, just based on what I've been doing for the past few years. I will say that a lot of what we've done in the past year at Materialize, and before that, has really been fighting with a lot of impedance mismatch between different tools and technologies. For example, we're using Kafka to move our data around, and Kafka just doesn't quite provide the right information about the state of the streams. That's fine, you can work around it, but a lot of the tooling and tech I mentioned involves a lot of working around things, even though in many of these cases we sort of know what the right answer is; not everyone got on board with things at the right time. Hadoop took off in the early days, for example, because it was so easy, and even once Hadoop took off, people already knew better ways to be doing things, but it was first out of the gate, so people just piled onto it. There's a lot of other tech like that. So for me, the thing I struggle with is figuring out how to, if possible, roll back some of the early tech that was first out of the gate but missed important design elements, and swap in things built once we figured out how to do things a bit better. It's hard to swap more performant, better-architected tech in for things that missed important pieces but are still super valuable for people. I feel like if that were resolved in a pleasant way, where there was a lot more off-the-shelf tooling that you could just take down and slot in for a class of tools, like if all databases used the same protocol to talk to each other and had the same interpretation of SQL, our lives would be a lot easier. They don't, right? They all have quirky decisions that each of them made differently, and being compliant with all of them takes a lot of time that we could spend doing more cool things. I'm sure that's not the biggest issue that tech faces, but it's the one where, if you look at the dent on my forehead from what I've been banging my head against, it has that shape.
Tobias Macey
0:46:42
All right. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Materialize. It's definitely an interesting project, and one that fits a need in between the transactional engines and a lot of the analytical engines that we've got, as far as being able to keep real-time information for people who are looking to do fast iterative queries or keep tabs on the current state of their data. So thank you for all of your work on that front, and I hope you enjoy the rest of your day.
Frank McSherry
0:47:09
Thanks so much. I appreciate the time to talk and all the questions have been very thoughtful. Appreciate it.
Tobias Macey
0:47:20
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!