Build More Reliable Distributed Systems By Breaking Them With Jepsen - Episode 143

Summary

A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


The need for quality customer data across the entire organization has never been greater, but data engineers on the ground know that building a scalable pipeline and securely distributing that data across the company is no small task. Customer data platforms can help automate the process, but most have broken MTU pricing models or cloud security issues. RudderStack is the answer to those customer data infrastructure problems. Our open source data capture and routing solutions help you turn your own warehouse into a secure customer data platform and our fixed-fee pricing means you never have to worry about unexpected costs from spikes in volume. Visit our website to request a demo and get one free month of access to the hosted platform along with a free t-shirt.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the Jepsen project is?
    • What was your inspiration for starting the project?
  • What other methods are available for evaluating and stress testing distributed systems?
  • What are some of the common misconceptions or misunderstanding of distributed systems guarantees and how they impact real world usage of things like databases?
  • How do you approach the design of a test suite for a new distributed system?
    • What is your heuristic for determining the completeness of your test suite?
  • What are some of the common challenges of setting up a representative deployment for testing?
  • Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
  • How is Jepsen implemented?
    • How has the design evolved since you first began working on it?
    • What are the pros and cons of using Clojure for building Jepsen?
    • If you were to start over today on the Jepsen framework what would you do differently?
  • What are some of the most common failure modes that you have identified in the platforms that you have tested?
  • What have you found to be the most difficult to resolve distributed systems bugs?
  • What are some of the interesting developments in distributed systems design that you are keeping an eye on?
  • How do you perceive the impact that Jepsen has had on modern distributed systems products?
  • What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
  • What do you have planned for the future of the Jepsen framework?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Raw transcript:
Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. If you've been exploring scalable, cost-effective, and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open source foundation, fixed pricing, and unlimited volume, they are enterprise ready but accessible to everyone. Go to dataengineeringpodcast.com/rudder today to request a demo and get one free month of access to the hosted platform along with a free t-shirt. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems. So Kyle, can you start by introducing yourself?
Kyle Kingsbury
0:02:24
Oh, my name is Kyle Kingsbury. I work on Jepsen, which is a project to test distributed systems.
Tobias Macey
0:02:29
And do you remember how you first got involved in the area of data management?
Kyle Kingsbury
0:02:32
Out of college I couldn't get into grad school, this happened during the financial crash in 2009, and I was working at a startup in San Francisco doing distributed database work there, and one thing led to another.
Tobias Macey
0:02:42
And so can you dig a bit more into what the Jepsen project is, and some of the backstory of how you came to create it and your original inspiration or motivation for digging into this particular area?
Kyle Kingsbury
0:02:56
Yeah, so Jepsen is a project to help make distributed systems safer by testing them and seeing what they do under real world failure modes. Back around 2010, distributed systems were becoming popular in industry, but a lot of folks had sort of optimistic expectations about what was possible, and there were some claims flying around that seemed a little bit irresponsible. There was a good deal of argument over this back and forth on Twitter and blogs and mailing lists; I was active on the Riak mailing list for quite some time. And eventually I started Jepsen to demonstrate that these problems were actually things that could happen in the real world, that they weren't just theoretical issues.
Tobias Macey
0:03:39
Yeah, I know that particularly when you are just testing and building a system, it's easy to see, oh, everything works great, and kind of push off the fact that there are going to be network faults, and there are going to be clock skews and clock drifts that cause things to not operate the way that you want them to. And I'm curious what you have seen as far as some of the common misconceptions or misunderstandings of people who are both building and using distributed systems, and the types of guarantees that they provide.
Kyle Kingsbury
0:04:10
Yeah, back when this all got started, there were all sorts of interesting expectations. I remember MongoDB, for example, wrote something about how users shouldn't expect to preserve data when a process crashes or is killed with kill -9, and as a result, I think, sort of ended up throwing some sort of cocktail party called "kill -9". Meanwhile, Riak was busy falling over anytime you tried to list keys in the database. So there was a lot of rough-and-tumble action in those days. Nowadays I think people take partitions, crashes, that sort of thing more seriously. I see a lot of folks at least try to do some degree of testing for those. And yet, when I do consulting work, often I'll find databases where it looks like nobody tried to verify behavior during faults.
Tobias Macey
0:04:56
And aside from Jepsen, what are some of the other methods that people typically use for stress testing or evaluating the guarantees of these systems?
Kyle Kingsbury
0:05:06
I look at Jepsen, because it's an experimental approach, as being one part of a complete-breakfast continuum of quality methods. One of those is to start with a proven algorithm, some sort of abstract model of your system, and either have a hand- or machine-checked proof of it, or at least run it through a model checker, maybe with TLA+. And if you are lucky, you can do some sort of code extraction, with Coq, or with HOL in Isabelle. But that's not, I think, how most production systems get built. Most systems get built by pointing at a whiteboard and writing a bunch of code, and then your tools for safety analysis tend to be more along the lines of unit tests and integration tests. Jepsen takes a black-box approach, which puts it sort of at the far end of integration testing; other systems, like Cassandra's dtest suite, take a similar approach. You'll see people who do sort of manual testing, where they do fault injection via somebody just running commands on the cluster, or toggling some sort of hardware power switch or network switch; that's the approach that's taken by FoundationDB. FoundationDB also does a lot of automated simulation testing, where they will run a virtualized network with many simulated nodes in the same process, and they can intercept messages and reorder them to get interesting results. So I think there are a lot of approaches people take to testing systems. Jepsen tries to be general and to work across many types of systems, which often means that it's not as deep or as capable of exploring the state space as, say, a simulation test that's written specifically for the database you're working on might be.
Tobias Macey
0:06:46
And then as far as some of the real world usage of things like databases and other mission critical distributed systems, what are some of the real world impacts that can occur as a result of these failures that aren't properly vetted or properly guarded against?
Kyle Kingsbury
0:07:05
Oh, you can get all kinds of faults. Obviously, data could be lost, so you write something and then it just disappears later. You can have different nodes that compete on some version of the history, so maybe one node thinks that the list you're adding things to is one, two, three, and another thinks it's four, five, six. You can have temporal anomalies where you read some information from the past instead of the present. You could have transactional violations where two transactions kind of land on top of each other or interleave in some weird way. And all the systems that I've tested exhibit typically one or more of these behaviors; the question is sort of how hard you have to push them to get there.
Tobias Macey
0:07:40
And then in terms of the actual testing approach of using Jepsen to find some of these error modes that exist in a particular system, what is your overall approach for defining the test suite and customizing it for a given system? Because all of these different platforms will have different guarantees that they're trying to provide, or different approaches to linearizability or serializability, or the strictness of the isolation that they provide, and I know that that provides a fairly extensive surface area to try and cover in any test suite. I'm wondering what your approach is for determining what the key points to interact with are, and then also any variability in the actual interfaces that are exposed by these systems for being able to interact with them and validate the correctness of the operations that are being performed.
Kyle Kingsbury
0:08:38
Yeah, you're quite right. There are sort of two huge surface areas to cover. One of them is the breadth of the API, all the different things databases do. Modern SQL databases have so many functions and so many interacting components that it's really difficult to predict how they're going to interact. It also is tricky because you get query optimizers into the mix: depending on something like the size of your data, you might have two very different code paths involved, with different safety properties. So I don't have a good answer for that; I write as many tests as I can in the time that I have, but that's usually just a tiny fraction of the overall API. My guiding principle is to look for operations which are general and which will be sort of the most difficult for the system to do safely. As an example, there is a property called the consensus number, which sort of tells you how difficult it is to achieve a certain operation in an atomic way using consensus. So when you have a database which offers, say, reads and writes over a register, or offers compare-and-set operations, it turns out that the compare-and-set operation is actually more difficult to provide safely than just reads and writes over registers, and so I'll test compare-and-set to try and stress the system more. If I have an operation that works like appending something to a set or to a list, well, the set is order-independent, whereas the list retains order, so I'll try to do something on a list, because it will expose more possible histories in which the outcome can discriminate between correct and incorrect executions, if that makes any sense.
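To make the set-versus-list point concrete, here is a minimal sketch in Python (not Jepsen's actual Clojure API; the operation and history shapes are invented for illustration) of how a list-append test can reject orderings that a set test cannot see:

    # Illustrative sketch: why list-append histories discriminate more than set-add.
    # The record shapes here are hypothetical, not Jepsen's history format.

    def check_set(acknowledged_adds, final_read):
        """A set test can only check membership: every acknowledged add must appear."""
        return set(acknowledged_adds) <= set(final_read)

    def check_list_append(per_client_appends, final_read):
        """A list test also checks order: each client's own appends must appear in
        the final list in the order that client issued them, which rules out many
        more incorrect executions."""
        all_appends = [x for xs in per_client_appends.values() for x in xs]
        if not check_set(all_appends, final_read):
            return False
        position = {x: i for i, x in enumerate(final_read)}
        for appends in per_client_appends.values():
            indices = [position[x] for x in appends]
            if indices != sorted(indices):
                return False  # one client's own appends were reordered
        return True

    # Example: client "a" appended 1 then 3, client "b" appended 2.
    appends = {"a": [1, 3], "b": [2]}
    print(check_list_append(appends, [1, 2, 3]))  # True: consistent with an interleaving
    print(check_list_append(appends, [3, 2, 1]))  # False: 3 cannot precede 1
    print(check_set([1, 2, 3], [3, 2, 1]))        # True: a set test cannot see the problem

The set checker only verifies membership, while the list checker also verifies that each client's own appends appear in the order they were issued, which is why ordered workloads can distinguish many more correct from incorrect executions.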
Tobias Macey
0:10:20
Yeah, that definitely makes good sense, because as you said, the ordering is explicit in those types of data structures. And when you have these multiple concurrent clients that are interacting, and you're dealing with a system that is trying to provide maybe snapshot isolation, you want to ensure that the write and then the subsequent read produce the expected result, given the point in time and the other clients that you're connecting with. That also leads me to wonder how you synchronize the state of those multiple clients to be able to determine what the correct output should be at the end of a test suite, to be able to say: we're writing all of these values, it's being run in this multi-threaded fashion, and when everything is said and done, this is what the actual output should be, assuming that everything is running properly. And then also, particularly for eventually consistent systems, identifying how they're resolving the potential conflicts that are being created by those multiple clients.
Kyle Kingsbury
0:11:22
Yes, obviously real world problems with databases are typically concurrent problems, and Jepsen tests concurrently. So we have to be able to understand exactly what the concurrency structure was of all the clients that are interacting with the system. To do that, Jepsen puts them all in the same VM, the same JVM process, which means that we have a local, sequentially consistent store that we can use to record the actions of each client, and that tells us what began first. But of course there's still concurrency there, even though we have tight bounds on it: two transactions are likely running at the same time. And that means that when you're trying to verify whether the system seems correct or not, you have to consider both possible orders, one transaction executed first or the other transaction executed first, and writing efficient checkers for those permutations of operations turns out to be really tricky. Eventually consistent systems, like you mentioned, actually have this nice property, typically, that they should commute, that lots of different orders are legal, any possible ordering should be okay. So when you test eventually consistent systems it's often a little bit easier to design the test, because all you need to do is apply the operations in one order and they should be equivalent to any other order, at least for something like a CRDT. If it's not, well, then you have a combinatorial explosion again.
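As a rough illustration of the "consider both possible orders" problem, here is a tiny, hypothetical brute-force check for a single register: it enumerates every ordering of the recorded operations that respects each client's own sequence and asks whether any of them explains the observed reads. Real Jepsen checkers are far more sophisticated and also account for invocation and completion times; the record shapes below are assumptions for illustration only.

    from itertools import permutations

    # Each op: (client, kind, value). Ops from the same client happened in list order;
    # ops from different clients may be concurrent. A toy model, not Jepsen's.
    ops = [
        ("a", "write", 1),
        ("b", "write", 2),
        ("a", "read", 2),  # client a's read observed 2
    ]

    def respects_client_order(order):
        clients = {c for c, _, _ in ops}
        return all([op for op in order if op[0] == c] == [op for op in ops if op[0] == c]
                   for c in clients)

    def explains_reads(order):
        value = None
        for _, kind, v in order:
            if kind == "write":
                value = v
            elif value != v:  # a read must return the latest write in this order
                return False
        return True

    candidates = [p for p in permutations(ops) if respects_client_order(p)]
    print(any(explains_reads(o) for o in candidates))  # True: some order explains the reads

Even in this toy form the number of candidate orderings grows combinatorially with concurrency, which is why writing efficient checkers is hard.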
Tobias Macey
0:12:37
And given the massive surface area of the problem domain that you're trying to validate, what is your heuristic for determining the overall completeness of your test suite or determining a useful definition of done for the purposes of a particular engagement?
Kyle Kingsbury
0:12:54
Right, right. It's such a huge problem, and the fact that any of this works at all is sort of astonishing. One of the things that I use to guide my work is the idea of a falsifiable result: I want to design a test which yields a failing result before it passes. And I should be able to turn up and down a concurrency knob, or turn on and off some feature in the database. Say I enable the linearizable mode, and it should pass linearizability tests; if I disable it, it should fail those tests. If it still passes with linearizability disabled, I know the test wasn't powerful enough to find some fault. So that's one notion of completeness. Another is just looking at the API and kind of staring at the algorithm and trying to guess all the ways in which it could go wrong, and then designing faults for those. But ultimately the space of faults is very large, and we're never going to be able to cover all of it meaningfully. The only reason this works is because the state spaces are often degenerate, and many different permutations of failures yield similar outcomes. Another metric I use to guide my work is that I watch the logs and I look at visualizations of the history very carefully, and I try to intuit whether the behaviors that I have introduced to the system, the network partitions, the process crashes, whether those faults created some sort of phase change in the system. Do I see a characteristic shift in latency? Do I see a transition between returning linearizable and non-linearizable results, even if both are legal? I want to make sure that those transitions occur, because that tells me my test is actually measuring some sort of event in the system, like a failover, or a promotion of a new node, or some sort of reconciliation process. If I don't see that happen, I have a hint that my system isn't being tested rigorously enough.
Tobias Macey
0:14:42
And then another element of complexity in dealing with testing these types of systems is being able to actually get them deployed in some representative distribution: making sure that there are multiple instances running, that they're all communicating properly and clustered together, and then, particularly for systems that are intended to be operated at global scale, maybe deploying them to different geographical regions. I'm curious what your experience has been as far as being able to do this in a reasonable fashion without having to do everything manually, particularly given the different ways of deployment and the different levels of maturity in terms of the operability of the systems and the configuration and availability and things like that.
Kyle Kingsbury
0:15:26
Yeah. Luckily, most computer systems are meant to be deployed on Unix computers over IP, which is good, because that's how Jepsen works. But as modern deployments show up, like serverless things or hosted services, those are much harder for me to test because I don't have control over their infrastructure. The good news is that when something is meant to be installed, I can typically write some sort of automation for it, like downloading a package, installing it, and running some shell commands to configure it. All of that is automated by Jepsen, so Jepsen has a whole deploy system built into it, which pushes out different binaries, runs shell commands, uploads files, and gets the cluster up and running. That automation is the first part of any test process.
Tobias Macey
0:16:12
And because the primary focus of Jepsen is on the distributed systems guarantees, another level of complexity in getting the system set up properly is dealing with things like encryption and authentication and authorization. I'm assuming that you just completely leave that to the side so that you can focus on the durability and serializability guarantees of the system.
Kyle Kingsbury
0:16:34
Yeah, I typically ignore those because they're often orthogonal. It's typically the case that there's an authentication layer at the top end of the database, when you first connect, and once you're past that, internally it's just the same function calls that we get either way. That's not always true; some systems, like cryptocurrencies, will have a sort of signature verification baked into the very core of the consensus protocol, and there it makes sense to test in a sort of adversarial manner. But generally with Jepsen you look for getting the most bang for your buck: you look for the key couple of functions of the basic data model which will give you the best insight into how the concurrency control mechanism works as a whole, and then hopefully that generalizes. So you might miss data-type-specific or authentication-specific problems, but those things can be incrementally added later, if you have time to add those tests.
Tobias Macey
0:17:24
And then for the actual process of building the deployment and running the overall workflow of setting up the test suite, executing it and then evaluating the output, can you just talk through that whole process and some of the challenges that you face particularly at the end in terms of interpreting the results and being able to provide useful feedback to the people who are building these systems and engaging with you to find errors that they need to correct?
Kyle Kingsbury
0:17:53
Yeah, my overall process: when I quote clients, I typically say there's a four week minimum to get a report out the door, and that gives me enough time to get the system installed, which usually takes the first couple of days to a week to get it installed reliably. It turns out a lot of systems, when you install them, don't necessarily set up reliably the first time. Then there's a lot of documentation reading; that usually takes a couple of days, and I try to understand the model of the system, what kind of API calls are involved, what kind of invariants ought to be measured. Then I will design the workload, and the client that interprets that workload and actually goes and makes network calls and gets results and understands them. Once that's done, I'll start introducing faults, and that is usually the process of weeks two and three: adding different faults and running the tests. The actual test workflow is: you run a shell command, Jepsen starts up, it installs all the stuff on the nodes, starts making requests, introduces faults. Once it's recorded a history for a few minutes or a few hours, it will analyze that history and spit out some files and some textual output. So I'll read those explanations of what happened, I'll look at the graphs it produces, I'll look at the charts and try to interpret those, and tell the people who built the database what I found.
Tobias Macey
0:19:11
And in terms of Jepsen itself, how is it implemented, and how has that overall design evolved and changed since you first began working on it and as you iterate on all these different systems?
Kyle Kingsbury
0:19:22
Originally, Jepsen was built quite quickly, just to demonstrate a couple of bugs in one or two particular systems. There was an early prototype in Erlang which was actually intended to be a proxy that would sit between all of your nodes, disassemble their network flows, and cut them together in new orders. That was a lovely idea, but it turns out that writing protocol-level decoders for every database is really tricky, and it was much more interesting and effective to just focus on running systems sort of in their native environment. So round two of Jepsen was a JVM program which would run some basic workloads and requests against the servers concurrently, and then I did the automation of all the setup and the fault injection separately, using a deploy tool I wrote called Salticid. So I would be manually sitting at the keyboard, triggering a fault and watching what happened. Round three of Jepsen, which I think started around 2014, 2015, brought those things together and said the operating system and database deployment tools should be an integral part of the test. So when you're writing a Jepsen test now, you define: what kind of operating system do you want to use? How do you set up your database? What kind of schedule do you want to perform? How should that schedule be applied to the database and the results recorded? And then, how do you want to analyze that abstract history? That decoupling between the abstract model of the world, like "write three to the register named x" and "read the current value of register y", and the database itself is what the analysis tools in Jepsen are written to understand, which means you can reuse the same analyzers for many different databases. A lot of the interpretive work in Jepsen is now: well, what does it mean to read a register in this particular database? How am I going to encode this particular abstract operation? And often there are many different ways to do it, so I'll write lots of different families of tests which encode that sort of abstract operation differently in the database itself.
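Here is a minimal sketch, in Python rather than Jepsen's Clojure, of the decoupling described above: the workload and the checker speak only in terms of abstract operations on named registers, and a per-database client translates those into concrete calls. All of the names here (AbstractOp, SqlClient, the SQL schema and placeholder style, which would suit Python's sqlite3 module) are hypothetical stand-ins rather than Jepsen's real API.

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class AbstractOp:
        f: str      # "read" or "write"
        key: str    # register name, e.g. "x"
        value: Any  # value written, or value observed by a read

    class SqlClient:
        """Encodes abstract ops as calls against one particular database.
        A different database gets a different client; the checker never changes."""
        def __init__(self, conn):
            self.conn = conn

        def apply(self, op: AbstractOp) -> AbstractOp:
            if op.f == "write":
                self.conn.execute("UPDATE registers SET val = ? WHERE id = ?",
                                  (op.value, op.key))
                return op
            row = self.conn.execute("SELECT val FROM registers WHERE id = ?",
                                    (op.key,)).fetchone()
            return AbstractOp("read", op.key, row[0] if row else None)

    def last_write_wins_checker(history):
        """Toy checker over the abstract history only: each read must return the
        most recently written value for its key, in history order."""
        state = {}
        for op in history:
            if op.f == "write":
                state[op.key] = op.value
            elif state.get(op.key) != op.value:
                return False
        return True

The point is that the checker only ever sees abstract operations, so the same analyzer can be reused against any database for which someone writes a client.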
Tobias Macey
0:21:15
It sounds like, from your description of actually running through the workflow, that Jepsen has components that sit on the database nodes themselves in order to be able to communicate back and forth, execute these operations, and record different data points. Given that it is, at least on some level, a distributed system in itself, I'm curious whether you have created a Jepsen test for Jepsen.
Kyle Kingsbury
0:21:37
It's a good idea, but there's no agent, there are no local processes by default; in Jepsen I try to keep things as simple as possible. Everything runs in one bog-standard JVM program on your local node, and all interaction with the remote nodes is done via SSH or Kubernetes or Docker interfaces.
Tobias Macey
0:21:55
Another level of complexity in some of these systems is that some of the failure modes may only trigger when there are certain volumes of data in the system, or if you have a large volume of data in a cluster and one of the nodes goes down and it creates a cascading failure of trying to rebalance information while introducing new data into the cluster. I'm curious if you explore those types of problem spaces as well.
Kyle Kingsbury
0:22:20
Yeah, this is something that Jepsen is really bad at. Jepsen excels at this kind of sweet spot of moderate concurrency, short temporal duration, medium-sized data sets; we're talking about histories of maybe 100,000 to a million operations, tops. But it's very true that a lot of times systems only fall over for large datasets, or for super high concurrency, or super high node counts. Those are important to test, but the difficulty in testing those systems with something like Jepsen is that trying to analyze the history for correctness might be computationally intractable; it could take you millennia, maybe, to analyze it, depending on the property you're trying to look at. So I think what often makes sense is to run Jepsen, or a Jepsen-like system, as part of a suite of tests. If you have distributed safety tests in Cassandra, for example, those exist in conjunction with other tests which do long runtimes and high data volumes but don't try to do rigorous safety analysis. They're just asking: does the system continue to respond to requests, and is latency okay? They're not trying to understand whether it's linearizable, because that problem is NP-hard.
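As a back-of-the-envelope illustration of why large histories resist rigorous checking: if all n operations in a history were mutually concurrent, a naive checker would have to consider every serial order of them, and that count grows factorially.

    from math import factorial

    # Worst case: n mutually concurrent operations admit n! candidate serial orders.
    for n in (5, 10, 20, 100):
        print(n, factorial(n))
    # 5 -> 120, 10 -> 3,628,800, 20 -> about 2.4e18, 100 -> about 9.3e157.
    # Brute force is hopeless well before a million-operation history.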
Tobias Macey
0:23:28
And then, as far as the overall pros and cons that you've seen of building Jepsen in Clojure, I'm curious what you have found in that regard as far as the accessibility of the system and making it easy for other people to use Jepsen for their own purposes, or to contribute new tests for new distributed systems.
Kyle Kingsbury
0:23:49
Yeah. You know, choosing a Lisp is not necessarily the move you want to make if you're trying to get a big community, although I've done it twice now. Riemann and Jepsen have converted systems administrators and database engineers into Clojurists, strangely enough. And when I teach classes on Jepsen, I find that people are usually able to pick up Clojure within a few hours and at least write some sort of basic test following along; obviously it takes longer to become a fluent engineer. But the language isn't necessarily the barrier that I initially thought it was. In terms of suitability, for somebody who's proficient with the language, is Clojure the right fit for this problem? I think that the JVM is a good candidate for testing databases in general, so doing something in Java, Scala, Groovy, Kotlin, Clojure is nice because you have access to JDBC, and almost every database publishes a Java client of some type. As long as you have interop between those languages and the JVM, you can call those clients; from Clojure I can call anybody's Java client, and that generally works pretty well. You also want to have reasonable performance. It doesn't have to be super fast, but you want to be able to go somewhat quick, and you want real threads. So that rules out a single-threaded runtime like JavaScript, and it would rule out a slower language like Ruby or Python; you want something more on the side of Haskell or C or Rust or the JVM, something with a reasonable profiler and some attention paid to performance. And then, because this work is experimental, I think you want a language which is concise and expressive. One of the pieces of feedback I get from people when I teach Jepsen classes is, "I'm really confused by all the parentheses," which is, of course, the classic Lisp critique. It's absolutely valid; it's a new way of doing syntax, and it's confusing. But people really like that Jepsen's API is so compact and that it is easy to read, understand, and explore new schedules. So I feel like the API design has worked out well there, and if I were to design a new version of Jepsen in another language, I would look for something which makes it easy to write compact code.

And you mentioned that there are some considerations for what you would do if you were to rewrite Jepsen today from the language perspective. I'm curious if there are any other elements that you would either redesign or re-architect if you were to start over today and build a similar set of tooling.

Yeah, at the risk of grabbing the traditional third rail of language arguments, a type system would be nice. Obviously Clojure is a dynamic, essentially uni-typed language, with protocols and basically JVM dynamic type checks underneath the hood. But there are some instances in Jepsen where you're going to manipulate a whole bunch of data which has really similar representations for things with the same name: a node might be a string, an internal identifier, a node number, a handle, a client, and you want to discriminate between those things reliably. Having a type system helps there. So I've been using core.typed in some parts of Jepsen, for places where it's hard to keep those things straight, but it might be nice to write it in something like Kotlin or Haskell, where I could have some sort of nominal typing instead of just structural. I think in terms of architecture changes that I would make to Jepsen:
One of the problems that I had for a long time with Jepsen was the generator system, the part that schedules operations and emits concurrent things for the database to do. That system was initially built as a mutable object that every thread, every worker in the system, would ask for an operation. So each thread says, hey, generator, give me something to do, and the generator would mutate some state and spit out an op. Internally, if you wanted to do something like wait for two minutes, you would just call Thread.sleep, and the thread scheduler would take care of making sure that the thread went to sleep and woke up at the right time. This is intuitive, and it's easy to write, but it can lead to really tricky cases of deadlock, because if you want to interrupt a thread that's sleeping, or a thread that's engaged in some sort of barrier where it's blocking awaiting other threads, you run the risk of maybe double-interrupting and knocking a thread out of the interrupt recovery code itself. It becomes devilishly hard to get right.
0:28:05
The other big problem I had with the generators was that they weren't responsive to things that happened in the world. The generator was this pure source of information: it told you what to do, but it didn't take any feedback. That meant that you couldn't write something like "try writing values until you see at least one success", and that's an important thing to do; if you're going to be testing a database, you might want to say, keep writing until you actually get something, so that your test means something at the end. The way you would work around that limitation would be to have some mutable state, so the client and the generator would share a piece of mutable information, some variable that they would both update and read in order to change the generator's behavior. But that introduces this kind of ugly coupling, and it was confusing for users. So I've redesigned the generator system as a purely functional effect system with an interpreter. I thought that this was going to be a terrible idea, and it may yet prove to be, but so far it seems to be better. It means that we can do all the mutation in a single place, which eliminates all the thread scheduling ugliness and deadlocks that we had with the original generator system. And it also means that the generator can take feedback: it can be updated with each operation's completion as well as its invocation. This might be way more than you were hoping I'd talk about, but it's my big architecture change in Jepsen for the time being.
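A rough sketch of the "purely functional generator" idea, in Python rather than Jepsen's Clojure: the generator is an immutable value, a single interpreter loop asks it for the next operation and feeds each completion back into it, and "keep writing until one write succeeds" becomes ordinary data flow instead of shared mutable state. The interfaces below are invented for illustration and simplified down to a single thread; they do not mirror Jepsen's real generator protocol.

    from dataclasses import dataclass, replace

    # Hypothetical, simplified model of a pure generator: op() returns the next
    # operation (or None when done), and update() returns a NEW generator that has
    # absorbed a completion event. No locks, no Thread.sleep, no shared state.

    @dataclass(frozen=True)
    class WriteUntilSuccess:
        value: int
        done: bool = False

        def op(self):
            return None if self.done else {"f": "write", "value": self.value}

        def update(self, completion):
            if completion["type"] == "ok":
                return replace(self, done=True)  # saw one success; stop emitting writes
            return self                          # failure or timeout: keep trying

    def interpret(gen, client):
        """Single-threaded interpreter loop: invoke ops, feed completions back."""
        history = []
        while (op := gen.op()) is not None:
            completion = client(op)              # e.g. {"type": "ok", "f": "write", ...}
            history.append((op, completion))
            gen = gen.update(completion)
        return history

    # Example with a flaky client that fails twice and then succeeds.
    attempts = iter(["fail", "fail", "ok"])
    flaky = lambda op: {"type": next(attempts), **op}
    print(interpret(WriteUntilSuccess(3), flaky))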
Tobias Macey
0:29:19
It's always interesting getting behind the scenes of the problems that exist in a particular tool, because as a user it's easy to gloss over those things, or not understand the problem space well enough to know what is wrong with a given implementation. But as somebody who has spent so much time in the code, there are always going to be warts, or things that you don't like, or things that you would do differently, and it's interesting to get some of that perspective.
Kyle Kingsbury
0:29:46
You know, it's funny, Jepsen's architecture has been basically unchanged for about five years. I take backwards compatibility really seriously, and this generator change just went live about a month or two ago, and it was the biggest kind of breakage in Jepsen's API. But even then it doesn't change the structure of tests: it's still a generator, it still has this handy composition. I feel like this is the third iteration, and it has been reasonably stable; it does seem like the right way to break up the problem. But there's a lot of experimentation, and bad design choices that I'm still making, in things like the checkers, the bits that verify histories and tell you if there are problems with the properties. So I am currently knee-deep in the weeds on a new checker based on Elle for doing set verification in transactional systems, and that puppy, I have no idea how to write it correctly. This is my third attempt; maybe this time it will work.
0:30:40
That tends to be a never-ending process.
Tobias Macey
0:30:43
As somebody who has been using and working on Jepsen for quite a while now, and who has worked with a number of different companies and open source projects on various types of distributed systems, you have possibly the best perspective on the overall problem space and the ways that it's being tackled. I'm wondering what you see as some of the most common failure modes that exist in platforms, and some of the ones that continue to crop up and continue to be overlooked, maybe in initial attempts at building some of these types of platforms.
Kyle Kingsbury
0:31:20
Yeah, and this has shifted over time; the industry is getting a lot better at building distributed systems. This is a little tricky. There is a class of what you might call an error, where people rely on the ordering of clocks: they assume the clocks are roughly on time, that they track time at the same rate, and then they use that to build, say, leases for leaders. This is safe and correct so long as the clocks are correct, but we don't have a lot of empirical data on how often clocks are correct and how badly they go wrong; I would like to see more of that. In the meantime, it seems valid that you would write a database based on those assumptions and make the clock-skew thresholds tunable, but there is this sort of uncomfortable feeling: well, eventually the clocks go wrong, and then what happens to my data? Am I okay with that? So this is sort of an active field of development. I think the people who are choosing to do clock stuff at this point, in databases like YugabyteDB and CockroachDB, it's all blurring together at this point, they all rely on clocks, I believe, for leader leases in different ways, and they're making that choice in an informed way. They know that the cost could be bad. So I don't think that's necessarily an error; it's maybe more a choice. One of the more serious errors that I see nowadays tends to be around understanding indeterminate results. Oftentimes there will be some sort of RPC mechanism inside of a database where one component is supposed to send a message to another and get a response, and if it doesn't get the response it expects, that's an error. But a lot of times people will interpret that error as meaning that the operation did not take place and that it is therefore safe to retry it. Sometimes it is safe to retry operations, but sometimes if you retry them you wind up introducing some sort of anomaly, like a double execution, or violating transactional isolation. So several of the bugs that I encountered recently had to do with internal retry mechanisms, or client-level retry mechanisms, which didn't understand that they weren't allowed to retry in that context. Tracking that stuff safely is something that I talk about with most of my clients, and most of them understand it and are trying to do it. But it's hard to keep track of all the different ways in which things can fail, because there are just so many different, you know, terrible things that computers can do to you.
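To illustrate the indeterminate-result trap, here is a small hypothetical sketch: a timeout does not mean the operation failed, so a client that blindly retries a non-idempotent operation can execute it twice, while a careful client records the outcome as unknown and only retries when the operation can be made idempotent. The transfer API and the idempotency_key argument are assumptions for illustration, not any particular database's interface.

    import uuid

    def naive_transfer(db, src, dst, amount):
        # Risky pattern: treating a timeout as "the operation did not happen".
        # If the debit already committed server-side, retrying moves the money twice.
        while True:
            try:
                return db.transfer(src, dst, amount)
            except TimeoutError:
                continue  # unjustified assumption: "it failed, so it's safe to retry"

    def careful_transfer(db, src, dst, amount):
        # Record indeterminacy instead of assuming failure. Retrying is only safe if
        # the operation is idempotent, e.g. deduplicated by a unique key, assuming
        # the database offers such a mechanism (hypothetical keyword argument here).
        op_id = str(uuid.uuid4())
        try:
            db.transfer(src, dst, amount, idempotency_key=op_id)
            return {"type": "ok"}
        except TimeoutError:
            return {"type": "info"}  # outcome unknown; a checker must allow both cases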
Tobias Macey
0:33:41
Yeah, computers are wonderful and awful, all at the same time, and generally more awful than not for the people who are trying to actually build them and work on them.
Kyle Kingsbury
0:33:51
Oh, for sure. I have so much respect for distributed systems engineers, people who are trying to build databases, because it is this terribly complicated problem where you have to think of everything. And here I come like some sort of rampaging toddler knocking it all over.
Tobias Macey
0:34:08
Yeah, and yet they pay you to go and break things for them and tell them what they did wrong.
Kyle Kingsbury
0:34:11
I am weirdly effective at it. I don't know, the computers around me seem to break more than other people's. In fact, I'm currently in some sort of bureaucratic hellscape because of some sort of bank computer problem, and it seems like every week I discover some new, fresh way that computers have gone terribly wrong. I wasn't able to use my browser for, like, the last two weeks.
Tobias Macey
0:34:36
Well, perhaps when you're done with distributed systems research, you can move on to security.
Unknown
0:34:41
Yeah. Don't get me started. Yes.
Tobias Macey
0:34:44
And then, in terms of the problems that you identify, what are some of the classes of issues that you have found to be the most difficult to resolve in some of these distributed systems?
Kyle Kingsbury
0:34:55
The ones that are difficult to fix once you understand them are generally architectural problems, where it's like, oh, we have to rip apart the whole thing and replace it with some new algorithm. But I do think that folks are using more and more proven algorithms nowadays. I see a lot of use of Raft, I see more deployment of CRDTs, I see people using Paxos, and that eliminates a whole class of errors where people just made up some sort of consensus system and hoped that it worked. The thing that we don't know how to do yet is to build transactional protocols on top of those individual consensus systems. There's a lot of active research and a lot of industrial effort going into that, and I think that's a place where people will make tricky mistakes. Which is, again, not to say that the engineers doing this are bad; it's that they're tackling an almost impossible problem, a huge failure space, and encountering the kind of obvious difficulties in solving that impossible problem.
Tobias Macey
0:35:55
And what are some of the areas of development, or interesting new entries in distributed systems design and research, that you're keeping an eye on and that you think are going to have a meaningful impact on the industry?
Kyle Kingsbury
0:36:11
As far as industry stuff, it does seem like we've all sort of caught up to Leslie Lamport in the late '80s and early '90s. We now understand that the state machine approach to consensus systems, where you run consensus on each operation, you form a log, you execute the operations in order, is an effective way to build a state machine. It's how most people use Raft, and that seems well understood. But what's not well understood is: how do you do operations between those state machines and still make them legal under the laws of your system? One approach to that is Calvin, which is implemented by FaunaDB, and there's some recent work, I think from Abadi's group, called SLOG, which looks really cool; it's sort of a generalization of this idea. Oh gosh, it's been so long since I read the paper, I'm not even going to try to explain it.
Tobias Macey
0:37:06
I will put a link to the paper in the show notes.
Kyle Kingsbury
0:37:08
But it was cool, yeah. Another thing that I'm really excited about is revisiting core ideas about consensus. Heidi Howard has been doing really cool research into: do you actually need majority quorums to do consensus? We all thought so, but it turns out that that's an assumption; it's not required for the proof of Paxos. And there are generalizations of Paxos where maybe, instead of agreeing on one value, you agree on a set of values and you can skip around during contention. So I'm excited to see those ideas come into play, because in some cases it looks like you could do fewer round trips, or shorter round trips, more local round trips, in exchange maybe for a longer trade-off during failover between domains. Another thing that I'm really excited about is CRDTs. I know I've been plugging this for a decade now, but I think industry hasn't really caught up to this idea of commutative programming, and it is, at least the CALM conjecture suggests, the way that we need to handle geographically replicated systems, where latency overhead prevents us from using a consensus mechanism or some other sort of strong consistency. So I'm excited to see people doing work with causal consistency, having commutative operations which get merged together, whether via operational transforms or CRDTs or some related structure. I want to see more of that.
Tobias Macey
0:38:28
And you have been running Jepsen tests for a while now, and you have come to be seen as a very strong signal of correctness or reliability in a system, where different database vendors will publish the fact that they have completed a round of Jepsen testing and that they have resolved the different bugs that were identified by it. I'm curious what you see as being the overall impact that your work on Jepsen has had on some of the modern distributed systems products that we're using today.
Kyle Kingsbury
0:38:59
This is just tricky because, you know, obviously I have the desire to toot my own horn; like, I am the most important person in my own story. But I think, realistically speaking, I am a small part of a very large movement to do more correct distributed systems. People are advancing formal methods: Hillel Wayne, for example, has been pushing the use of TLA+, and Christopher Meiklejohn has been working on Coq and other ways of doing verified systems. People in industry who are writing distributed systems are reading the literature and advancing, and I guess I can maybe claim some small portion of that, but I think it's a lot of people doing a lot of work.
Tobias Macey
0:39:40
As you have been working on building Jepsen and evaluating some of these mission critical systems, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned, or outcomes of the work that you've done?
Kyle Kingsbury
0:39:54
One thing that was tricky for me to understand when I first started this work, and it still surprises people when I teach classes, is this notion that you have to record a concurrent history when you're doing a test. A lot of people are used to doing tests of stateful systems by performing some series of operations in sequence and then asserting that the state is a certain way. So I want to test a counter: I add one, two, three, I do a read, and the value should be three. And that's how a lot of distributed systems tests are written at the very first pass, when you're just trying to figure out whether the thing turns on and does its job. But when you're trying to do more sophisticated verification, you have to do those operations concurrently, and maybe if the read is concurrent with the write, you could either observe it or not observe it; now there are two possible outcomes. Realizing that I had to track that concurrency, and that I had to track indeterminacy, to have a notion that an operation may or may not have completed, and that both of those things have to be taken into account, that was a really subtle and tricky problem for me to wrap my head around initially. The other thing that I've really struggled with, and this is more recent, something that I've been trying to internalize over the last year, is: what is it exactly that you are proving when you say such and such an anomaly exists in a system? Because anomalies, at least in the academic sense, are properties of a formal system: a history is a set of operations, and each operation has a start time and an end time, and they have some sort of internal order. That's not how real systems work. Real systems are computers; they're sand that we tricked into thinking, quartz crystals we imprisoned and electrocute. There's a mapping between the abstract system and the real computer system you're testing, but that mapping isn't always faithful. Nor is the system you're measuring the actual observation you have; there's a third thing, which is the history you record as the client. So when I say, I see this particular anomaly, I see a dirty read, or I see a G1c cycle of information flow between transactions, what I'm saying is that, were there to be an abstract system and an abstract execution, in the Adya formalism, say, which was executed by this real physical computer system, and if those two things were truly isomorphic, and if all the information that I got from the system is accurate, if I wasn't lied to about the results of any of my writes or reads, then there cannot possibly have been an execution which did not contain this particular anomaly. I know this sounds really pedantic, but being forced to write it out, to develop a formal model of that separation between the abstract model, whatever the computer did, and what you observe of the system, has been critical in the research I've done with Elle to do transactional isolation checking, and it has changed the way that I talk about anomalies in reports. I can say something like: this appears to evidence such-and-such an anomaly, but it could also be a dirty read, and I'm choosing the most charitable interpretation of that evidence.
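A tiny sketch of that recording idea, using made-up record shapes rather than Jepsen's actual history format: every operation is logged as an invocation followed by a completion, and a timed-out operation completes with an "info" entry, which a checker must treat as "may or may not have taken effect".

    import threading, time

    # Hypothetical history recorder: each entry is (time, thread, type, op).
    # "invoke" marks the start; "ok"/"fail" a known outcome; "info" an unknown one.
    history = []
    lock = threading.Lock()

    def log(thread_id, entry_type, op):
        with lock:
            history.append((time.monotonic(), thread_id, entry_type, dict(op)))

    def run_client(thread_id, op, apply_op):
        log(thread_id, "invoke", op)
        try:
            log(thread_id, "ok", apply_op(op))
        except TimeoutError:
            # Crucially NOT "fail": the write may still take effect later.
            log(thread_id, "info", op)

    def flaky_write(op):
        raise TimeoutError         # pretend the request timed out

    def read(op):
        return {**op, "value": 3}  # pretend the store returned 3

    t1 = threading.Thread(target=run_client, args=(1, {"f": "write", "value": 3}, flaky_write))
    t2 = threading.Thread(target=run_client, args=(2, {"f": "read"}, read))
    t1.start(); t2.start(); t1.join(); t2.join()

    for entry in history:
        print(entry)
    # A checker must accept the read seeing 3 OR not seeing it: the timed-out write
    # is concurrent with the read and may or may not have been applied.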
Tobias Macey
0:43:12
Yeah, it's interesting trying to reason about these systems that have potentially unreasonable behavior because of the fact that physics is a thing. And we have to contend with that in these sort of idealized systems that we're trying to build and rely on.
Kyle Kingsbury
0:43:27
Yeah, physics is the thing. I have a sort of pet peeve there; this is a bit of a tangent, maybe not related here, but people often say, oh, relativity means there's no such thing as synchronized clocks. This is bullshit. Relativity gives us the equations for doing clock correction and figuring out what time it is. The problem is not the fabric of spacetime; the problem is that the clocks are wonky in general.
Tobias Macey
0:43:54
And in terms of the work that you're doing on Jepsen and your overall work with distributed systems, what are some of the things that you have planned for the future, or projects that you look forward to being able to start or continue on?
Kyle Kingsbury
0:44:07
There's an open problem in my head around Elle, which is my checker for looking at transactions and telling you if they interleaved improperly, if they were not serializable, or had non-repeatable reads, or violated snapshot isolation. And that is: I understand the Adya formalism for individual registers. I have variables x, y, and z, I want to do transactions over them, that's great. What about a transaction which says, read the current value of all registers which have an odd value in them right now? That's a predicate, and predicates are a part of the Adya formalism which I completely ignored in my initial pass, because I had no idea how to solve them. It would be nice if I could model predicates in a history and analyze them in a way which lets me tell if there are anomalies, in the same way that I've done for individual registers. I have no idea how to do this; it's open research.
Tobias Macey
0:45:04
And as you continue to work on Jepsen, I'm curious what you have planned for the future of that project, or other projects or research in terms of distributed systems that you're looking forward to either continuing or starting anew.
Kyle Kingsbury
0:45:20
The transaction checker, called Elle, has been remarkably productive in finding bugs. It's uncovered all kinds of behaviors that we couldn't see with earlier attempts at tests in Jepsen. However, there's a key problem: we can only identify anomalies at the key-value level. Jepsen's, or Elle's, use of the formalism from Adya works over keys and values, like x equals two, y equals three, and you can do transactions over x and y. But you cannot express a transaction over, say, all currently odd registers; that's a predicate. Adya's formalism includes predicates and talks about anomalies over them, and it would be great if I had some first-class way to model those predicates and check their correctness, but I don't know how that's going to happen yet. That's my current research project. That doesn't mean we can't test predicates against real databases: you can still make a request to the database using a predicate in order to perform what looks like a key-value read in an abstract sense, and that lets us test whether, say, secondary indexes are correct. But we've got no first-class representation of them at the checker level, and I'd really like to get that far.
Tobias Macey
0:46:31
And for people who want to dig deeper into some of the transactional guarantees or problems in distributed systems, what are some of the resources that you have found valuable or that you tend to recommend to people who are getting into this area?
Kyle Kingsbury
0:46:46
I've actually written a class which I teach professionally to organizations, but the outline for it is remarkably comprehensive and has lots of links to other resources. So if you search for distsys-class on my GitHub, which is aphyr, a-p-h-y-r, there's a whole introduction to the field. I also like Mixu's book, that's M-I-X-U. And Christopher Meiklejohn has a really wonderful reading list of papers and introductions.
Tobias Macey
0:47:13
All right, I'll add links to all of those for people who want to dig deeper. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Kyle Kingsbury
0:47:36
Right now, databases are often tested via ad hoc tests, or via some sort of generative, property-based testing like Jepsen, or maybe Jepsen itself. But there are fewer tests of the abstract underpinnings, either via simulation or via model checking or a proof system. I'd like to see model checkers and proof systems become more tractable for real industry people, who maybe don't have the advantage of mathematicians sitting at their side to help them write the proof. I myself struggle a lot to write proofs in Isabelle, or to write models in TLA+, and I would love it if these systems, which I think have a lot of promise for industry, were more accessible.
Tobias Macey
0:48:23
All right. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Jepsen, it's definitely a very interesting problem domain and a useful approach that you've built to it. And it's something that I see mentioned very frequently throughout different database platforms and distributed systems. So I appreciate all the time and energy that you've put into that and I hope you enjoy the rest of your day.
Kyle Kingsbury
0:48:44
Thank you, I hope that was useful.
Tobias Macey
0:48:51
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!