Pay Down Technical Debt In Your Data Pipeline With Great Expectations - Episode 117

Summary

Data pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also difficult to test, which leads to a significant amount of technical debt and contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at dataengineeringpodcast.com/linode or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Great Expectations is and the origin of the project?
    • What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
  • Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
  • What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
    • What are some of the non-obvious use cases for Great Expectations?
  • What aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?
  • Can you describe how Great Expectations is implemented?
  • For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
  • What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
  • Can you talk through some of the ways that Great Expectations can be extended?
  • What are some notable extensions or integrations of Great Expectations?
  • Beyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?
  • What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
  • What are the limitations of Great Expectations?
  • What are some cases where Great Expectations would be the wrong choice?
  • What do you have planned for the future of Great Expectations?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript
Tobias Macey
0:00:12
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines, which helps you continually monitor and validate the integrity and quality of your data. So James, can you start by introducing yourself?
James Campbell
0:01:50
Absolutely. It's great to be here. Like I said, my name is James Campbell and I am currently working at Superconductive. We are really pivoting at this point to focus pretty much full time on Great Expectations and helping to build out data pipeline tests. My background, academics-wise, was math and philosophy, and I spent a little over a decade working in the federal government space on national security issues, focusing on analysis, first of cyber threat related issues, and then political issues.
Tobias Macey
0:02:21
That's an interesting combination of focus areas between math and philosophy. And I'm also sure, from your background of working in government and national security issues, that there was a strong focus on data quality and some of the problems there. So it seems logical that you would end up working in this space, with Great Expectations as something that would be a primary concern given your background.
James Campbell
0:02:44
Absolutely. I mean, for me, I've always been very interested in understanding how we know what it is that we know, and how we can really convey the confidence that we have in raw data and assessments on that data to other people. That's been a really continuous thread for me. Especially, like you mentioned, in the intelligence community there is a tremendous focus on what's often termed analytic integrity, or ensuring that the entire process around understanding a complex situation is really clearly articulated to ultimate decision makers. And I think that's an important paradigm for any analytics or data related endeavor, because we need to be able to ensure that we're picking the right data for the job, that we've got all the data that we need to be able to understand what we can, and of course, that's often not the case. And so we end up needing to find proxies or in other ways compromise. And so then we need to figure out how we can convey that effectively to the people who will ultimately be using the data for some sort of a decision.
Tobias Macey
0:03:46
And do you remember how you first got involved in the area of data management?
James Campbell
0:03:49
Absolutely. I think for me, there was a pretty pivotal time when I was leading a team of data scientists who were training models and sharing them out with their colleagues. And in addition to the actual model building process, we had a whole focus on education and training for people who would be able to use these kinds of more sophisticated analytic models in their day to day work. And there were a lot of discussions on the team about how we could effectively help them understand what kinds of data requirements were in place for the kinds of questions that they wanted to ask. And what I found was that that really meant that we needed to be able to not only understand the data that we had available, but also understand the kinds of data that other people had available. And that meant that there was a huge amount of essentially metadata management and knowledge management and sharing that needed to happen. And when you do that across teams and across organizations, that really amplifies the challenge, but of course, I think that also makes it a much more interesting and exciting area to get to work in.
Tobias Macey
0:05:02
And so, in terms of Great Expectations, it definitely helps to solve some of these issues of the integrity and quality of the data. And I know that particularly with some of the recent releases it helps to address some of the issues of communications around what the data set contains, and some of the metadata aspects of where it came from and what the context is. But before we get too deep into that, I'm wondering if you can just give a broad explanation about what the Great Expectations project is, and some of the origin story of how it came to be.
James Campbell
0:05:33
Absolutely. With Great Expectations, we use the tagline "always know what to expect of your data." And it's a project that helps people really clearly state the expectations that they have of data as well as communicate those to other people. I mentioned that experience I had of leading a team where we were doing model building, and one of the key parts of the origin story was wanting to basically be able to say, you know, this data should have a particular distribution or particular shape around certain variables if you want to be able to ask a question. And I found that, again and again, what I wanted to do was not just ship people, you know, an API where they put data in and they get an answer back, but rather ship them an API that they could use responsibly and confidently. That was a key part of the origin of Great Expectations for me. At the same time, my colleague Abe Gong was working on healthcare data, and he again and again was observing breakages in data pipelines that they were building, where another team, usually because they were making an improvement or catching an error, would make a change to an upstream data system. And that would trickle down and cause breakages throughout the rest of the data pipeline. And he and I were talking about these related problems. And of course, both of us had experienced the other one. But we realized, well, you know, both of these would be really well served by this sort of a declarative approach to describing expectations. And, you know, that's the origin of the project, just realizing that this problem was so general that it was coming up not just in different places in different industries, but with different manifestations, right? The same underlying problem has these kind of different symptoms.
Tobias Macey
0:07:29
And about two years ago now, I actually had both you and Abe on an episode of my other show where we talked about some of the work that you were doing with Great Expectations. And I know that at the time, it was primarily focused on being used in the context of the pandas library in Python, and some of the ways that people were doing exploratory analysis within pandas and then being able to integrate Great Expectations into that workflow and have a set of tests that they could assert at runtime later on, once they got to production. So I'm wondering if you could just talk a bit about some of the changes in focus or the evolution of the project since we last spoke?
James Campbell
0:08:07
Absolutely. It's amazing how much things have evolved in those two years, and especially in the last six months or so. Just like you said, originally our focus was very much on validation and on supporting exploratory data analysis and that sort of a workflow, and also we were very much pandas-centric. There's been an evolution in a lot of different dimensions. The first one is just in terms of the kinds of data that Great Expectations can interact with. From the very beginning of the library, we tried to ensure that expectations were always defined in a way that wasn't specific to pandas. It wasn't about any particular form of the data. It was really about the semantics of the data, how people understand it, what it means for the context of a particular analysis. So we've been able to realize that goal a little bit by expanding to now support SQLAlchemy, and by extension, all of the popular big SQL databases. So we have users running Great Expectations on Postgres, on Redshift, on BigQuery. Also, we've expanded into Spark. So whether that's Cloudera clusters or Spark clusters that are managed by teams, or Databricks, we've got users being able to evaluate expectations on all of those. And I think, again, one of the things that's really neat is actually it's the same expectations, right? So the same expectation suite now can be validated against different manifestations of data. So if you have a small sample of data that you're working with on your development box in pandas, and then you want to see whether the same expectations are met by a very large data set out on your Spark cluster, you can just seamlessly do that. The next big area that we've pushed is in terms of integrations. It was a big pain point for users, I think, to figure out how they can actually, you know, stitch Great Expectations into their pipelines. And so we've done a lot of work in creating what we call a data context, which manages expectation suites, manages data sources, and it can bring batches of data together with expectation suites to validate them, and then store the validation results, put them up on cloud storage. So you could, you know, have all of your validations, for example, immediately loaded to S3 and available to your team. So that's been another big area of development. And, again, I know I'm just going on and on here, but there are a couple other big areas. One of them has been the thing that I think has really helped people resonate with Great Expectations and see things move forward, which is data docs, as we call it, or the ability to generate human readable HTML documentation about the artifacts of Great Expectations, so about expectation suites, about validation results. So we basically generate a static site that you can look at, and it really helps you get a quick picture of what you have in your data so that you can share it with your team. And then the last area is profiling. We've done a lot of work to make it so that you can use Great Expectations, the library, before you've really zeroed in on what the expectations are. So it becomes this iterative process of refinement, where in an initial profiling, we basically say, you know, I expect these hugely broad ranges, I expect the mean of a column's values, for example, to be between negative infinity and positive infinity. Well, obviously, that's true.
But as a result of computing that, we give you the metric, what the actual observed mean is, and you can use that, especially when you're combining that with documentation, to profile and get a really robust understanding of your data right away. So there's a lot there. There's a lot of innovation and work that we've been able to do, and it's been a really fun thing to get to focus more on the project. Yeah, the profiling in particular, I imagine, is incredibly valuable for people who are just starting to think about how do I actually get a handle on the data that I'm using and get some sense of what I'm working with, particularly if they're either new to the project, or if they've just been running blind for a long time and want to know, how do I even get started with this? Absolutely. I think one of my favorite things that I see in our Slack channel a lot is when somebody will say, you know, I ran this expectation and, you know, it's failing and I don't know why. And then they look into their data, and it's, well, because it's not true. And I never cease to love that sense of surprise and excitement that people have when they really encounter their data in a richer way or in a way that they hadn't seen it before. What profiling does is it just makes that happen across a whole bunch of dimensions all at the same time. Exactly. Right. I think more and more what we're finding is, when a user is first getting started with Great Expectations, it was intimidating to sit down at a blank notebook and figure out where do I go, how do I get started. And so now what they can do with profiling is start off with a picture of their data set. You know, they get to see some of the common values, and which columns are of which types and distributions, and it really gives them a way to dive right in. And then we can actually generate a notebook from that profiling result that becomes the basis of a declarative exploratory process. So we actually can sort of guide you through some of the initial exploration that makes sense, based on the columns and types of data that you have.
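To make the workflow described above concrete, here is a minimal sketch of the declarative, pandas-backed usage: build expectations interactively on a small sample, capture them as a suite, then replay the same suite against another batch. It follows the 0.x-era API that was current around this episode; the file names and thresholds are invented for illustration, and method details may differ in later releases.

```python
import great_expectations as ge

# Explore a small sample interactively, declaring expectations as you go.
sample = ge.read_csv("orders_sample.csv")
sample.expect_column_values_to_not_be_null("order_id")
sample.expect_column_values_to_be_between("order_total", min_value=0, max_value=10000)
sample.expect_column_values_to_be_in_set("status", ["pending", "shipped", "cancelled"])

# Capture the declared expectations as a reusable suite.
suite = sample.get_expectation_suite(discard_failed_expectations=False)

# Later, or on a much larger batch, validate new data against the same suite.
new_batch = ge.read_csv("orders_full.csv")
results = new_batch.validate(expectation_suite=suite)
print(results["success"])
```

The point James makes in the conversation is that the suite itself is just declarative metadata about the data, so the identical suite can later be evaluated by the SQLAlchemy or Spark backends instead of pandas.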
Tobias Macey
0:13:38
And going back to the beginning of the project, at the time that you were working with Abe to define what it is that you were trying to achieve with Great Expectations, I'm wondering what your experience had been as far as the available state of the art for being able to do profiling or validation and testing of the data that you were working with, and maybe any other tools or libraries that are operating in the space, either at the time or that have arrived since then?
James Campbell
0:14:07
Great question. I think there's a lot of really good practice that I've seen, you know, in this broader space. I think one of the things that's important to mention is there's a huge amount of this kind of work that just happens out of band. So it's not so much that there is a tool for it even, it's that what we see are people having meetings, coordination meetings, data integration meetings, big Word documents. So, you know, even though it's not an exotic thing to say, I think it's really important to mention that and remember that as a kind of core part of the original state of the art for how people work on this. The other thing is there's a lot of roll your own. A lot of people who are writing tests for their data just put it into their code, and it's kind of indistinguishable from their pipeline code itself, which was one of the key problems that we wanted to solve with Great Expectations: making sure that tests could focus on data instead of being part of the code, and have a strong differentiation there. But it's still an important part. So a lot of times when I talk to people about their current strategy for pipeline testing, they say, oh yeah, we absolutely do test, it's just that we've written a lot of our own. There definitely are some commercial players in this space. A lot of the big ETL pipeline tools, you know, whether that's open source, like NiFi, or some of the big commercial players, have data quality components or plugins that you can use. And I think that's obviously a best practice, to have some sort of tests. I think that's also really important. You know, I think there are a lot of things that people do with just metadata management strategies. So, you know, capturing, whether that's in a structured way or a little bit less structured, you know, a metadata store for data sets. So I see a lot of work there, which is really exciting. And then the other thing I'll say, specifically to your question about how things have evolved, is there are two other open source projects that I see having come out of some really big companies that I think are exciting. You know, the investment that they're making and their willingness to open source really demonstrates the scale of the problem and interest here. The ones I'd mention specifically are Deequ from Amazon, and TensorFlow Data Validation, which, of course, is very TensorFlow specific, but also gives you a lot of ability to have that kind of insight into the data that you're observing and how it's changed.
Tobias Macey
0:16:31
And one of the values specifically that Great Expectations has, which I think is hard to overstate, is what you were saying earlier about being able to build the expectations against a small data set, and then being able to use those same assertions across different contexts because of the work that you've done to allow Great Expectations to be used on different infrastructure, whether it's just with pandas on your local machine, or on a Spark cluster, or integrated into some of these workflow management pipelines. And so more broadly, I'm wondering if you can just talk about some of the types of checks and assertions that are valuable to make and that people should be thinking about, and some of the ones that are notable inclusions in Great Expectations out of the box.
James Campbell
0:17:18
Yeah, great question. And I think, you know, to your point about the ability to work across different sizes of data, I would also add really quickly that I think one of the neat ways that we see that is just making sure that teams are working with the same data. So sometimes it's not even about there being different sizes, but, you know, when data got copied from one warehouse to another, did all of it move, for example. So to that end, I think one of the most important tests that is easy to overlook is missing data. You know, do we have everything that we thought we would have, both in terms of columns and fill, as well as in terms of, you know, data deliveries. One of the things that is really powerful, of course, about distributed systems is they're very failure tolerant. But one of the things that can mean is that failures are silent, too, especially small batch delivery failures. You know, I think set membership is another area where there are, you know, really basic but really important kinds of testing. And then I think the more exotic things around distributional expectations are also really important. With respect to some of the notable expectations that are in Great Expectations, I think I would actually highlight things that people have added or extended using the tool as probably the most exciting and innovative parts. So one of the things that I've seen that I really thought was neat was a team that basically just took a whole bunch of regular expressions that they were already using to validate data, but that were basically inscrutable, and used those to create custom expectations that were very meaningful to the team. So for example, you know, we could say that I expect values in this column to be part of our normalized log structure, and that corresponds to being able to be, you know, matched against a bunch of regexes or have some other kind of parsing logic applied. But what it's doing is translating between something that a machine is really, really good at checking, but is very hard for a person to understand what it means, to something that a human really understands immediately, intuitively. It has all the business connection for them, but would have been very hard for a machine to verify. So it's really, I think, exciting to be able to help link that up.
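The categories James lists map onto built-in expectations. A brief sketch of each, again using the 0.x-era pandas API; the column names, thresholds, and regex are invented for illustration.

```python
import great_expectations as ge

batch = ge.read_csv("events.csv")

# Missing data: did every row and column we were promised actually arrive?
batch.expect_table_row_count_to_be_between(min_value=100_000, max_value=None)
batch.expect_column_values_to_not_be_null("event_id")

# Set membership: categorical fields stay within the agreed vocabulary.
batch.expect_column_values_to_be_in_set("country", ["US", "CA", "MX"])

# Distributional: a numeric column keeps roughly the shape we profiled earlier.
batch.expect_column_mean_to_be_between("session_length_sec", min_value=30, max_value=600)

# A regex-backed check of the kind described above, giving an inscrutable
# pattern a name the business understands.
batch.expect_column_values_to_match_regex(
    "log_line", r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z\s+\w+\s+.*$"
)
```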
Tobias Macey
0:19:43
And I think that goes as well into some of the non-obvious ways that Great Expectations can be beneficial to either an individual or a team who are working with some set of data. And I'm wondering if there are any other useful examples of ways that you've seen Great Expectations used that are not necessarily evident when you're just thinking specifically in terms of data testing and data validation?
James Campbell
0:20:05
Absolutely, I think
0:20:06
one team that I saw do something I thought was really clever was they effectively built an expectation suite that didn't have most of the actual expectation values filled in. And then they took that kind of template, and effectively it became a questionnaire: what is the minimum value for the churn rate that would be unusual? And they sent that to their analysts, the people who are consuming reports built from the data that they managed, in order for them to fill it in. And so it became a way for there to be a really structured conversation around what these two different teams understood about the data domain and how it should appear, that helped the engineering team understand the business users better and the business users understand the kinds of problems that the engineering teams were facing. So I thought that was a really fun use case. Another one that I have seen that I think is really neat is what I call the pattern of fit to purpose. And the basic idea is, you know, we talk about pipeline tests as something you can run in a pipeline, but usually, you know, before your code or after your code. And just like with a pipeline, there's a lot of branching; with expectations, it's the same. You know, there are teams that are going to use a data set in one way, and so they have some expectations, and another team may use the same data in a very different way and end up having very different expectations around the same data. So that ability to elucidate that reality of the data is really fun.
Tobias Macey
0:21:44
Yeah, I definitely see that Great Expectations has this sort of dual purpose: one, the operational characteristics of working with your data, of ensuring that I'm able to know at runtime that there is some error, because either the source data has changed or there's a bug in my manipulation of that data, and so I want to know that there are these invariants that are being violated. But at the same time, it acts as a communications tool, both in terms of the data docs that you mentioned, but also just in terms of what tests am I actually writing and why do I care about them, and then using that as a means of communicating across team boundaries so that everybody in the organization has a better understanding of what data they're using, and how, and for what purpose? Exactly. Along those lines too, it's interesting and useful to talk about some of the aspects of data pipelines in terms of what are some of the cases that can't be validated in a programmatic fashion, and some of the edge cases of where Great Expectations hits its limitations and you actually have to reach for either just a manual check or some other system for being able to ensure that you have a fully healthy pipeline end to end, and just some of the ways that Great Expectations either can be bent to fit those purposes or needs to be worked around or augmented.
James Campbell
0:22:58
Yeah, that's a tough question, obviously, for somebody who's building something, because I love to think that it's great for everything. But, you know, of course, there are a lot of really challenging areas. One of them that I think is really interesting, and that has been a lot of discussion inside our team, is the process of development of a pipeline. So while you're actually still doing coding, it's very tempting, I think, to blur the line between what is effectively a unit test or an integration test around a pipeline and a data pipeline test. So I think Great Expectations is probably not the right choice when you're doing that kind of active development, because you may not yet know what the data should look like, and while you're doing that very rapid iteration, it's probably even too much for that. But another area that I think is interesting for this reason is anomaly detection. Anomaly detection is really tempting with Great Expectations, and I want to love how strong of a tool we are there. At the end of the day, though, I think anomaly detection always involves a huge precision versus recall trade off that needs to be dealt with. And so when you have a situation where you want to use Great Expectations, and what you're observing is that your underlying question is, well, you know, what is the precision and recall trade off that I have for violating an expectation, I think that's another area where what I would just say is we still have a lot of work to do. Because insofar as what we're trying to do is communicate things well, there are solutions, and one of the solutions that we're working on right now, basically, is the ability to have multiple sets of expectations, like I was describing earlier, on the same data set that reflect different cases for different points in that precision and recall trade off space. And then I guess there's kind of another area, which is the overall coverage, or when you want to use Great Expectations in the context of sort of acceptance testing for a final product. I think it's always good to make sure you have some point in the process where a human is in the loop for a high stakes decision. And so I would be wary of somebody saying, well, we've got all of our expectations encoded. It's, you know, good to have some meta process around the use of expectations in a pipeline that allows you to check that and really assess the coverage that you have.
Tobias Macey
0:25:38
Yeah, that's definitely an interesting thing to think about, because people are used to using unit tests in the context of a regular software application, where you can theoretically think about there being some sense of completion, of everything in this application having been tested in some fashion. I'm wondering how that manifests in the context of data pipelines, and how you identify areas of missing coverage or think about the types of tests that you need to be aware of and that need to be added, particularly for an existing pipeline that you're trying to retrofit this onto.
James Campbell
0:26:11
I think the issue of coverage in general is just fascinating in the context of data pipelines, because it really opens such a broad question of what it even means to have asked everything that you can about a data set or a pipeline and a set of code. So I think what Great Expectations is sort of helping you do is complement the process of one kind of testing, where you're asking have I been able to anticipate everything that I see, with another kind of testing, where you're asking if what you see is like the kinds of things that you anticipated. Then what that means, I think, is that it really comes back to that question of fit for purpose. So for example, if you are going to eventually be using a data set that you're creating for a machine learning model, you know, there are a variety of features with different levels of importance to that model, and how well each of those features over the space of your classification reflects the kinds of things that you saw in the training data set, or how many elements you see in clusters that you observed in the context of an unsupervised learning problem, are good ways for you to diagnose that there may be a problem. But they're really just part of the overall analytic process. And maybe the right way to spin this really positively is that, just like what we were saying at the very beginning about some of the more basic tests, you know, are these values the ones that the data provider said they were going to be providing, what we're really allowing is for you to have a robust conversation and process of exploring and understanding and getting new insights out of data.
Tobias Macey
0:27:47
And now I'm wondering if we can dig a bit deeper into how Great Expectations itself is actually implemented, and particularly some of the evolution of the code base, moving from your initial implementation focusing on integrating with pandas to where it is now, and how you're able to maintain the declarative aspects of the tests and proxy that across the different execution contexts that you're able to run against.
James Campbell
0:28:14
Yeah, this is a really fun question. This is what I've been spending a lot of time thinking about right now, and actually, we're in the process of launching another major refactor to the way that this works. The basic idea for us is that expectations are named, and that's what allows us to have the declarative syntax. So there's this concept of the human understandable meaning of what it is that you expect, and as you recall, for these expectations we always use these very long, very explicit names, like expect column values to be in set, and we then have a layer of translation that is available per expectation and per back end. One of the key changes that's happened is we've introduced an intermediate layer called metrics, which allows expectations to be defined in terms of the metrics that they rely on, and then the implementations can translate the process of generating that metric into the language of the particular back end that they're working on. So for example, if you have an expectation around a column minimum, then rather than actually translating that directly into a set of comparisons, what we're doing is asking the underlying data source for the minimum value and then comparing that to what you expected. There's, of course, a little bit of magic around ensuring that that works and scales appropriately on different back ends, and that we can bring back the appropriate information for different kinds of expectations, some of which look like what I was just describing with the minimum, in an aggregate way, and some of which look across rows of data one at a time. But that's been a really important part of the evolution. And then where we're going next is to have expectations be able to encompass even a little bit more logic, so that they will also, in the same part of the code base, contain the translation into the full verbose, locale specific documentation version of an expectation. So the name expect column values to be less than is fixed, and that gets translated in one way to the back end implementation, and in the other way it can get translated into English and French and German, and into different versions depending on the parameters, like it usually should be more than 80% of the time, or whatever additional parameters are stated. All of that is coming together in the code base to make it much easier for people to extend.
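As an illustration of the metrics idea James describes, here is a small sketch of the pattern: a named, declarative expectation defined purely in terms of a metric, with backend-specific providers that know how to compute that metric. This is not the actual Great Expectations internals, just the shape of the design.

```python
# Illustrative sketch of the "metrics layer" pattern; not the real library code.

class PandasMetrics:
    """Computes metrics directly against an in-memory pandas DataFrame."""
    def __init__(self, df):
        self.df = df

    def column_min(self, column):
        return self.df[column].min()


class SqlMetrics:
    """Computes the same metrics by pushing the work down to a SQL backend."""
    def __init__(self, connection, table):
        self.connection = connection
        self.table = table

    def column_min(self, column):
        row = self.connection.execute(
            f"SELECT MIN({column}) FROM {self.table}"
        ).fetchone()
        return row[0]


def expect_column_min_to_be_between(metrics, column, min_value, max_value):
    """A named expectation defined only in terms of a metric, so it does not
    care which backend produced the observed value."""
    observed = metrics.column_min(column)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }
```

Because the expectation only consumes the metric, the same declarative statement can be evaluated against pandas, SQL, or Spark by swapping the metrics provider, which is the portability James describes.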
Tobias Macey
0:30:41
And on the concept of extensions, I know that one of the things in a recent blog post that you released is adding a plugin for building automatic data dictionaries. But then there's also the extensibility of it in terms of the integrations that are built for it, particularly for projects such as Dagster or Airflow, or the different contexts that Great Expectations is being used within, or how it can be implemented as a library. So I'm wondering if you can talk about some of the interfaces that you have available for both extending it via plugins as well as integrating it with other frameworks.
James Campbell
0:31:16
Yeah, great. That's a really fun area too. So a lot of that resides in the concept of the data context that I mentioned at the beginning of our conversation. The data context makes it really easy to have a configuration, it's a YAML based configuration, where you can essentially plug different components together. So for example, you can plug in a new data source that knows how to register with Airflow, or that knows how to read from a particular database that you have or an S3 bucket that you maintain. And so there's this composition element that the data context provides, and then, in addition to that, each of the core components of GE are designed to be really friendly for subclassing, and the data context allows you to dynamically import your extensions of anything. So the data dictionary approach, for example, was effectively a custom renderer. It took the way that we were building the documentation pages and just modified it a little bit, so it's still using the same underlying Great Expectations artifacts, the validation results and expectation suites, but it's combining the data in a different way and building some new elements and putting them on top. What we're really hoping is that as the community gets more and more engaged with Great Expectations, we can start to develop effectively a gallery, a location where people can post the extensions that they're building and begin to share, whether it's a new page or artifact, or a new data connector, or something else like an actual expectation suite, so a domain specific set of knowledge that they have around, for example, a public data set that they've been working with: what are some of the expectations that were used for their project. And even those can be shared and hosted in the gallery.
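For a sense of what that composition looks like from user code, here is a rough sketch of driving a validation through a data context. It is modeled on the 0.9-era API; the datasource name, batch kwargs, suite name, and operator name are all assumptions, and the exact method signatures have shifted between releases, so treat this as an outline rather than a reference.

```python
from great_expectations.data_context import DataContext

# Load the project's YAML-configured data context; data sources, stores,
# validation operators, and data docs sites are all declared in that config.
context = DataContext()

# Ask a configured datasource for a batch of data tied to a named suite.
# The batch_kwargs keys here ("datasource", "table") are illustrative.
batch = context.get_batch(
    {"datasource": "warehouse", "table": "orders"},
    expectation_suite_name="orders.warning",
)

# Validate the batch; the configured operator can also store results,
# update data docs, and fire notifications.
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)

# Rebuild the static documentation site from the latest artifacts.
context.build_data_docs()
```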
Tobias Macey
0:33:03
Yeah, I definitely think that the extensibility, particularly for integrating with some of these workflow management systems, is highly valuable for how Great Expectations is positioned. I remember talking to Nick Schrock about the Dagster project, and I was seeing that there's an integration with that, and then also things like Airflow or Kedro. And so the fact that Great Expectations as a standalone application works with SQLAlchemy and pandas and Spark is definitely valuable, but I imagine that by being able to be plugged into these workflow engines, it actually expands the overall reach beyond just those data sources, because of the fact that it's operating within the context of what that framework already knows about as far as the data sets and the information that it has in the context of processing that data. So I'm curious, what are some of the ways, beyond the surface level of Great Expectations as its own application, that those integrations expand its overall reach and some of the potential that it has?
James Campbell
0:34:06
Well, I think the key thing that it does is it really opens up the idea that Great Expectations can become a node itself in an overall data processing pipeline. So rather than, you know, I run Great Expectations, then I run my pipeline, then maybe I run Great Expectations again, you have an orchestrated pipeline that can have dependencies, that can have conditional flow, and Great Expectations can really play happily in that ecosystem. It can signal back to Airflow that a job has successfully produced an artifact, it can automatically build and put documentation, for example, on a public site, and fire off a Slack notification that says that this validation happened and it was successful, and then immediately work together with Airflow to continue processing or delivering data to other teams. So it's that ability to have Great Expectations be a part of this broader toolkit available to data engineers that I think is where we're going with these plugins. So you get the advantage of the declarative and human understandable expectations, also of the, you know, critical, machine verifiable expectations, and then it becomes a part of your overall deployed pipeline.
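As an illustration of that "validation as a node" idea, here is a hedged sketch of an Airflow 1.x task that re-checks a freshly landed batch and fails the run if its expectations are violated, so downstream tasks never see bad data. The DAG, task name, file path, columns, and thresholds are all hypothetical, and a real deployment would more likely load a stored expectation suite or use a data context rather than declaring expectations inline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import great_expectations as ge


def validate_orders(**_):
    # Re-check the freshly landed batch against its declared expectations.
    batch = ge.read_csv("/data/landing/orders.csv")  # hypothetical path
    batch.expect_column_values_to_not_be_null("order_id")
    batch.expect_column_values_to_be_between("order_total", min_value=0, max_value=10000)
    result = batch.validate()
    if not result["success"]:
        # Failing the task halts downstream processing and surfaces the issue.
        raise ValueError("orders batch violated its expectations")


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    validate = PythonOperator(task_id="validate_orders", python_callable=validate_orders)
    # ingest >> validate >> transform would wire this node into the larger DAG
```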
Tobias Macey
0:35:24
And I know that we've spoken a bit about some of the interesting or innovative or unexpected ways that Great Expectations is being used within these different contexts of communication and execution. But I'm wondering if there are any other areas or any other interesting examples that we didn't touch on that you think are worth calling out?
James Campbell
0:35:42
I think the most interesting examples for me are about the different domains where Great Expectations is being used. You know, frankly, I think we've really covered a lot of the ways that it gets used, and it's more that people are using Great Expectations on all kinds of teams. Actually, let me revise that, I'll give you another specific one that I thought was fun. We had a team that came into the Great Expectations Slack recently and talked about how they were using GE in a way that I thought was really another fun one. They have a centralized data science team, because, you know, it's a scarce resource to be able to ask for a specific deep dive analysis, and a lot of distributed business units, which all have similar but not the same data. And so the data science team was spending a lot of time translating very similar analyses that all these different business units had, you know, geographically different, but from the business perspective they were doing very similar things. And so they needed to find a way to reduce the time that they were spending doing translation for each of these different business units, and Great Expectations is really a very neat fit there, because it allows this one centralized team to effectively talk to a lot of different people. So it's not about change, it's about a lot of different teams that are doing similar things with slightly different data, and being able to really quickly articulate what it is that they need to see and how that compares to what they're getting from each of those teams.
Tobias Macey
0:37:13
And we've already talked a bit about some of the limitations of Great Expectations, of where there are areas in a pipeline that just need to have some sort of manual intervention or human input. But I'm wondering what are some of the other areas where Great Expectations may not necessarily be the correct choice, and some of the ways to think about how best to leverage Great Expectations or where it fits into your workflow.
James Campbell
0:37:36
So I think that we talked a bit about anomaly detection, and, you know, the case where you have very dynamic data that is kind of changing or evolving over time, and that's okay, but effectively you're doing some sort of change point detection; that's an area where I think Great Expectations doesn't have a lot of support yet. Another area, one where I think we will have a lot of support but it's just not built out yet, and that comes up a lot, is multi-batch expectations. So expectations that are basically using metrics or statistics that are knowable only when you look at a whole bunch of different batches of data. I'd turn that question, really, into maybe a more forward looking positive thing of saying there's a lot that we're doing in that space right now. It's really fun to get to think about those kinds of multi-batch problems or even change point problems, and so we've got some design work that we've been doing recently that can help with both of those. For now, though, I think those are areas where I would certainly keep a fairly high touch analytic process involved in processes or data flows that have those characteristics.
Tobias Macey
0:38:44
And as you look forward to the next steps that you have on the roadmap for Great Expectations and some of the overall potential that exists for the project, I'm wondering if you can just talk about what you have in store for the future and some of the ways that people can get involved and help out on that mission?
James Campbell
0:39:01
Absolutely. Well, we talked a little bit about this gallery idea, the idea that there will be progressively more information about how to solve particular problems and how to integrate with different, you know, additional back ends or pipeline running systems. So I see a lot of the work that we're moving towards focused on really unlocking, or accelerating, a flywheel of community involvement, so that more and more people are able to help contribute to driving the project. To get there, one of the things that we're going to be focusing on, and where I think people would be really welcome to continue to engage a lot, is basically a quality and integration testing program, where, you know, as more and more different kinds of data and data systems are plugged in with Great Expectations, it's really useful. I mean, I think about this all the time with SQL, where there's this kind of expected homogeneity of translation, but then that runs into the reality of a variety of different back ends and syntax and so forth. So we're doing a lot in order to ensure that when you state an expectation, you're going to get exactly what it is that you stated. On the other hand, we're doing a lot of work just in feature iteration, improving the expressivity of expectations, making it possible to have expectations about time that are first class, and adding some features around, you know, support for what I call teams with skin in the game. You know, things where you know you're going to get a pager call, and that was dating me, or you know you're going to get a Slack message if something goes wrong. And so, you know, that's alerts and notifications of different kinds, ensuring that there are additional actions available to validation operators and things like that. So that's the other kind of big area, just adding new features and support. Now eventually, what I'd love to see is that there are opportunities to make it just as easy as possible to get Great Expectations in your environment, you know, some sort of one-click spin up of my expectations store, validation store, and all that. And I think we'll get there too.
Tobias Macey
0:41:14
And one of the other things that I think is interesting to briefly touch on is the types of data that are usable with Great Expectations, where a lot of times people are going to be defaulting to things that are either in a SQL database, or textual or numeric data. But then there's also the potential for things like binary data, or images, or videos. And I'm wondering what are some of the ways that Great Expectations works well with those, or some of the limitations to think about as far as what types of data sets are viable for this overall approach to testing?
James Campbell
0:41:46
That's a great point, and frankly, I probably should have touched more on that on the limitations side. I think there are a couple of areas where it's challenging to use Great Expectations for those reasons. One is if you have non-tabular data. At this point, expectations really are oriented around columns and rows. Obviously, there are a lot of data sets that are, you know, denormalized in some way or another and have structured records or so forth, so that's an area where there's kind of some transformation, usually, that would need to happen before you run Great Expectations. To the point of binary data and images, there aren't actually any of our core expectations that natively speak about those kinds of data sets. Now, what you can do, and what we've seen done a little bit, is where Great Expectations effectively becomes a wrapper layer around a more complicated model. So you effectively have one model that is checking data characteristics that are going to be used in another one. And so in that context, it becomes possible to use Great Expectations with those kinds of data, but that's another area that's pretty challenging at this point.
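One way to read that "wrapper layer" idea is as a custom, column-map style expectation whose per-value check delegates to a separate checker or model. The sketch below follows the classic pattern of subclassing the pandas dataset and decorating a method; the class name, expectation name, and the looks_like_valid_image helper are hypothetical, and the decorator API should be checked against the version of the library you run.

```python
from great_expectations.dataset import PandasDataset, MetaPandasDataset


def looks_like_valid_image(path):
    # Hypothetical stand-in for the "more complicated model": here it only
    # checks for a PNG signature, but it could just as well call a classifier.
    try:
        with open(path, "rb") as f:
            return f.read(8).startswith(b"\x89PNG")
    except OSError:
        return False


class ImageAwareDataset(PandasDataset):
    _data_asset_type = "ImageAwareDataset"

    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_point_to_valid_images(self, column):
        # `column` is a pandas Series of file paths; the decorator turns the
        # per-row boolean results into a standard expectation result object.
        return column.map(looks_like_valid_image)
```

Once a batch is loaded using that dataset class, the new expectation behaves like any built-in one: it can live in a suite, appear in data docs, and fail validation when the underlying checker rejects too many rows.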
Tobias Macey
0:42:57
And are there any other aspects of the Great Expectations project, or ways that people should be thinking about testing their overall data pipelines and data quality, or the effectiveness of the communication around how data is being used, that we didn't discuss yet that you'd like to cover before we close out the show?
James Campbell
0:43:14
I think the only other thing I would touch on is just to reflect a little bit on how much time we spend communicating about data and all the ways that that happens, and thinking about how that could be made more efficient. You know, I think there are two big sources of inefficiency in that process of understanding and communicating data. One of them is what I call round trips to domain experts, or time where you're going back and forth, or sitting down in meetings and having to bring a whole bunch of people together collaboratively. And the other one is what I call the cutting room floor, or time where there is some understanding of a data set that you've built up, but it dies because it doesn't get written down or it doesn't get acted on. And so I think Great Expectations is really good for helping to address those things.
Tobias Macey
0:44:02
All right. Well, for anybody who wants to follow along with the work that you're doing, or get involved in the Great Expectations project, or just get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
James Campbell
0:44:20
Well, despite all the work that we're doing, and that a lot of other people are doing, in data quality, I still think the biggest gap is in bridging the world of the data collection and processing systems, so the computing, with the world of the people. I mean, at the end of the day, I think it's all about what we as humans understand, and as decision makers, how we see the world, and that's what we're trying to do in this whole field. And so I still think having people understand how models work, you know, explainability, and having machines be able to understand intent, are things that are just going to take a huge amount of work in a variety of different fields. And there's of course no silver bullet on that. But I think that's going to be the big project area that I'm excited to get to continue working in and contributing toward.
Tobias Macey
0:45:13
All right. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Great Expectations. It's definitely great to see that it has continued to progress and is growing in terms of the mindshare of what people are using for testing their data, and the fact that that's something that needs to happen. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day.
James Campbell
0:45:34
Thank you so much, Tobias. It's a pleasure to get to talk to you and to get to be back on your show.
Tobias Macey
0:45:44
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Liked it? Take a second to support the Data Engineering Podcast on Patreon!