Summary
The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust, it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality, starting with the source, rather than being reactive and having to work backwards from the point where a problem is found.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Michael Harper about definitions of data quality and where to define and enforce it in the data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- What is your definition for the term "data quality" and what are the implied goals that it embodies?
- What are some ways that different stakeholders and participants in the data lifecycle might disagree about the definitions and manifestations of data quality?
- The market for "data quality tools" has been growing and gaining attention recently. How would you categorize the different approaches taken by open source and commercial options in the ecosystem?
- What are the tradeoffs that you see in each approach? (e.g. data warehouse as a chokepoint vs quality checks on extract)
- What are the difficulties that engineers and stakeholders encounter when identifying and defining information that is necessary to identify issues in their workflows?
- Can you describe some examples of adding data quality checks to the beginning stages of a data workflow and the kinds of issues that can be identified?
- What are some ways that quality and observability metrics can be aggregated across multiple pipeline stages to identify more complex issues?
- In application observability the metrics across multiple processes are often associated with a given service. What is the equivalent concept in data platform observability?
- In your work at Databand what are some of the ways that your ideas and assumptions around data quality have been challenged or changed?
- What are the most interesting, innovative, or unexpected ways that you have seen Databand used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Databand?
- When is Databand the wrong choice?
- What do you have planned for the future of Databand?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Databand
- Clean Architecture (affiliate link)
- Great Expectations
- Deequ
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Michael Harper about definitions of data quality and where to define and enforce it in
[00:02:05] Unknown:
the data platform. So, Michael, can you start by introducing yourself? Absolutely. Hey, Tobias. Thanks for having me on. My name is Michael Harper. Feel free to call me Harper, though. There's simply too many Michaels in the world. It's much easier for me. I promise you. I'm the data solution architect at databand.ai, where we're building out a data pipeline observability platform. Really enjoying tackling the new data observability problem. And do you remember how you first got involved in data management? Totally out of the blue kinda came up as an opportunity for me. Data management is actually my second career. In a previous life, I worked as an operations manager at a transportation management center, but I didn't feel like I had much mobility there. So I was like, okay. How do we find a way to see more of the country and explore ideas that are around there? And so for myself, like, my education is in accounting and mathematics and economics, and I hopped around between these things, but none of those subject areas ever really fit. I never could see myself being an accountant at 40. I'm probably not the only one that feels that way. And I had a friend of mine who was in business intelligence for a long time and said, hey. You know what, man? You'd be really good at this. You should just go learn SQL. So I was like, oh, okay. Let's give it a go.
And I got an opportunity to work in customer support for a small software company called Pragmatic Works in Jacksonville, Florida, and I just kinda dove in and started teaching myself. So, yes, for all the listener questions out there, I am that polarizing self-taught dev on your team. I promise I listen to your suggestions, and we do the good pull requests. But it's been a lot of fun since then. I kinda came up through, like, the data modeling, data warehousing, ETL, and then moved on to Cvent where I worked as a data quality engineer, kinda got exposure to enterprise architecture, distributed systems, then kinda dug into, like, the Python and object-oriented world, got involved with orchestration and big data, like, learning those things and consuming all that information that I could. I worked in consulting for a little while. I had some fun with a natural language processing application, but, ultimately, I missed the product-driven experience. And so that's how I ended up at Databand.
I still, to this day, laugh about the fact that the first time Josh, our CEO, and I talked, and he told me about the product, the first thing I said to him was just, this is the product I've been looking for for the last 5 years. As a data engineer, I wanted to be able to track the things that are going on there. But we'll get to that in a little bit. That's kind of the long and short of how I got into data engineering. And I would just like to note the irony that you felt the lack of mobility while working at a transportation company. You know, I'm never gonna forget that now. I don't think anyone's ever pointed that out to me. I appreciate that.
[00:04:33] Unknown:
And so bringing us to the topic at hand, the overall idea of data quality is something that has been, you know, paramount and important ever since we've started working with data. But in recent years, it has taken on sort of a new life with the number of different companies and projects that have been built out to be able to sort of support and reveal these data quality issues that have been plaguing us as engineers forever. And I'm wondering if you can just start by giving your definition of the term data quality, particularly in your experience as an engineer and somebody working at a company with a product aimed at this particular problem space, and some of the implied goals that are embodied in that definition?
[00:05:16] Unknown:
You know, I love this question because there's no way to have a right answer. So I can just wax poetic all day long. Because, I mean, there's hundreds of articles that have been written about the thousands of dimensions, and there's probably gonna be a hundred more about those different dimensions that you can measure data quality on. I always talk about data quality as being something that you know it when you see it. And it's always difficult to abstract that into a general concept because at the end of the day, data quality requires context. That context that you understand about your domain that is going to say whether your data is meeting your needs or not, that context is what informs those goals that you talk about. So for me, data quality really comes down to a measure of fitness. You know, given a particular use case or business requirement, how well does the data that I'm bringing into my system meet my expectations for that data in that use case? You know? And it's kinda funny because, usually, it's measured against a source of truth. And whether that's defined by the engineering team or your stakeholders or your product owners, if you really think about it, it really opens the door for, like, this referential loop, right, where you have the domain knowledge. You talk to people about the business requirements. You say, okay. I expect my data to look this way. Okay. Well, then let's create this expectation of the source of truth. We'll measure our data against this. And then as that data comes back in, that feedback loop kinda feeds back into the business cycle. And then, ultimately, a data quality framework requires a certain degree of intentionality.
If you're not careful, then you'll fall into this kind of referential loop where you aren't actually improving your data quality. You're just reaffirming your assumptions that you had when you started building that data quality framework. I usually like to talk about this in terms of, like, heuristics. Like, how do I measure the degree of success that I've achieved with this data quality framework? And, ultimately, if no one's getting paged and no one's getting notified about the alerts that you have on your data quality because there's no issues, then you know that you have achieved, like, a good data quality framework. I'm a sports geek at heart, so data quality is kinda like the offensive lineman of a football team. You know that they're doing a good job when you don't hear their name called. Right? So, hopefully, you never hear about your data quality framework because you've already resolved all your issues as the data comes into your system. The issue with data quality is that, as you said, it's sort of in the eye of the beholder where each of the different stakeholders within an organization might have different
[00:07:35] Unknown:
sort of metrics or ideas of what quality data looks like or ways to identify when data is wrong or sort of what the desired state happens to be. And I'm wondering what your experience has been as far as different ways that various stakeholders and participants in the data life cycle might disagree about the definitions and manifestations of that data quality as it goes, you know, throughout the various paths through the data platform?
[00:08:04] Unknown:
Yeah. This is definitely a conversation you wanna make sure you have a nice big lunch to talk over because you'll always get a different opinion when it comes to the engineering side of the house versus the business or stakeholder or product owner side of the house. One thing I usually see is engineers look at their data, and they want to understand if it is meeting the correct structure of the data that they expect. And what I mean by that is, does it have the number of columns that they're expecting? Are those columns of the correct data type? Are they meeting the certain values that they're expecting to be there? And things of that nature. Those are very quantitative characteristics of it. Whereas if the business is looking at it, they're usually wanting to understand, like, the qualitative nature and, okay, is this data coming in actually describing the entity that I expected it to describe? Do these order details that are on this order sheet actually make sense for the order for this particular entity or this particular customer or store that's placing it? So you've got the qualitative versus quantitative conversation.
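To make that quantitative side concrete, here is a minimal pandas sketch of the structural checks an engineering team might encode; the column names, data types, allowed values, and ranges are illustrative assumptions rather than anything specific to the tools discussed in this episode.

```python
import pandas as pd

# The kinds of quantitative expectations an engineering team might encode for an
# incoming orders extract. Column names, dtypes, and allowed values are illustrative.
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "store_id": "int64",
    "status": "object",
    "discount_pct": "float64",
}
ALLOWED_STATUSES = {"placed", "shipped", "cancelled"}

def check_structure(df: pd.DataFrame) -> list[str]:
    """Return human-readable quantitative failures; an empty list means the batch passes."""
    failures = []

    # Does the extract have exactly the columns we expect, in the expected order?
    if list(df.columns) != list(EXPECTED_SCHEMA):
        failures.append(f"unexpected columns: {list(df.columns)}")

    # Are the columns the data types we expect?
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # Do values fall inside the ranges and sets we expect?
    if "discount_pct" in df.columns and not df["discount_pct"].between(0, 100).all():
        failures.append("discount_pct has values outside [0, 100]")
    if "status" in df.columns and not set(df["status"].dropna()).issubset(ALLOWED_STATUSES):
        failures.append("status contains unexpected values")

    return failures
```

Returning the full list of failures rather than raising on the first one makes it easy to report every violation in a batch before deciding whether to halt the pipeline.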
Closely behind that is usually, like, correctness versus accuracy. You know, engineers are really wanting to ensure that the data is correct. So, again, meeting their expectations of, is this numeric value between 0 and 100? Like, are these floats to the correct precision point? Is it correct in that sense, where the business is gonna look at it and say, okay. How accurate is this data? Like, is this sale line item actually recorded in US dollars, or does this actually need to be converted from British pounds? And so you have this competing dichotomy where neither is wrong, but there's different ways of crossing the threshold of being high quality data depending on if you're talking to the engineering side or the business side. And as we're talking about these sort of tensions
[00:09:46] Unknown:
and varying viewpoints of data quality and how to address it and how to identify it, it brings to mind the same principles as concept drift in machine learning contexts where as you're training the model, you say, you know, this is my data that I have. These are the outputs that I'm expecting. Okay. Everything looks great. Now I'm going to put this into production, but now the data that I'm actually feeding into this inference is different than my training data. So in the data quality space, it's, you know, I have these expectations. This is the data that I'm looking at to identify. These are the checks that I'm going to create. You know, these are the boundary thresholds and error conditions that I'm going to identify and code into my pipeline. But now as time goes by and as our data sources change, schemas evolve, requirements around reports and accuracy are, you know, either changed or, you know, maybe we're adding a new product line or whatever. All of these things contribute to the concept drift of your data quality definitions and how you want to enforce them and alert upon them. And I'm wondering what you've seen as far as the ways that that might manifest as people are initially adding quality controls into their data pipelines and then being able to maintain them over time as their business evolves?
[00:11:00] Unknown:
I think it's really easy to become complacent with a data quality framework whenever it isn't detecting errors. Because as a data engineer, you want to make sure that you're delivering and meeting all of your data SLAs. And so if you have the correct data according to the data quality framework that you've laid out, then it takes, again, that level of intentionality to evaluate whether your previous assumptions still meet the current state of the business and finding a way to adapt over time. And as you mentioned in the idea of concept drift inside of a machine learning model, you have the same idea of concept drift that exists inside of data quality in the sense that do we need to focus or optimize around those qualitative data quality facts, or do we need to focus on those quantitative facts? And what's gonna help us ensure that we're delivering data to our consumers in the format that they expect it to be. So as you mentioned, you have a business and you bring on a new product line. That product line may be dependent on a preexisting product, and that product line could feed information back into the preexisting product.
And when you introduce this new input into your system and that may become part of aggregates further down the data management life cycle, how do you account for that information? As that feedback loop becomes part of the data management life cycle for your first product line, how do you ensure that you're accounting for the changes in the shape of your data as it moves through your system?
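One lightweight way to keep those expectations from fossilizing as the business changes is to profile every batch and compare it against a baseline that is itself recomputed on a schedule. A sketch, where the tracked statistics and the 10% tolerance are assumptions to be tuned per dataset:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Summary statistics worth tracking per batch (the choice of stats is illustrative)."""
    return {
        "row_count": len(df),
        "null_fraction": df.isna().mean().to_dict(),
        "numeric_means": df.select_dtypes("number").mean().to_dict(),
    }

def drift_report(baseline: dict, current: dict, tolerance: float = 0.10) -> list[str]:
    """Flag metrics that moved more than `tolerance` relative to the baseline."""
    findings = []
    if baseline["row_count"] and abs(current["row_count"] - baseline["row_count"]) / baseline["row_count"] > tolerance:
        findings.append("row count drifted more than the tolerance from the baseline")
    for col, base_nf in baseline["null_fraction"].items():
        if current["null_fraction"].get(col, 0.0) - base_nf > tolerance:
            findings.append(f"{col}: null fraction rose sharply")
    for col, base_mean in baseline["numeric_means"].items():
        cur_mean = current["numeric_means"].get(col)
        if cur_mean is not None and base_mean and abs(cur_mean - base_mean) / abs(base_mean) > tolerance:
            findings.append(f"{col}: mean moved from {base_mean:.2f} to {cur_mean:.2f}")
    return findings

# The baseline itself should be recomputed on a schedule (for example over a trailing
# month of batches) so the checks evolve with the business instead of simply
# reaffirming the assumptions made when the framework was first built.
```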
[00:12:38] Unknown:
Another interesting aspect of the current point in time is that the market for data quality tools has been gaining momentum and attention because of the number of different companies and open source projects in the space. And I'm wondering if you can categorize the different approaches that are taken by these various players in the ecosystem and some of the trade offs that they're each taking on in the way that they approach this problem, whether they're focusing on the data warehouse as the choke point, and so that's where we're going to focus all of our efforts, or being a framework that operates at various points within the life cycle of the data, or something that needs to be able to collect and aggregate information across the entire lifespan.
[00:13:21] Unknown:
I think the thing that's interesting here is how different data quality tools decide at what granularity they want to apply the data quality checks that they're going to provide. And what I mean by that is we can take a look at the open source space, and you see products like Great Expectations come out that provide you with this framework to not only do column-level profiling and understand the general shape of your data and how the summary statistics of that data set are gonna change over time, but also provide you with the ability to unit test your data and do that row validation. So they're focusing in on that row-level granularity for data quality checks, but also providing enough of a flexible abstraction that their tool can be used across various platforms. And that's one thing that I think really is a distinguishing factor between the open source data quality tools and the commercial tools out there: open source really focuses on ensuring they have the flexibility to meet a ton of different use cases for various data teams that may be looking for a pre-structured data quality framework that allows them to insert their own business requirements and error handling and data quality checks. And then they also focus on the integration side of the house too. Right? Like, open source tools wanna make sure that they are staying up with the most current trends in the data management landscape. And, you know, we can connect to Snowflake and pull that information automatically, or you can run your expectation suite on any of the commonly used cloud platforms.
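As a minimal sketch of the column-level profiling and row validation style described here, using the legacy pandas-flavored Great Expectations API (newer releases organize the same checks into expectation suites and checkpoints); the toy DataFrame and the specific expectations are illustrative:

```python
import pandas as pd
import great_expectations as ge

# A toy extract standing in for data you just pulled from a source system.
raw_orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "store_id": [10, 10, 11],
     "status": ["placed", "shipped", "placed"], "discount_pct": [0.0, 12.5, 99.0]}
)

# Wrap the frame with the legacy pandas-style API.
batch = ge.from_pandas(raw_orders)

# Profiling-style expectations on the overall shape of the data.
batch.expect_table_column_count_to_equal(4)
batch.expect_column_values_to_not_be_null("order_id")

# Row-level "unit test" on individual values.
batch.expect_column_values_to_be_between("discount_pct", min_value=0, max_value=100)

results = batch.validate()
print(results)  # includes an overall "success" flag plus per-expectation details
```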
And I think that's another distinct part of the open source focus when it comes to data quality tools: they really focus on being cloud native first. They wanna make sure that they have the ability to both build and run on scalable, modern infrastructure, and that's not going to be a limiting factor in the evaluation process for data teams that are determining what the best path forward is for these different tools. And I think the other thing that I really like about the open source tools that are out there is they tend to be less prescriptive due to that flexible abstraction that I talked about earlier, and this allows for creativity. Right? And I think that if any engineer was honest with themselves, part of the thing that draws us to solving these problems is our ability to get in and be creative about the way that we want to address the issue at hand. Sometimes it's fun to go back and evaluate a problem that I solved two or three months ago and say, well, if I apply this little tweak here, then how does it affect everything downstream? And suddenly I have a code base that's 10% faster than it was before just because I decided to play with it a little bit. Right? That's really the open source side. But when it comes to the commercial products, you seem to see less of that flexibility. They're really focused on their ecosystem, like whether that's your master data management or whether that's your Informatica tools. They want to make sure that you are going to be working within the same ecosystem and that they're going to support the use cases for their clients in that ecosystem. So they tend to be a little bit more, I hate that I got caught using this term earlier, so I am nervous to even say it. Don't ask me to define low code again. That's for you, Nick. But, no, the commercial products tend to be a little more, like, low code, no code, drag and drop, UI based. They want the user experience to be as easy as possible. And along with that comes a more prescriptive approach. And so you end up with these commercial data quality tools that have a lot of really great out of the box use cases that can be applied to what the company has seen their clients struggle with when it comes to data quality, and they use the feedback from their clients to inform the way they build out these data quality frameworks. But then you lose some of that flexible abstraction. You lose the ability to be creative in your solution. And I think that kind of underpins the reason that we see such a vibrant open source community, not only in the data quality tool space, but across the entire software ecosystem as a whole. Another interesting aspect of the sort of variation in approaches is the idea of being
[00:17:09] Unknown:
very sort of software focused and engineering heavy on explicitly defining the expectations and quality controls that you want to enforce or identify or collect data on in your data pipeline and your data life cycle, versus the very automated approach that you were describing a lot of the commercial players are focusing on: I'm just going to connect up to your data warehouse. I'm going to identify the baseline and then alert you on any anomalous occurrences or anomalous outliers in your data so that you can, you know, not have to think about what your business requirements are. I'm just going to say, this doesn't look right. And so then you have to, you know, either say, no. That's totally fine, or I'm going to do something about it. And I'm wondering what your experience has been, both as a consumer and builder of data quality controls and as somebody who's working at Databand, which is one of the vendors offering a more sort of explicit approach towards these data quality checks, what you see as some of the potential pitfalls in either of those situations where, on the one hand, you're putting a lot of the power but also responsibility in the hands of the data engineers and the data platform owners. And on the other hand, you are saying, you don't have to worry about it. We'll take care of it all for you. But at the same time, you're kind of sacrificing a lot of the potential control over what you actually care about.
[00:18:29] Unknown:
Quick note on when you talk about automation and, like, that being the focus of these commercial products, it feels weird to me, I guess. And, like, this is no reflection of the quality of these products. Like, every data quality tool out there does a really good job of addressing the needs of their target audience. But by having these products that say, I'm just gonna connect to your data warehouse. I'm gonna pull in that data. I'm not gonna concern myself with your particular use case. I'm just gonna see, oh, this kinda looks weird because 10 weeks ago, you know, the mean was 10% lower. Right? Like, useful information to have, but because it does it without you thinking about it, it kinda creates this place of complacency. Right? Like, it scares me to use these tools because that complacency is what I kinda talked about earlier, and that you brought up when you were talking about concept drift in machine learning models. Right? Like, if you have these tools that are automating the process of doing data quality, it's awesome, and it's easier for you to get working with it. And that's absolutely what people need just to get started with data quality. But how do you ensure that you don't fall into that complacency of the product just taking care of it for you? Right? So quick note on that thought. I just kept nodding my head while you were talking.
So coming back to what you're talking about on, like, kind of the pitfalls of the different approaches and how you decide where to focus your greatest effort on building out a data quality framework. And, you know, I think the biggest thing here that I've seen while working at Databand is there's no winner in this debate. Right? Like, there's no way to say that, like, focusing on this approach versus that approach is going to deliver more value to you. It just comes down to whether that approach makes the most sense for your use case. If you're an analytics engineering team and you're focusing in on making sure that the dashboards that you are providing to your end users are actually meeting the requirements that they have, having something that is profiling and connecting to your data warehouse and looking at your data as it rests in your warehouse and ensuring that those summary statistics and the shape of your data continue to match your expectations there, like, that's gonna be super valuable for that analytics team. But if your main concern is bringing in data and ingesting data from a ton of external sources that you have no control over, you're gonna get more value out of a product like Databand, where we really have identified the value of shifting the focus of data quality to the left of this kind of pipeline cycle, focusing in on the ingestion side. And we say, okay. We have these external sources. Usually, they're APIs of some sort that we have no control over. Those schemas can change without us ever having any notice of it. And applying data quality checks to the data as soon as it comes into your system gives you the ability to be proactive as a data engineer or as a data platform team and identify when the data is not meeting your expectations as soon as it enters your system, allowing you to quickly get to a root cause and identify how to resolve it before it even reaches that analytics team. Right? And, hopefully, they have a good data quality framework set up that's profiling and looking at the warehouse. And, honestly, that framework never has to work very hard because you've already taken care of the, quote, unquote, bad data that's coming into your system by applying simple checks as soon as it comes from that external source.
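A bare-bones illustration of that shift-left idea is to validate the payload from an external source before it ever lands in the lake. The endpoint, field list, and thresholds below are hypothetical placeholders, and this is a generic sketch rather than a description of how Databand implements its checks:

```python
import requests

# Hypothetical endpoint and field list; substitute the external source you actually ingest.
SOURCE_URL = "https://api.example.com/v1/orders"
EXPECTED_FIELDS = {"order_id", "store_id", "status", "discount_pct"}

def ingest_batch() -> list[dict]:
    records = requests.get(SOURCE_URL, timeout=30).json()

    # Check 1: did the external schema change underneath us?
    seen_fields = set().union(*(r.keys() for r in records)) if records else set()
    if seen_fields != EXPECTED_FIELDS:
        raise ValueError(f"schema drift at the source: {seen_fields ^ EXPECTED_FIELDS}")

    # Check 2: is a column that should always be populated suddenly empty?
    null_fraction = sum(r.get("order_id") is None for r in records) / max(len(records), 1)
    if null_fraction > 0:
        raise ValueError(f"order_id is null in {null_fraction:.0%} of records")

    # Only now hand the batch to the rest of the pipeline / land it in the lake.
    return records
```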
So for us at Databand, like, what we've seen is the difference between being able to have, like, a proactive versus a reactive approach. I like being proactive and being able to quickly identify those issues. And I think that really aligns with the DevOps, like, observability culture that the data observability space is trying to replicate. Right? Like, the greatest way for you to succeed in implementing an observability framework and achieving observability on your data is starting by establishing that culture of saying, we need to identify signals, and these signals need to come in early. And we need to understand how we're going to react to these signals and make sure that we track those all the way from our SLOs up to our SLAs and make sure that we're meeting that service level objective and service level agreement.
So having that culture in place really allows for early detection and expedited resolution of any SLA violation that may come across.
[00:22:31] Unknown:
In terms of being proactive, as you put it, rather than reactive, where once the data is in the data warehouse, in some cases, it's already too late. You know, there are ways to mitigate that, but, you know, you've sacrificed the opportunity to be able to identify and correct these issues at the point of collection. And in that regard, I'm wondering what you see as some of the difficulties that engineers and stakeholders encounter when they're trying to sort of identify what the quality checks need to be at that stage because it might be before the aggregation point where you're able to say, okay. This is what I'm actually aiming for. So now I need to be able to work back from there to, you know, maybe 1 or 10 or 15 steps removed from that sort of end goal to be able to say, okay. This is what's coming out of the source. I know that this is what I want to be able to check for to make sure that this end result is what I ultimately want.
And then to the point of being able to apply those checks at the source, you know, in some cases, you don't have any control over that source if you're pulling from, you know, the Facebook API or the, you know, the Google Ads API. But if it's pulling from, you know, the database of an application that your team builds and maintains, you then maybe have some capacity to be able to push even further back to say to the application team, I want you to be able to provide me this API so that I can just query the application rather than, you know, digging into your database and trying to reverse engineer your schema to be able to get out the information that I want. And I'm wondering just sort of what the complexities are that come about as you're trying to take this proactive approach of identifying those errors at the point of collection and, you know, seeing how far to the left you can sort of push that control.
[00:24:20] Unknown:
I love the reverse ETL movement that's going on right now. But when you talk about reverse engineering your schema, I hope that does not become a thing. Right? Like, I don't look forward to the reverse schema migration tool that's gonna come out in the near future. But that's the concern that these teams have. Right? It's one thing if you have an external source that's changing without your control. But if it's a database that your team manages and owns, it doesn't mean you necessarily are maintaining every table or every schema that's in there. And so how do you ensure and how do you guarantee that the team next to you, when they modify the tables that actually feed your system, how do you ensure that that's being communicated? And so while we do see that there's a higher likelihood of having an unpublished change to the external systems, you still see these same issues that occur when you're talking about ingesting from internal sources as well. But I really like what you point out at the beginning of that statement, and I think it's a good thing to call out, is that it's not that taking the approach later in your pipeline means that you're not going to be able to resolve the issues, like monitoring the warehouse after the raw inputs have come through. It's kind of the trade off between that proactive and reactive approach, where it's more expensive at that point. Right? Like, it's more expensive to resolve that issue. It could take longer to find the root cause. It could take longer to actually apply the change. It could take longer to find a way to change your infrastructure in a way that prevents that error from occurring again. If we draw parallels to clean architecture and we talk about monitoring the data warehouse at the, like, analytics life cycle, that's similar to looking at the application layer of your clean architecture.
But if you recognize the issue in the warehouse and it turns out that it's actually been coming from your source system the whole time, well, then you're saying, okay. I have to change the entity that existed. Unless you've done a good job of decoupling everything and using proper, like, dependency inversion principles, you're gonna have a bad time. Right? Like, isn't that the meme that existed for so long out there? So I can talk about it all day as far as, like, the trade offs that exist between those two approaches. But coming back to your original question about the difficulties that are faced between engineers and stakeholders, I think the first difficulty that I would speak to is being able to actually identify root cause. Right? And I think that that's always going to be an issue. I mean, it's the whole reason we have data quality in the first place. Like, let's understand why this issue occurred.
And the further down the data management life cycle you are, the more possibilities exist for what that root cause could be. So inherently, like, just from probability, it's going to take longer. So, again, when we talked earlier about the difference of perspective on what data quality means to engineers versus stakeholders, the same thing exists for root cause. Right? And you can have different perspectives: from an engineering perspective, like, we need to fix our code, we need to fix our architecture, whereas the stakeholders would say, oh, no. We just need to fix the dashboard. We need to change the definition here. But, again, the further down the data management life cycle you are, the more possibilities there are to throw out different solutions. Right? Like, the more opportunities there are to throw duct tape on something.
Whereas if you're focusing earlier in the data ingestion cycle, it's easier to understand if the errors that you're reporting are a symptomatic issue or a holistic issue. Right? Like, if you're monitoring your warehouse and you see that a column suddenly has an influx of null values, like, is that a symptom of poor business logic? Is that a symptom of access? Is that a symptom of a schema change that has been issued to you? Whereas if you are looking at your sources as soon as they come in and you notice that a certain column is 100% null, well, now you know it's a holistic issue. Right? Like, you know right at the cause. Like, okay. It's coming from my root system, my external source. I need to prevent this from coming into my system or at least handle this in a way that doesn't throw off the rest of my data pipeline.
And those are two things that kinda come up for me. Like, the root cause, really understanding what the errors are. The other difficulty that you have is how to identify whether the issue exists within your data or whether it exists within your code or whether it exists within your infrastructure. You know, did your orchestration system not run the backfills that you thought it was supposed to run over the weekend because your server had gone down? And is that why you have data missing from your dashboard now? Or is it because there was a code change that got pushed in the middle of last week that tweaked, you know, one WHERE statement in the SQL query. And so now you're only bringing back 10% of the data that you were bringing back before.
Or at the end of the day, did that change in the SQL query not actually matter? And it just turns out that the source system that was bringing your data into the warehouse is not actually providing you with the same volume of data that you were seeing before. So this kinda ties back to the idea of identifying that root cause and being able to say there's less options when you focus in earlier in the data management life cycle, focusing on the ingestion piece as, like, raw data comes into your data lake. That's really gonna help you identify those issues more easily. In our experience at Databand, we've seen that that helps you quickly identify the issues, and there's also less debate over ownership of who owns the issue at that point. Right? Like because usually, it's just the platform or data ops or data engineering team that's working on the ingestion part of the data management life cycle.
Whereas if you're in the warehouse, there could be a dozen different teams that could be responsible for that data versus code versus infrastructure question. The last thing I'd point out is that priority always becomes an issue. Like, whether that's, you know, prioritizing what you wanna work on in a sprint or whether that's prioritizing what's most important for your team versus the stakeholders. Like, if there is a missing column on a dashboard, that's probably a huge priority for your stakeholders, where for the engineering team, yes, we wanna deliver that, but you've got an SLA that gives us a week. Right? So should we really prioritize that over the next 24 hours when we'd still be within compliance? So that last one there, when it comes to priority, you know, it's part of the iron triangle. Right? Like, you can only move so far to one corner without sacrificing everything else that exists there. So those are just a couple of the difficulties that I think of when you talk about comparing engineers and stakeholders finding those issues in their workflows.
[00:30:47] Unknown:
As far as moving these data quality checks as early as possible into the pipeline and sort of wending their way throughout the overall life cycle, I'm wondering if you've seen any useful approaches for being able to aggregate that information across those various touch points to be able to identify the sort of occurrence of a data quality error that only manifests after it has made a few sort of transitions through those various stages and being able to sort of stitch those things together. You know, in application observability, you would generally aggregate microservices into an overall service definition to say, you know, all of these work together to be able to provide this overarching service. And I'm wondering what you see as sort of the analogy in data infrastructure observability and data pipeline observability to be able to stitch these various locations and data collection points into this overarching, this is the end result or this is the sort of target use case that all of these contribute to, particularly since they are sort of very fractal in nature where, you know, one piece of data that's coming from one source might actually end up getting used in tens or hundreds or thousands of different terminal use cases.
[00:32:06] Unknown:
I love that you mentioned fractal because I think the application of chaos theory to data quality is a really interesting topic, but that's an episode in and of itself. So we won't go down that rabbit hole. I'd say that for us at Databand, what we're really trying to achieve is to build the Datadog of data pipelines. Right? And so I love that you set this up in terms of application observability because from our perspective, your pipeline is the core process that you're going to monitor. And as you monitor these pipelines, you can look at it from an operational observability perspective to understand the state of the execution.
And then when you look inside of the pipeline itself, you can monitor your data in motion as it moves from source to destination, from lake to warehouse, from warehouse to visualization platform. And by looking at that data in motion, that helps you identify the states that your data moves through. And when you are looking at each stage of the data management life cycle and identifying each state that your data hits, that allows you to understand when your data went from an acceptable state into an unacceptable state. And when you see these changes occur from acceptable to unacceptable, you can then tie this back to the dataset that this data came from. So you've got pipelines that are your processes. These get tied back to your datasets, which we view as the microservice in the application observability analogy. And once you understand how your datasets are changing over time due to the actions that are being applied to them, then that's gonna give you better perspective on the overall health of your data ecosystem and ultimately give you a better understanding of what your data management life cycle looks like. So breaking it down simply, pipelines are your processes, and datasets that those pipelines act upon are the microservices that are being monitored.
And monitoring the pipelines and datasets together ultimately gives you an understanding of how well your application works, which in our scenario here, we're just talking about your data management life cycle, your data ecosystem in general, and ultimately, like, the trustworthiness of your data.
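A deliberately generic sketch of what treating datasets as the monitored unit can look like: each pipeline stage emits an observation keyed by pipeline, dataset, and stage, so a shift from an acceptable to an unacceptable state can be pinned to the stage that produced it. The function name and metric fields are illustrative and are not Databand's actual SDK:

```python
import datetime as dt
import json

def record_dataset_metrics(pipeline: str, dataset: str, stage: str, metrics: dict) -> None:
    """Emit one observation tying a dataset's state to the pipeline stage that produced it.

    In a real system this would go to your observability backend; printing JSON
    stands in for that here.
    """
    print(json.dumps({
        "ts": dt.datetime.utcnow().isoformat(),
        "pipeline": pipeline,   # the "process" being monitored
        "dataset": dataset,     # the "microservice" in the application-observability analogy
        "stage": stage,         # extract / transform / load step inside the pipeline
        **metrics,
    }))

# Example: the same data observed at two stages of one pipeline run, so a change
# from an acceptable to an unacceptable state can be tied to a specific stage.
record_dataset_metrics("orders_daily", "raw.orders", "extract",
                       {"row_count": 10_000, "null_order_id_pct": 0.0})
record_dataset_metrics("orders_daily", "warehouse.orders", "load",
                       {"row_count": 9_200, "null_order_id_pct": 3.1})
```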
[00:34:14] Unknown:
And as far as the sort of establishing trust, that's always the hardest thing to do and the easiest thing to lose. And that's sort of the overarching goal of all data quality efforts is being able to establish and maintain that trust both for the end stakeholders and for the people who are responsible for the data platform. Because once the data platform owners lose trust in the accuracy of the data, then, you know, it becomes very easy to succumb to burnout and just, you know, go to the next job in hopes of greener pastures. And so as far as being able to sort of build and maintain that trust and at the same time balancing the challenge of alert fatigue where you don't want to be paged on every minor issue that, you know, maybe is indicative of some potential problem for data quality, but, you know, in 80% of the cases is just noise and doesn't actually contribute any errors. I'm wondering what you have seen as some useful patterns for being able to, you know, take these individual metrics and individual data points and error conditions and expectations around the various aspects of your data and kind of weave that together into an approach that strikes that balance between alert fatigue and lack of trust?
[00:35:33] Unknown:
So this is the part that I actually find fun. Right? Because this is the point where we can take this conversation where we've been talking about data quality and actually place it into why a data observability tool is talking about data quality. And so for me, the way that I like to talk about it is you have orchestration that exists. And orchestration allows you to run your pipelines on a regular basis, and that just allows you to automate the data quality checks that you want to run. Right? And so by having that orchestration in place, now you have a reliable way of gathering data quality information. But if no one's looking at it and no one's acting upon it, then what good is that for you? Right? And so for me, the observability becomes kind of like the natural evolution of data quality, and that's allowing you to complete that feedback loop. You're able to gather the metrics from your data quality framework. You're able to understand what that root cause is. You're able to iterate on the data or the code or the infrastructure issue that you've identified and resolve it so that way you don't see it occurring any further. And the interesting use cases that we've seen clients use Databand to do that with is being able to use our application to track these data quality metrics and the state of their data and pipelines in a time series and a trend analysis and allowing them to be able to look back and say, okay.
We expected this range for our sales totals. But over the course of a month, they've slowly dipped down and down. And while that may not have triggered an anomaly alert for us, we have seen that there's, like, a 10% decrease. So now is that a business problem, or is that a data problem? And we can start looking into that. And whenever you have this observability layer that's collecting data quality metrics and displaying it in a time series manner, that also allows you to start tracing lineage of the state of your data that I referred to earlier. So if you are running an analytical pipeline where you're pulling data out of Snowflake and you're running it through your dbt transformations, and then you're putting it into a data mart or sending it over to Tableau to populate the dashboard.
And you notice that the dashboard isn't displaying the information, or you probably don't notice it. Your stakeholder calls you and says, hey. This column's not exactly doing what I expect it to. By going into Databand, you can actually see, like, okay. This change did occur in the analytical pipeline, but do we see a correlating change in the pipeline that previously wrote to this dataset? And being able to trace that all the way back to the external source that the data came from and allow you to quickly identify that root cause, that's been a really fun experience to see clients kind of find a way to incorporate Databand into their operational workflows.
We have a client currently using an alerting system based on the severity of the alerts that they've set up that actually goes and notifies their on-call person through PagerDuty. They've then set up a post-action pipeline that creates a Jira ticket so that alert can actually be tracked. And now the information that's coming back from your data quality framework is actionable in an immediate sense. You don't have to go digging through the code. You can see, okay, this is where it came from. This is where I can go look at it because everything's centralized in Databand.
And then you can understand whether you've seen this before because you have a history of these issues that are in Jira that have notified it before. So being able to track your data quality metrics through time series analysis and then enable data lineage and understanding how the state of your dataset has changed based on the pipelines that are interacting with it just creates this whole new level of understanding without having to roll your own framework to really get there.
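As a small illustration of the slow-dip case that never trips a point-in-time anomaly alert, a rolling comparison over the collected metric history can surface gradual degradation. The window and threshold are assumptions, and the returned message is the kind of thing you would route to PagerDuty or attach to an auto-created Jira ticket:

```python
import pandas as pd

def gradual_drop_alert(history: pd.Series, window: int = 30, threshold: float = 0.10) -> str | None:
    """Compare the recent window of a tracked metric against the window before it.

    This catches the "slowly dipped down over a month" case that a point-in-time
    anomaly check can miss. Returns an alert message, or None if nothing to report.
    """
    if len(history) < 2 * window:
        return None  # not enough history collected yet
    recent = history.iloc[-window:].mean()
    prior = history.iloc[-2 * window:-window].mean()
    if prior and (prior - recent) / prior > threshold:
        return f"metric fell {(prior - recent) / prior:.0%} versus the previous {window} points"
    return None

# Example with a synthetic daily series that drifts downward without any single
# dramatic drop; the returned message is what you would page or ticket on.
daily_sales = pd.Series([1000 - i * 5 for i in range(60)], dtype=float)
print(gradual_drop_alert(daily_sales))
```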
[00:39:25] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. In your experience since you've started working at Databand and have gained sort of broader exposure to the types of data quality problems that people are encountering and the manifestations of those and the ways that they're tracking them, I'm wondering what are some of the ways that your ideas and assumptions around the concept of data quality and how to approach it at an engineering and organizational level have been challenged or changed?
[00:40:47] Unknown:
It's kind of fun to think about where Databand started and how we ended up in the observability space. It started as an orchestration project. Right? Like, we wanted to give better orchestration, make a more data-aware project. But then as we were working with our clients, we saw that there was a big need for understanding what was occurring inside of that orchestration system, whether it's data-aware or not, understanding how your executions were changing over time, how your run durations were changing over time. That was really a bigger challenge for our clients than necessarily being able to orchestrate, because there are other tools that you could rely on to orchestrate. So we shifted into the observability space, and the first thing that I've seen is not so much a lesson, but an affirmation that there's not a one-size-fits-all.
You have to provide the flexibility and allow engineers to get creative in the way that they implement an observability tool that empowers the time series analysis of the data quality framework that's already in place. One of the other interesting lessons we've learned is that, like, there's a really fine line between, like, data quality management and data control. And depending on which team you're working with, you may not be able to control what is coming into your system. As we talked about earlier, like, when we shift our focus to the left and apply our data quality checks and focus our observability on the ingestion layer, it's obvious that we're not going to have as much control over the external sources that are coming in here. However, whenever we look further downstream, after that ingestion layer brings your external data into your data lake, we continue to see similar issues further down the pipeline where you have changes that are being made and they aren't necessarily being communicated either upstream to the application team or from the analytics team back to the data platform team. And so you still end up having to solve these same problems where you may be able to implement a data quality strategy. But if you don't control the data, then your options are limited on how you react to that. And in that scenario, Databand ends up playing a nice role where these teams finally have an avenue to communicate in a standardized way. They can point to the dashboard in Databand. They can point to the time series analysis of these data quality metrics. So that way, they can ensure everyone's on the same page at that point in time.
One of the more challenging lessons that we're having to grapple with is that, at the end of the day, Databand wants to collect the metadata and the description of the data that's coming through your system. But we don't wanna capture your data itself. Right? We want your data to stay where it exists. And so how do we enable the ability to run, say, column-level profiling on massive terabyte-sized datasets without removing that data from your system? And so we have to be intentional about the way that we architect our solutions so that data can remain where it's supposed to be while running in a performant manner and passing that metadata back to the Databand application. So that way, you can track this time series over time. It's a fun problem to work on, is the best way to say it. You really get into these creative situations where you're looking at not only data management and data privacy and data security, but you're also looking at distributed systems and cloud native technology and understanding, like, what's the trade off between supporting one cloud service versus another or supporting a particular type of compute, whether that's a Spark cluster or an Airflow environment, versus really focusing in on the data when it is at rest. You know, it's a simpler solution then. Right? Like, you can just deploy some sort of daemon or monitor that looks at the data at rest. It stays right there. You do a calculation. You send the metadata back. So I guess to sum that up, really, like, the challenge is data security. Right? Data privacy. Like, how do we capture metadata about a client's data and ensure that we're still protecting the data privacy and the data security that they ultimately need?
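One common way to square that circle, sketched here under the assumption of a DB-API style connection to the warehouse, is to push profiling down as an aggregate query so that only the summary statistics, the metadata, ever leave the system; the table and column names are illustrative:

```python
# Sketch: compute a column profile where the data lives and return only metadata.
# The cursor, table, and column names are illustrative; swap in your warehouse
# driver (Snowflake connector, BigQuery client, SparkSession, and so on).
PROFILE_SQL = """
    SELECT
        COUNT(*)                                                AS row_count,
        AVG(CASE WHEN order_id IS NULL THEN 1.0 ELSE 0.0 END)   AS order_id_null_fraction,
        MIN(discount_pct)                                       AS discount_pct_min,
        MAX(discount_pct)                                       AS discount_pct_max,
        AVG(discount_pct)                                       AS discount_pct_avg
    FROM orders
"""

def profile_in_place(cursor) -> dict:
    """Run the aggregate in the warehouse; only these few numbers leave the system."""
    cursor.execute(PROFILE_SQL)
    row_count, null_frac, dmin, dmax, davg = cursor.fetchone()
    return {
        "row_count": row_count,
        "order_id_null_fraction": null_frac,
        "discount_pct": {"min": dmin, "max": dmax, "avg": davg},
    }
```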
[00:44:55] Unknown:
And as far as the ways that you've seen the Databand product, both the open source library and the commercial service, applied to this problem of data quality and being able to gain greater visibility into the behavior of the data platform and the data pipelines, what are some of the most interesting or innovative or unexpected ways that you've seen them applied?
[00:45:17] Unknown:
I think one of the most interesting use cases I've seen is the client I mentioned earlier who's using Databand's alerting engine to inform their operational workflow, being able to quickly notify the engineer who's on call and automate the creation of Jira tickets, being able to not only use Databand to track the history of data quality metrics, but also track the history of SLA violations or triaging these events or these operational incidents and providing better visibility when it comes to retroing and iterating on resolving these issues when they come in. Like, I think that's a really interesting use case. The other interesting use case is I was talking to a client the other day, and they are using Databand.
They've got probably hundreds of different external APIs that they're pulling data in from, and they have no control over those APIs. And they've noticed over time that certain APIs are more prone to changing their payload than other ones are. And they've actually used Databand as proof, where they've shown the information that they've collected in their Databand instance to that vendor who's providing them with data to say, this is how often we're seeing issues with your source breaking our pipelines. And, again, using Databand as that standard form of communication between teams, not only within your own company, but also throughout the data ecosystem and the data industry. I think that's a really promising use case for Databand in the fact that, like, it becomes a way for people to communicate across domain and across discipline, whether that's inside of a company or external to their company. The last use case that I kinda find fun drifts a little bit away from our observability conversation and dabbles into the roots of Databand on the orchestration side. But there's a client that's using Databand for its observability purposes to track metrics and understand how their pipeline's health is changing over time, but also using parts of our orchestration library that allows them to make their workloads data-aware within Airflow and also using our orchestration library to coordinate deployment of Kubernetes pods that ultimately are used for running machine learning models that are then conditionally moved from one lower environment to another based on the results. And the last time we discussed it, we saw something like one pipeline was able to spawn over a hundred Kubernetes pods, which then ended up running over a thousand or 2,000 different, like, machine learning models. So just seeing that level of scale all within a single product that is orchestrating it across these different platforms, Airflow, Kubernetes, Oozie's involved over there, and then reporting it all back to a central system so it can be kind of digested and consumed. Like, it's really fun to see the creativity in that use case. Those are the ones that come to mind.
[00:48:37] Unknown:
And so for people who are interested in digging more into this data quality and data observability space, and they wanna be able to collect more information and identify alerts earlier in the life cycle of data and push it closer to the source, what are some of the cases where Databand is the wrong choice, or where that overall approach is too cumbersome and they're better off just going with the data warehouse as a choke point approach?
[00:49:04] Unknown:
I think Databand is the right choice when you wanna focus on your ingestion layer and you're managing a wide variety of external data sources that you don't control. It will allow you to run data checks early in your process to ensure that the shape or the schema of your raw data coming in is as you expect it to be. You wanna track the volume that's coming through and ensure that the freshness of your data is as you expect it. Did your pipelines run on time? Are they running at the length that you expect them to? Focusing in on that ingestion layer where you have a more custom built approach: you're building your own DAGs in Airflow, you're writing your own Python handlers to interact with these APIs or external databases or even internal databases.
That's gonna be the right choice for Databand, because all those things are gonna allow you to be proactive in your ability to identify issues early in the system. If you're looking for a tool to monitor the data when it's already in your warehouse, we're gonna be less helpful in that situation. If you are only dealing with a handful of sources that don't really cause you many issues, we can still give you value there. You're gonna be able to track those pipelines. You're gonna be able to do those same types of checks. But if ingestion is not the major pain point for you, then that's gonna be a situation where Databand may not be the best tool for your situation. And in general, if you're data quality focused like we talked about earlier in the episode, there's lots of different approaches to how you can tackle the data quality and data observability of your data ecosystem.
And if your approach is more focused downstream, in the analytics space, in the warehouse space, in the visualization layer, that's gonna be another use case where Databand is less of a fit. So if you are writing a lot of Python scripts and you're managing a lot of pipelines, whether that be through an orchestration system, or whether you're running them on cron jobs, or through Lambda and Step Functions, or through Dataproc, those types of situations where you have a lot of custom code and a lot of different sources that you need to keep an eye on, because you can't control those sources, that's gonna be the best use case for Databand. A minimal sketch of what those early, ingestion-layer checks can look like follows below.
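To make those ingestion-layer checks concrete, here is a minimal, hedged sketch in plain Python of schema, volume, and freshness checks that a custom extraction handler could run before loading anything downstream. The check names, thresholds, and the expected_schema mapping are assumptions for illustration and are not Databand's API.

```python
# Hypothetical sketch of proactive checks at the ingestion layer, before data
# lands in the warehouse. Plain-Python illustration, not Databand's API.
from datetime import timedelta

import pandas as pd


class IngestionCheckError(Exception):
    """Raised when a raw extract fails a proactive quality check."""


def check_schema(df: pd.DataFrame, expected_schema: dict) -> None:
    """Fail fast if the raw payload's columns or dtypes drift from expectations."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != expected_schema:
        raise IngestionCheckError(f"schema drift: expected {expected_schema}, got {actual}")


def check_volume(df: pd.DataFrame, min_rows: int) -> None:
    """Catch suspiciously small extracts (e.g. a silently failing API page)."""
    if len(df) < min_rows:
        raise IngestionCheckError(f"only {len(df)} rows extracted, expected at least {min_rows}")


def check_freshness(df: pd.DataFrame, ts_column: str, max_lag: timedelta) -> None:
    """Ensure the newest record is recent enough for downstream SLAs."""
    newest = pd.to_datetime(df[ts_column], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - newest > max_lag:
        raise IngestionCheckError(f"stale data: newest record is {newest}")


if __name__ == "__main__":
    # Illustrative usage inside a custom extraction handler.
    frame = pd.DataFrame(
        {"order_id": [1, 2, 3], "updated_at": ["2024-01-01T00:00:00Z"] * 3}
    )
    check_schema(frame, {"order_id": "int64", "updated_at": "object"})
    check_volume(frame, min_rows=1)
    check_freshness(frame, "updated_at", max_lag=timedelta(days=36500))
```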
[00:51:19] Unknown:
And as you continue to work with customers and help to grow the Databand project, what are some of the things that are planned for the near to medium term, or any projects that you're particularly excited to work on?
[00:51:31] Unknown:
I'm really excited to continue the collaborations that we're having with the data ingestion tools in the space. We've seen a lot of players there, with your Fivetrans and your Singers and your Meltanos and Airbyte. Working with those teams and collaborating with them on how Databand can be a good tool for their users, to provide observability on top of the really slick extract and load tools that these companies are providing, I think that's something that I'm looking forward to having a lot of fun with. On our product side, we're really gonna focus in on nailing the data in motion use case that I talked about earlier, where we're able to look at pipelines, peek inside those pipelines, and identify the read and write operations that are occurring, not only against your APIs or your file system, whether that be Parquet files or CSVs, but also expanding that into doing query tracking for your data in motion. In a couple weeks, we're gonna be rolling out some tracking on Snowflake for looking at the COPY INTO command, and we're gonna expand that out into Redshift. And so it's a really exciting time at Databand for me, because we're right on the cusp of that accelerating, exponential climb of more features. Because as we invest in the abstraction of interacting with a cloud data warehouse, it's easy to then implement that for the other competitors in that space and provide more value for the data teams that are coming and evaluating Databand for their tool. And at the end of the day, the thing that I'm excited for is just to keep talking about data quality and data observability, because I think that we're at a really interesting time where we can find ways to apply the best practices that have been learned in DevOps and the observability culture that exists there, finding ways to identify the signals we want in the operational layer of your workflows, and ensuring that you have a proactive approach to understanding where your data is going from an acceptable state to an unacceptable state, and then being able to quickly act on that information as well.
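As a hedged sketch of the kind of load tracking described there (not Databand's implementation), the query below reads Snowflake's ACCOUNT_USAGE COPY_HISTORY view to see how many rows recent COPY INTO operations loaded, so those counts could be tracked as metrics over time. The view and column names should be verified against Snowflake's documentation for your account, and the connection handling is a placeholder.

```python
# Hedged sketch: pull recent COPY INTO activity from Snowflake's account usage
# data so row counts and error counts can be tracked as load metrics.
# View and column names are assumptions to verify against Snowflake's docs.
COPY_HISTORY_SQL = """
    SELECT table_name, file_name, row_count, error_count, status, last_load_time
    FROM snowflake.account_usage.copy_history
    WHERE last_load_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
    ORDER BY last_load_time DESC
"""


def fetch_recent_loads(conn):
    """Return the last 24 hours of COPY INTO activity as raw rows.

    `conn` is any DB-API connection, for example one created with
    snowflake-connector-python's snowflake.connector.connect(...).
    """
    cur = conn.cursor()
    try:
        cur.execute(COPY_HISTORY_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```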
[00:53:42] Unknown:
Are there any other aspects of the overall space of data quality, being able to be proactive about its identification and management, and the work that you're doing at Databand to help enable that approach that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:00] Unknown:
I'm excited to engage in the observability conversation as a whole. I did a meetup a little while ago with Nick from Fivetran, and we really talked about what it can look like to have an observability framework on top of your data quality framework and the different layers of observability that exist there. With the approach that we're taking at Databand to build out data platform observability, we're looking at that operational level first. We see that the executions of your pipelines are running as expected. We're getting successes. We understand when the failures occur. We're capturing errors. We understand if we have an anomaly within the run duration.
And after establishing that operational layer of your pipelines, we can start looking at the data in motion within that pipeline. Right? And so we're seeing the schemas that are coming through. We're applying simple data quality checks around completeness and accuracy, and then evolving that one level higher in what I describe as a hierarchy of observability, where you can look at column level profiling. You can look at the summary statistics and understand, over time, how has the mean of this numeric column changed? How has the completeness of this datetime column changed? How has the null count for this varchar column changed? And being able to not only track that over time, but also apply alerts on it, making it actionable for people to resolve and iterate quickly, so you don't see the issue again in the future and it doesn't continue to move downstream into your pipeline. And after you've looked at the column level, that's when you get to the top of the hierarchy, where you can apply row level validation. Right? With Databand, we've talked about monitoring pipelines and looking at datasets, but you can also define any custom metric that you want. And so if you have a data quality framework in place, like Great Expectations or Deequ, that's doing row level validation, you're returning a value from those validations.
You can pass that directly into Databand. Now it's a metric in Databand. And once that metric exists in Databand, you have the ability to create alerts on it. That could be a simple Boolean conditional alert. We have out of the box anomaly detection alerts. We have the ability to create range alerts. So you can not only capture the data quality metrics that are part of your custom framework, but you also have the flexibility to define how you should be alerted on them, to avoid the alert fatigue that you mentioned before.
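For example, a row-level validation result could be forwarded as a custom metric along these lines. This is a hedged sketch that assumes the open source dbnd library exposes log_metric and task helpers and uses Great Expectations' pandas API; check both projects' current documentation before relying on these exact calls.

```python
# Hedged sketch: feed a row-level validation result into the tracker as a
# custom metric so it can drive alerts. Assumes dbnd's log_metric/task helpers
# and Great Expectations' pandas API; verify against current docs.
import great_expectations as ge
import pandas as pd
from dbnd import log_metric, task


@task
def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    ge_df = ge.from_pandas(df)

    # Row-level validation: every order must have a non-null order_id.
    result = ge_df.expect_column_values_to_not_be_null("order_id")

    # Forward the validation outcome as metrics; once they exist as metrics,
    # conditional, range, or anomaly alerts can be defined on top of them.
    log_metric("order_id_null_check_passed", result.success)
    log_metric("order_id_unexpected_count", result.result.get("unexpected_count", 0))

    return df
```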
[00:56:15] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:34] Unknown:
I think this comes back to the most difficult challenge that I face while working at Databand, and it comes down to data security and data privacy. Right? There's a lot of tools out there that are focusing in on that ingestion layer and expanding the, quote, unquote, modern data stack. Right? And it's really great to have these tools and the explosion in the space, but there seems to be very little focus on a solution to simplify the way that we manage data privacy, data security, and data access. It seems like something that we're still, as an industry, leaving up to various providers, whether that be your cloud provider or your on prem database provider. We're still relying on things like role based access control, which still works. Right? If it's not broken, don't fix it. But at the same time, as we see this explosion and evolution of the workflow management tools for the data space, I'm curious to see how we're going to maintain, or even create, best practices for managing data security, privacy, and access.
[00:57:35] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Databand, helping to drive forward the capabilities of identifying sources of errors earlier in the life cycle of data and gaining greater observability about its overall impact on the data platform. So thank you for all the time and effort you and the folks at Databand are putting into that, and I hope you enjoy the rest of your day. Awesome. Thanks so much, Tobias. Enjoyed it. Take care. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways that it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Michael Harper
Defining Data Quality
Stakeholder Perspectives on Data Quality
Concept Drift in Data Quality
Market Approaches to Data Quality Tools
Automation vs. Explicit Data Quality Checks
Proactive vs. Reactive Data Quality Approaches
Aggregating Data Quality Information
Establishing and Maintaining Data Trust
DataFold Advertisement
Challenges and Lessons in Data Quality
Interesting Use Cases for Databand
When Databand is the Wrong Choice
Future Plans for Databand
Closing Thoughts and Final Questions