Summary
Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metaplane is and the story behind it?
- Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
- What are the areas of differentiation that you see across vendors in the space?
- Can you describe how the Metaplane platform is architected?
- How have the design and goals of Metaplane changed or evolved since you started working on it?
- establishing seasonality in data metrics
- blind spots from operating at the level of the data warehouse
- What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?
- When is Metaplane the wrong choice?
- What do you have planned for the future of Metaplane?
Contact Info
- @kevinzhenghu on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Metaplane
- Datadog
- Control Theory
- James Clerk Maxwell
- Centrifugal Governor
- Huygens
- Amazon ECS
- Stop Hiring Devops Experts (And Start Growing Them)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today, that's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can use software engineering best practices: git, tests, and continuous deployment, with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Your host is Tobias Macey, and today I'm interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between. So, Kevin, can you start by introducing yourself?
[00:02:02] Unknown:
Pleasure to meet you, Tobias. It's great to be on the show. I'm Kevin. I'm the cofounder and CEO of Metaplane. We like to think of it as a Datadog for data. It's a data observability tool that continuously monitors your data stack, and alerts you when something goes wrong, and then gives you the relevant metadata to debug.
[00:02:20] Unknown:
And do you remember how you first got involved in the area of data? I do. I got into data management
[00:02:26] Unknown:
when I was an undergrad studying physics. MIT at the time had a notoriously difficult experimental lab course, where every 2 weeks you replicate a Nobel Prize winning experiment. The first week, you do the experiment; the second week, you analyze the data. And even though it's known as a killer course, I noticed that the students who had the hardest time weren't the people who didn't know the math or the physics; it was the people who didn't know how to code or analyze data. As it turned out, the friction to working with data was exceptionally high, and it still is, not just for scientists, but for people across domains.
So I spent the next 6 years doing research on lowering the friction of working with data, applying machine learning to automating data visualization and to type detection. And, yeah, that's how I got into the space. And the transition from research to founding a company was because, you know, the research was exciting, but ultimately, I think the way to have the largest impact on data
[00:03:26] Unknown:
is through building tools that people use. I'm wondering if you can describe a bit what it is that you're building at Metaplane, and some of the story behind how it came about, and why you decided that this was the problem that was worth spending your time and energy on. So when we were trying to implement some of our research at, you know, both small startups
[00:03:44] Unknown:
and very large companies, one of the largest friction points was making sure that the data was correct. And of course, data quality is a timeless problem. Papers on data quality go back to some of the original papers on databases. But after talking to my 2 co-founders, who were previously software engineers at HubSpot, it was clear that in the software world, there is a playbook, and there are tools to diagnose and fix issues: Datadog, SignalFx, you name it. Data teams don't have the playbook and don't have the tools yet. That's what we are building with Metaplane.
[00:04:22] Unknown:
Broadly speaking, you have positioned yourself in the space of data observability, which is a fairly new term that is still nascent and going through its sort of phase of self discovery. So I'm wondering if you can share your working definition of that term and what it means to have data observability.
[00:04:43] Unknown:
Great question. I think everyone has their own definition. For us, we like to go back to the roots, not just to software observability, but even to control theory, way back to when observability got started. It was James Clerk Maxwell trying to formalize how to understand the system that was used to control the speed of engines, called a centrifugal governor. It was originally created by Huygens to control windmills, and later to control steam engines. From there rose the idea of controllability: given the inputs of a system, how can you understand its internal states?
And observability is the corresponding concept, the mathematical dual: given the outputs of a system, how can you understand its internal states? A lot of this is inspired by the idea of, how can I reconstruct the state of a system at any point in time and understand how it will evolve in the future? And that was really the kernel of the idea that was borrowed by software observability vendors, where they said, okay, given metrics, traces, and logs, these three pillars, we can reconstruct the state of a software system at any point in time. And to actually answer your question, in data observability, we believe that there are four pillars, metadata, metrics, lineage, and logs, such that if you capture those four, then you can reconstruct the state of a data system over time. So we think of observability as how much you can reconstruct a data system, and correspondingly, the degree of visibility that you have into one.
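To make the control theory roots concrete: for a linear time-invariant system, observability has a precise textbook criterion. This formulation is standard control theory, not anything specific to Metaplane.

```latex
% Linear time-invariant system: internal state x, inputs u, outputs y.
\begin{aligned}
  \dot{x}(t) &= A\,x(t) + B\,u(t) \\
  y(t)       &= C\,x(t) + D\,u(t)
\end{aligned}
% The internal state x can be reconstructed from the outputs y (the system
% is observable) exactly when the observability matrix has full rank n:
\mathcal{O} =
  \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n
% Controllability is the dual condition on the inputs, via the matrix
% [B, AB, ..., A^{n-1}B].
```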
[00:06:27] Unknown:
Another interesting element of the space of data observability and data quality, and the sort of conflation of those two terms, is that they are areas, particularly in the data space, that have seen a lot of activity over the past one to two years. And I'm wondering what you see as the areas of differentiation across vendors in the space, and some of the ways that different interpretations of observability and quality can manifest?
[00:06:52] Unknown:
Just to start off, even though quality and observability are very conflated, and in some ways observability may be cynically regarded as a rebranding of quality: data quality is a use case, right? It's a problem to be solved, and a very important problem. Observability is a technology. By gathering and centralizing all of the metadata into one central system, you can, yes, identify and address data quality issues, but you can also address many other issues, for example, lineage, and spend monitoring, and usage analytics on your data warehouse. And vendors in this space, I think, do differentiate along that axis, which is: how comprehensive is the metadata that your tool collects? Is it focused just on the warehouse?
Or does it go upstream to transactional databases? Does it go downstream to reverse ETL tools or to BI tools? And the other axis, which we think a lot about, is how accessible this observability tool is. Is it designed for the Fortune 1,000, where you have to talk to the VP of sales and go through a large implementation process to use it? Or, like every other tool in the modern data stack, like dbt and Snowflake and Looker, can you just try bringing on observability in 15 minutes? If it works, that's great. If it doesn't work, no harm, no foul.
[00:08:20] Unknown:
And in terms of the axis that you were discussing, as far as, you know, do you cover the entire life cycle of the data, from the transactional systems and SaaS platforms that originate the data through to delivery and interpretation, and then closing that loop? Or do you decide, from some arbitrary point in the middle or at the end, this is the domain that I'm going to cover, and then figure out from there what are the attachment points and additional systems that you need to be able to maintain coverage of? I'm curious how you approached that question as you were starting to architect and iterate on the idea of Metaplane, to see what is the highest leverage point that we can go from that will provide the most value, versus saying, idealistically, we want to cover all of the use cases, and then figuring out what the approach is.
[00:09:17] Unknown:
You just gave us the answer right there. And it's a great question, where the highest leverage point for an observability tool is to connect to the warehouse. Right? That's the center of gravity of your data stack, so to speak. And there's only a handful of vendors that you need to integrate with to cover a large percentage of data organizations in the world. So we did start there. However, there's an ongoing debate in observability, even in software observability, of: do you monitor the causes, or do you monitor the symptoms? And there's pros and cons to both.
I think a lot of the software world has landed on the consensus that we do want to monitor the symptoms, because that is the most correlated with, you know, end user performance, and you can always trace back into potential causes. So in our world, it might be monitoring BI dashboards, or machine learning models, or go to market tools. So to answer your question, we're starting from the warehouse, but kind of going outwards to adjacent spaces.
[00:10:25] Unknown:
And as far as exploring those adjacencies, how did you think about the kind of network effects of saying, okay. We have covered the warehouse or even identifying what your coverage is for the warehouse to determine your level of completeness to say, we need to spend more energy at just covering all of the interactions and elements that exist within the confines of the data warehouse versus saying, we've got, you know, the core workflows managed, and now we need to think about branching out into these other systems so that we can get an end to end coverage of a small subset of use cases versus a complete set of coverage in a more narrowed sort of technological scope?
[00:11:06] Unknown:
Increasingly, I think more and more of the life cycle of data is within the warehouse, going from, like, a raw landing zone from ELT tools to a staging zone, to a fully modeled, ready to consume layer. So by integrating within the warehouse, I think you do cover a decent segment of the life cycle of data. We are going upstream and downstream, but not necessarily focused on 1 particular use case. To give you an example, we have integrations with both Snowflake and Postgres, and many of our customers use both of them. Not only for detecting schema changes that might be caused in upstream systems that impact downstream systems, but also replication issues.
So I think focusing on those sorts of jobs makes it so that tools like ours don't necessarily need to be focused on the use case, and yet will still address those use cases.
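As a rough illustration of the schema change detection described above, here is a minimal sketch that diffs two snapshots of a table's columns, as you might pull them from a database's information_schema. The names and logic are hypothetical, not Metaplane's implementation.

```python
def schema_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, dict]:
    """Diff two {column_name: column_type} snapshots of one table."""
    return {
        "added":   {c: t for c, t in after.items() if c not in before},
        "dropped": {c: t for c, t in before.items() if c not in after},
        "retyped": {c: (before[c], t) for c, t in after.items()
                    if c in before and before[c] != t},
    }

# Snapshots taken a day apart, e.g. from information_schema.columns:
yesterday = {"id": "bigint", "email": "text", "signup_at": "timestamptz"}
today     = {"id": "bigint", "email": "varchar", "referrer": "text"}
print(schema_diff(yesterday, today))
# {'added': {'referrer': 'text'},
#  'dropped': {'signup_at': 'timestamptz'},
#  'retyped': {'email': ('text', 'varchar')}}
```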
[00:12:03] Unknown:
And so can you dig a bit more into the actual implementation and architecture of Metaplane and how you think about the collection and analysis of the signals that people are generating from their various data systems and platforms and workflows?
[00:12:20] Unknown:
So every hour, we use Amazon's ECS to spin up some computing resources and make tens of thousands of observations across our customers' warehouses and BI tools and transformation tools. And on that hour, we retrieve the observations and train a machine learning time series model, or, depending on the type of test, an ensemble of models, and then alert you when there is an unexpected event.
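As a sketch of what one such test might look like, here is a robust z-score check on a table-level metric like row count. Metaplane's actual ensemble of models is more sophisticated, so treat this as illustrative only.

```python
import statistics

def is_anomalous(history: list[float], observation: float, z_threshold: float = 3.0) -> bool:
    """Flag an observation far outside the historical spread of a metric."""
    median = statistics.median(history)
    # Median absolute deviation resists past outliers better than stddev.
    mad = statistics.median(abs(x - median) for x in history)
    if mad == 0:
        return observation != median
    robust_z = 0.6745 * (observation - median) / mad
    return abs(robust_z) > z_threshold

hourly_row_counts = [10_200, 10_450, 10_390, 10_610, 10_550, 10_700, 10_640]
print(is_anomalous(hourly_row_counts, 10_720))  # False: normal variation
print(is_anomalous(hourly_row_counts, 2_100))   # True: the table lost most of its rows
```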
[00:12:46] Unknown:
And then as far as the periodicity there, you mentioned that you do it hourly. I'm curious what the sensitivity is to latency in being able to uncover some of these anomalous events, or being able to gain some insight about the performance of your data stack, particularly as it compares to the appetite for latency that you might see in a production software system?
[00:13:12] Unknown:
The characteristic time scale of the time series that we analyze makes it such an interesting problem. Right? In a given data warehouse, sometimes you might have data landing every minute, or even on a sub-minute or second basis. Other times, you might have data that's refreshed every week. And it's hard to say that one is necessarily more important to monitor than another. But at the end of the day, seasonality is incredibly important, especially now that we've passed the holiday season, and many of our customers are e-commerce companies. We don't wanna send everyone alerts the moment that Black Friday or Christmas comes. So we do account for seasonality, and we also account for trends. And we try and be very careful about understanding, okay, does this table change every day, or every week, and taking that into account.
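One way to fold seasonality into that decision, sketched with the STL decomposition from statsmodels: remove the trend and the weekly cycle, then alert only on what's left over, so an expected Black Friday spike doesn't page anyone. This is a stand-in for the models described above, assuming a daily metric in a pandas Series, not Metaplane's actual approach.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

def residual_outliers(series: pd.Series, period: int = 7, z: float = 3.0) -> pd.Series:
    """Return the points of a daily metric that remain anomalous after
    removing trend and weekly seasonality."""
    decomposition = STL(series, period=period, robust=True).fit()
    resid = decomposition.resid
    # Flag points whose residual is far outside the residual's own spread.
    return resid[(resid - resid.mean()).abs() > z * resid.std()]
```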
[00:14:07] Unknown:
And so one of the other things that you mentioned is the goal of being very accessible as a platform, to be able to bring some of these characteristics of observability to a broader audience. And I'm wondering who you think of as the primary consumers of the information that you're generating, and some of the ways that that has influenced your thinking about the feature sets that you need to develop, and some of the user experience and user interface patterns to lean on to be able to appropriately convey the necessary signals to the people who are interacting with your system?
[00:14:41] Unknown:
One of the most unexpected learnings throughout the course of starting this company is just how much data is changing the world. In our world, it's kind of clear: okay, startups are hiring heads of data, and large companies already have large data teams. But you'd be surprised, as we were, that even some of the smallest startups out there are starting to spin up data stacks. And the people who are creating the data stacks, and gathering and analyzing the data, don't necessarily have the title of head of data or data engineer or analytics engineer, but they are doing the work. So for a lot of growth stage companies, including many of our customers, the people who get the most value out of Metaplane at first may be the data engineers or the heads of data. But very quickly, we've noticed that the #metaplane-alerts channel can maybe start at 2 people, but quickly it grows to a dozen people, or 20 people, as, you know, something breaks, and I at-mention one of my colleagues saying, hey, this table looks weird, can you please check it out? You were recently touching this dbt job.
So the people who bring us in aren't necessarily the people who are impacted by the data, or causing some of the issues,
[00:15:58] Unknown:
but Metaplane is disseminated throughout organizations to all of them. You mentioned that you're also working with some of the lineage information and integrating with the BI systems, and I'm curious how that manifests. Is it something where somebody needs to go to the Metaplane dashboard to be able to view these different linkages and view the freshness of the report that they're querying? Or is it something where, in the BI system, you're going to incorporate a signal from Metaplane to say, you know, this report is up to date, this is the last time it was refreshed, these are the signals that we're relying on to be able to compute that fact? Or, from the dashboard, being able to say, I want to understand more about the state of this report, and then be able to jump from that into Metaplane to dig deeper into some of these aspects of observability in the data space?
[00:16:54] Unknown:
That is such a good idea, and one that we are looking at. Unfortunately, we don't have that dashboard for your dashboards yet, so to speak, or a KPI for your KPIs, if we're trying to be cute. But the way that our users consume the lineage information primarily is actually through the alerts. They're using a tool like Slack or email, and Metaplane sends them an alert saying, for example, this users table is typically updated every hour, but it's been 7 hours since it's been updated. We include downstream as well as upstream lineage. So, for example, these three Looker dashboards that have been viewed 500 times are impacted by this table being delayed, and here are the upstream dbt models.
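That freshness alert reduces to comparing the time since a table's last update against its learned update cadence; a minimal sketch, with hypothetical logic rather than Metaplane's actual model:

```python
from datetime import datetime
from statistics import median

def is_stale(update_times: list[datetime], now: datetime, tolerance: float = 3.0) -> bool:
    """'Typically updated every hour, but it's been 7 hours' => stale."""
    # Typical interval between consecutive updates, learned from history.
    gaps = [(b - a).total_seconds() for a, b in zip(update_times, update_times[1:])]
    typical_gap = median(gaps)  # e.g. ~3600 seconds for an hourly table
    return (now - update_times[-1]).total_seconds() > tolerance * typical_gap
```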
[00:17:40] Unknown:
And as far as the actual pieces of information that you're collecting, I'm wondering if you can talk to some of the sort of categories of data that you're collecting to be able to establish the kind of observability of the system. Like, what are the pieces of information that are necessary to understand the key aspects of how the data platform is operating and be able to dig into some of the problem occurrences or be able to proactively identify this is going to cause an issue with this downstream system and just how that compares to maybe some of the data quality focused tools and vendors in the market?
[00:18:21] Unknown:
So going back to the control theory roots of observability, there's this idea of a state space representation of a system, where you want to describe a system in as few variables as possible, and to make sure that the variables contain orthogonal information. And for us, we believe that there are four variables that matter for data systems, two of which describe the characteristics of the data itself. One is the internal characteristics, like the metrics of the data. Right? What is the nullness of this column? What is the mean? Does it contain PII? And one describes the external characteristics of the data.
How many rows are there? Is it being refreshed? And then there's two variables that describe interactions within the data system. Right? Lineage: is this data derived from a computation applied to previous pieces of data? And logs: how does this data interact with external people and external systems? So with those four together, all of which Metaplane collects, we centralize all of the metadata and surface potential issues. The subtle difference, I think, with a data quality focused system is that, even if an issue isn't occurring, we still are collecting this metadata.
Because down the line, you don't necessarily know when an issue will occur. And ultimately, you want to be able to look back historically to see how your data has trended and changed over time.
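As a sketch of what collecting the first two of those variables might look like for a single table, one warehouse query can pull the internal metrics and the external characteristics together. The table and column names here are invented for illustration; the other two pillars, lineage and logs, come from warehouse metadata such as query history rather than from the table itself.

```python
# Hypothetical profiling query for one table, collected on a schedule
# whether or not an issue is currently occurring.
PROFILE_SQL = """
SELECT
    COUNT(*)                                        AS row_count,       -- external: volume
    MAX(updated_at)                                 AS last_updated,    -- external: freshness
    AVG(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS amount_nullness, -- internal: metric
    AVG(amount)                                     AS amount_mean      -- internal: metric
FROM analytics.orders
"""
```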
[00:20:04] Unknown:
So for people who are coming from the software space and leaning on things like logs, metrics, and traces, and trying to work with their colleagues on the data teams and map their understanding, or people who are coming from the software space and trying to build out a data platform: what are some of the useful analogies or useful mappings that you've seen for being able to say, okay, if you're working in the software space and you're used to using these three tools to understand how to trace back the overall flow of requests in the system, and now I'm going to go from, you know, distributed systems tracing to lineage tracking, being able to map those concepts back and forth to rely on some of the existing experience that they might have?
[00:20:47] Unknown:
To kind of get the point of observability across to someone who might have a software background, I ask them, you know, remember what developing software used to be like in the early 2010s? You might write a Rails app, push it to an EC2 box, put on a heartbeat check, and kinda call it a day. Today, if you start a software project and you don't install an observability tool, it is a rough start. And I would just ask, can you imagine a world in which you're developing a software system, and the only way that you know that you have a slow API request is when your users tell you, or when your app breaks?
Because, frankly, that is the case in a lot of data teams today. Unfortunately, I just don't think that the technology has, you know, come about yet to make this possible. So we take a lot of inspiration from Datadog. To give you an example, when you bring on Datadog for the first time, there's a whole mountain of integrations. It's not even a question whether or not Datadog integrates with the majority of your systems. And the moment you integrate, the time to value is extremely quick, and you can cover a large swath of use cases. As an engineer, I use Datadog. It's not a question that they have me covered, not only for software quality issues,
[00:22:12] Unknown:
but for any other sorts of tests I might wanna run. In terms of the kind of lessons that you've learned from Datadog as this example of 1 of the biggest players in the monitoring and observability space for software systems. What are some of the negative signals that you've been able to learn from to understand where to diverge from their example?
[00:22:38] Unknown:
Excellent question. I think data and software are different. Right? You know, it's very en vogue for data vendors to talk about how we're adopting many practices, like CI/CD and test-driven development, from the software world and applying them to the data world. But there are significant differences. One huge difference is that concept of lineage. Infrastructure mapping and request mapping are probably a decent analogy, but not quite on the nose. The idea that a piece of data within a BI dashboard comes from a computation applied to a warehouse, which comes from Shopify via an ELT tool, is kind of foreign. I would say it's novel to our space, and it's so critical that it needs to be plumbed throughout the application.
[00:23:31] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Going back to your focus on the data warehouse as the initial target, and then focusing on some of the downstream systems that are consuming from it: what are some of the blind spots that you have identified from using that as your focal point, where some of your customers maybe have questions that they're unable to answer because they don't have information from the prior stages of the data life cycle? Or some of the cases where you're trying to close the feedback loop: I have information about the state of this data as it exists in the data warehouse, it's now being consumed by this downstream tool, and maybe it's going out through a Census or a Hightouch back into these various SaaS platforms that are being re-ingested into the data warehouse. And just some of the types of questions about the system state and the overall holistic view of the platform that you're unable to satisfactorily answer given your current viewpoint?
[00:25:09] Unknown:
It's funny that the warehouse is both the single place, or aspirationally the single source of truth, and yet the source of none of the truth. Right? Snowflake does not create data. The data comes from another place. It either comes from a SaaS tool, or a transactional database, or some other source. So I would say that many of the users who use Metaplane are happy with the coverage provided by us monitoring their data warehouse, but it's not the complete picture. Right? The complete picture goes downstream, all the way down to the consumer.
Sometimes it loops all the way back with the reverse ETL tools like you mentioned, and all the way upstream to where the data was generated.
[00:26:02] Unknown:
Another interesting aspect of systems like this is that there are kind of two camps of users: there are the people who know that they need the system and will ask somebody to build it for them, or build it on their own. And then there are the people who will argue that they don't need the system because they already have unit tests or CI/CD, etcetera, or, you know, they have a small enough team that they're able to maintain all of the context. And I'm wondering what are some of the elements of customer education that you have had to learn or develop, to be able to help people either identify when the solution that you have built is the answer to the problem that they have, or understand when the problems that they're facing are related to the fact that they don't have a complete view of the data as it transits their various systems.
[00:26:59] Unknown:
We like to describe the status quo as primarily EDT, which is executive-driven testing, such that when you have many, many tables in your warehouse, and even more dashboards, who's the first to know about data issues? Right? Is it the end user, like a CFO checking out a financial reporting dashboard? If so, that might be a use case that Metaplane can help you address. We can help you be the first to know, then help you identify the issue, and then help you resolve it faster. One of the best parts about working in the data space is working with awesome data teams. And in many ways, our users could build Metaplane; they know how to build Metaplane. It's like you're saying, right? Building things in house is often an option, but just because you can build it in house doesn't mean it's the best use of your time. Let us take care of some of the plumbing and orchestration and modeling for you.
[00:28:08] Unknown:
In terms of the kind of workflows for people who are relying on Metaplane to answer their questions, I'm curious how you're able to hook into the various interaction points of the systems, where maybe somebody's coming from the data warehouse and they want to understand the lineage of a table; or they're coming in from their dbt models and they want to know what the performance characteristics are of the transformations that they're building; or they're trying to figure out: I've got this data that's landing in the source table, I have this report that is querying this data, and the latency from the delivery of the source data to the refresh of the BI table is, you know, 6 hours, and I'm trying to figure out how I can cut that down, what are the performance bottlenecks?
Like, how are you able to establish some of these deep links into: I have this problem, I'm at this point of the system, to jumping into Metaplane to say, this is the piece of information that I need, versus having to start at the 30,000-foot view and then dig down every time?
[00:29:23] Unknown:
Well, the majority of our users got started with Metaplane without talking with us. You can just go to the website and try it out, almost a la carte style. You can connect your Snowflake, if you want, for the use case that you mentioned. You can connect your dbt instance or a BI tool to address the other use cases you mentioned. And we do have a free plan, where you can run schema change tests for free forever if you want, and then it scales up based on the number of tests you have. But once everything is connected, and this typically takes less than 15 minutes, it's up to you what kinds of tests you wanna add on top. We can automatically add tests for you if they're based on data warehouse metadata, like freshness or row counts. You might also wanna tailor it a little bit, if you want specific tests like tracking the distribution of a numeric metric or the uniqueness of a primary key. But the idea is that once you attach those tests, then let us do our thing. You'll have schema change alerts immediately, and after a training period that depends on the natural variation of your data, you'll start receiving alerts on freshness and volume, and those will include the metadata that you need. And from there, you can provide feedback to our models.
For one thing, alert fatigue is very real. If you're listening to this podcast, your Slack is probably going off like crazy, and we do not wanna contribute to that. We only wanna send you the best alerts possible. So the workflow of many of our users is to provide us annotations that we take into account.
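To make those two tiers of tests concrete, a hypothetical sketch of how they might be expressed as warehouse queries: the automatic ones come from warehouse metadata, while the tailored ones scan the data itself. All table and column names are invented; in Metaplane these are configured in the product, not in code.

```python
# Tests that can be added automatically from warehouse metadata.
AUTOMATIC_TESTS = {
    "orders_freshness": "SELECT MAX(updated_at) FROM analytics.orders",
    "orders_row_count": "SELECT COUNT(*) FROM analytics.orders",
}

# Opt-in tests tailored to specific columns.
TAILORED_TESTS = {
    # Uniqueness of a primary key: this should always return 0.
    "orders_pk_unique": "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM analytics.orders",
    # Distribution of a numeric metric, fed into the anomaly models.
    "orders_amount_mean": "SELECT AVG(amount) FROM analytics.orders",
}
```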
[00:31:02] Unknown:
To the point of alert fatigue, and being able to identify what are the useful pieces of information, and understanding when it's important enough to actually raise an alert versus just providing a sort of informational note when somebody logs into the platform: how are you exploring that continuum? Because it's different for every team, and it's always a difficult balance to strike, no matter how hard you try.
[00:31:30] Unknown:
It is challenging. Part of it is understanding that, ultimately, the impact is what matters. Right? If this table is not fresh, or the number of rows has tanked, and it's not being used by anyone, perhaps that's not a very high priority issue. It's not a P0; you don't have to throw away your afternoon to solve it. But there are a lot of nuances when it comes to modeling the data, specifically in the data warehousing setting, which is a different setting than a lot of the off-the-shelf time series analysis tools assume, where you have an extremely high sampling rate of data that varies quite a bit. In a data warehouse, there are additional constraints on top of that. Your data might not vary every second; it probably does not.
I'm gonna give you one small example. If you take the number of rows in a table, oftentimes the number of rows is purely additive, right, with many incremental models, if you are tracking the number of events or a clickstream. So we wouldn't want to alert you on standard increases, even if the increase is statistically significant. But in this sort of table, we would want to alert you on a decrease. And it's a combination of those sorts of intuitions codified into models, plus the knowledge that the data, again, is interrelated, it has a lineage, so that if 20 tables go down but many of them have a single root cause, we wouldn't want to send you 20 alerts; we'd want to send you maybe one, or a handful of alerts. And it's crafting this system around, at the end of the day, the only metric that matters for us, which is the signal-to-noise ratio.
That's kind of where the complexity comes in.
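A sketch of that append-only intuition codified as a rule, illustrative only:

```python
def should_alert_on_row_count(history: list[int], current: int, additive: bool) -> bool:
    previous = history[-1]
    if additive:
        # Incremental models only ever append, so never alert on increases,
        # however large; rows disappearing is the real red flag.
        return current < previous
    # Non-additive tables get a symmetric test against typical variation.
    deltas = [abs(b - a) for a, b in zip(history, history[1:])]
    typical_delta = sorted(deltas)[len(deltas) // 2]  # median step size
    return abs(current - previous) > 5 * typical_delta

clicks = [1_000, 1_250, 1_600, 2_100, 2_900]           # append-only event table
print(should_alert_on_row_count(clicks, 4_000, True))  # False: big jump, but additive
print(should_alert_on_row_count(clicks, 2_400, True))  # True: rows were deleted
```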
[00:33:25] Unknown:
In terms of the workflow of onboarding to Metaplane and starting to explore it on their own, I'm curious what you have seen as some of the largest motivating factors for people saying, this is a pain point that I have, and this is the problem that I'm trying to solve, when they signed up for it. And then maybe some of the ways that they can incorporate Metaplane into their development and maintenance routines for their data systems?
[00:34:05] Unknown:
I would say that it's about one third, one third, one third, roughly, in terms of users who bring on Metaplane. One third are people who work on data systems and have come from a software background, like you mentioned before, who say, I cannot live like this. Right? I cannot go on not knowing whether my data is correct, whether the end result that I produce from my warehouse, which by the way is a production system, is up. The second group of users have come from an organization where observability is available, probably a large organization where they built a tool in house, and then they go into a new setting, or start a new team, and say, I kind of need this. Right? Once you see it, there's no going back. And then the third, honestly, are data leaders where crap hit the fan.
Right? Something went terribly wrong, and they were the people who were held responsible. I'll describe it as kind of those three buckets. And the overarching theme is that observability is a new category. It's in very, very early days, but it is moving quickly, and oftentimes, there's no going back. Without observability, you could make the argument that it's not necessary, but if you've already tried it, you know how valuable it can be, and how little overhead cost is required to both bring it on and maintain it.
[00:35:37] Unknown:
In terms of the applications of Metaplane, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used, or some of the problem solutions that people have settled on that you didn't anticipate when you were initially designing the platform?
[00:35:53] Unknown:
To go back to the question about the warehouse as the focal point, we thought that we could just integrate with the warehouse, and that would carry us for a long time. But very, very quickly, our users pulled us both upstream, to say, please integrate with Segment, with Postgres, with a production data store, and they pulled us downstream, to integrate with feature stores, right, inputs into machine learning models, and into BI tools. So that was unexpected, but it happened pretty quickly, from the technological standpoint.
From the kind of social relationship standpoint, we were very surprised to see Metaplane turn into almost a nucleus for team interactions. Whether it's on the data team, where we send an alert, and you at-mention someone, or forward it to someone else on your team, or you include someone who is directly impacted by the issue after we start sending alerts. And that was a bit of a surprise to us. We came in thinking that Metaplane is infrastructure, and infrastructure lives in the background. Well, honestly, sometimes infrastructure lives in the foreground.
[00:37:14] Unknown:
Another interesting element of the observability space for data, and where it sits in some of these larger trends that we've been seeing in the evolution of the data ecosystem, is that there's a decent amount of overlap from a number of different directions, where, you know, you are overlapping with some of the data quality tools, because you're able to surface some of the debugging information or raise alerts on anomalous data situations. You are overlapping a little bit with some of the data catalog and data lineage systems, because you have this lineage tracking to be able to understand, from a debugging perspective, this downstream problem occurred, so now I need to trace it back to its root, or multiple downstream problems occurred because of an error at this focal point. And there's also the overlap with some of these software observability systems that people are repurposing to work with their data platforms. I'm just curious how you think about inhabiting that kind of Venn diagram of problem domains, and some of the ways that you think about both differentiating as well as coexisting with these other systems that have overlapping, but occasionally orthogonal, purposes?
[00:38:27] Unknown:
For us, it comes down to solving a problem for data teams. That's our number 1 focus. And for us, we think observability is a problem that data teams inevitably run into, as data is being used across more and more use cases, beyond the main historical use case of business reporting, to machine learning and artificial intelligence and data science, to go-to-market operations. As more and more data is collected and modeled and used, the stakes only go up and up and up, and it's a matter of time until teams need a way to be confident in the data.
So within the narrow observability space, we think that every team should be able to bring on a data observability tool if they want to, and not make a whole thing of it. Right? It doesn't need to be a quarterly initiative. Bringing on a software observability tool is more of an afternoon kind of project, just to get started, and it's the same way for Metaplane: 15 minutes, and then you have your initial suite of tests laid out. And in terms of overlap with adjacent metadata driven tools, data quality tools, and cataloging tools,
I think one note is that the data space is still very small and quite early, where in the future, I think that a lot of these tools will differentiate and expand into separate niches, in the same way that, in the software market, you have observability tools, and you have unit testing, and CI/CD, and build tools. There are overlapping use cases, but they've differentiated and become interoperable. To the point that asking if Datadog can replace a unit testing tool is kind of like: oh, you do need both at the end of the day. And the reverse is also true. Right? Just having unit tests, when you're building a complex software project, does not replace the need for an observability tool. So I think that there is a core set of metadata that is shared between these tools, going back to the metrics, metadata, lineage, and logs, but it's being surfaced in many different ways to solve different use cases for different people.
[00:40:44] Unknown:
In terms of your experience of building the Metaplane platform and working with your customers and working with practitioners in the space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:57] Unknown:
The main interesting lesson is the misconception that observability takes a long time to implement. Going back to the software world, observability is not viewed as a huge initiative just to get started. Right? It's almost like: my focus is on building a scalable, usable software system that meets the specs, and I can add observability on top. It won't be an enormous lift, either in terms of time, or energy, or cost. And I think a lot of data teams, for good reason, view observability as a major effort to get started, because many other initiatives are huge efforts, like setting up a data warehouse or setting up a BI tool.
We don't think that should be the case for observability. Observability should be 10 or 30 minutes, where you don't need dedicated resources to get started.
[00:41:55] Unknown:
And for people who are trying to understand more about the behavior of their data systems, what are some of the cases where Metaplane is the wrong choice, and they might be better served with a more narrowly scoped data quality tool or a data catalog tool, or with building out their own internal tooling or internal systems?
[00:42:21] Unknown:
For one thing, if your primary data asset is unstructured data, Metaplane is probably not the right place to get started, either Metaplane or any data observability tool. Once you introduce unstructured data, you have a whole different set of concerns and analyses. And also, if there are not many downstream users or systems depending on the data that you're responsible for, you probably don't need Metaplane for now. Right? The moment that there is a critical use case attached to the data, that's the right time to bring on an observability tool, hopefully before then. But at the beginning stage, where the stakes are still relatively low, me personally, I would focus on making sure that the data is tied to your use case, and then bring on observability.
[00:43:11] Unknown:
And in terms of the work that you're doing on Metaplane now and into the future, what are some of the things you have planned for the near to medium term, or any projects or problem spaces that you're excited to dig into?
[00:43:24] Unknown:
Well, the observability space is still in its early days. Right? Every company was started in the past two or three years and has less than 100 customers. And our goal is to be the observability tool that every team can use, and that means being the first tool with 100, and then 1,000, customers. To that end, we are working in two big directions. One is deepening our integrations, and some exciting ones in the pipeline include Databricks and ClickHouse. And we're also shipping new features that use the metadata that we collect to augment the primary use case of being the first to know about data issues.
Some examples include tracking the lineage upstream of the warehouse, like we mentioned, or tracking credit spend and usage analytics of the tables within your warehouse. You know, ultimately, we only exist to save data teams time and money, help increase awareness, and help them make better decisions. So observability is just a technology; there are many use cases on top of the technology that can be created in service of those overarching goals.
[00:44:36] Unknown:
Are there any other aspects of the observability space, or some of its future evolution, or the work that you're doing on Metaplane, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:44:48] Unknown:
I think data teams are so busy and resource constrained that if anything takes effort or time, it should really, really be critically evaluated. And that includes bringing on an observability tool or implementing testing of any sort. We think that tests are hard to add; you will not have tests unless you rule with an iron fist. Same thing with observability: it's not easy to bring on observability unless it's a priority, and it'll probably happen a little bit late. So that's why we focus on automating as much as possible, not only automating the collection of observations of freshness and row counts from your data warehouse, but also automating the modeling, automating the lineage extraction, and the feedback mechanisms.
Our focus is to have users only give us information when it's needed to improve the models over time, not to do rote work that a tool should take care of for you.
[00:45:52] Unknown:
And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, from your perspective, what is the biggest gap in the tooling or technology for data management today? And, as a corollary to that, what are some of the areas of data observability that you think are still yet to be defined or properly understood?
[00:46:22] Unknown:
The biggest gap in our world today is probably hiring, where every team that we talk to is trying to hire, but there is simply not enough talent to meet the demand. And while I don't think there's a purely technical solution here, there must be ways that better tooling and technology can help train the talent, help people differentiate themselves, and help them connect to great companies to start or grow their careers. At the end of the day, the goal is to bring data to more and more companies, and more and more use cases, and the biggest bottleneck is the amount of people who are skilled and able to do that. Right? Tools are just tools. Right? We love Metaplane, but it's people who are using the tools and creating the data that are making the difference.
There's one other interesting gap that Sylvain, who leads growth at Census, brought up, which is, you know, integrating a tool like a reverse ETL tool or an observability tool with your warehouse today. It's kind of a painful and insecure process, where you provision this role, and you create a connection string and paste it in. Where is OAuth for data warehouses? I think that is a pretty big gap that any of the big vendors could figure out, and then the warehouse providers themselves, vendors like us, and users on data teams, everyone wins. And it's not like OAuth is a new technology.
[00:47:54] Unknown:
Yes. But then you don't get to have the joy of configuring your ODBC drivers.
[00:48:00] Unknown:
It's fun the first time. That's for sure. It's never fun. Maybe
[00:48:04] Unknown:
Not even the first time.
[00:48:06] Unknown:
You're right about that. That's true. That's true. Maybe to unlock OAuth, you have to do it one time just to... yeah.
[00:48:13] Unknown:
Yeah. And to your point about hiring, I think there's a great talk from the early days of the DevOps, I don't know if revolution or evolution or emergence is the right term, from Jez Humble, where he gave a presentation called Stop Hiring DevOps Experts (and Start Growing Them). And I think that we're in a similar situation in the data space, where we as organizations and engineers need to stop thinking that the solution to the capacity of our data teams is to go out and hire somebody who's already an expert, and start giving internal people the opportunities to grow into the role, and give them the education they need to be effective in that space.
[00:48:52] Unknown:
I think that is 100% right, and for one, I have to go read that article. Now, there are many people trying to break into data. Right? There's two truths that we have to hold in our minds at once: one is that talent does not meet demand, and the other is that data is a hot space and many people are trying to break into it. It's kind of a disconnect between junior and senior, and the way to bridge that gap, in lieu of external training, is, like you mentioned, being able to foster that talent yourselves.
[00:49:22] Unknown:
Well, I appreciate you taking the time today to join me and share the work that you've been doing on Metaplane and your perspective on the data observability space. It's definitely a very interesting ecosystem, and one that, as we've discussed, is still very nascent. So it's nice to see a lot of people who are starting to explore that problem domain and helping to flesh it out and understand the utilities and constraints of that overall solution space. So I definitely appreciate all the time and energy you put into that, and I hope you enjoy the rest of your day. Yeah. You too. Thanks a lot, Tobias. Take care. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Kevin Hu: Introduction to Metaplane
Kevin Hu's Journey into Data Management
Defining Data Observability
Data Quality vs. Data Observability
Architecting Metaplane: Key Decisions and Challenges
Implementation and Architecture of Metaplane
Primary Consumers and User Experience
Integrating Lineage Information
Collecting and Analyzing Data Signals
Mapping Software Observability to Data Observability
Blind Spots and Challenges in Data Observability
Customer Education and Adoption
Onboarding and User Workflows
Balancing Alerts and Information
Motivating Factors for Adopting Metaplane
Unexpected Uses and Team Interactions
Overlapping Domains in Data Observability
Lessons Learned and Challenges
Future Plans and Integrations
Closing Thoughts and Contact Information