Summary
Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metaplane is and the story behind it?
- Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
- What are the areas of differentiation that you see across vendors in the space?
- Can you describe how the Metaplane platform is architected?
- How have the design and goals of Metaplane changed or evolved since you started working on it?
- establishing seasonality in data metrics
- blind spots from operating at the level of the data warehouse
- What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?
- When is Metaplane the wrong choice?
- What do you have planned for the future of Metaplane?
Contact Info
- @kevinzhenghu on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Metaplane
- Datadog
- Control Theory
- James Clerk Maxwell
- Centrifugal Governor
- Huygens
- Amazon ECS
- Stop Hiring Devops Experts (And Start Growing Them)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today, that's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can use software engineering best practices: git, tests, and continuous deployment, with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Your host is Tobias Macey, and today I'm interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between. So, Kevin, can you start by introducing yourself?
[00:02:02] Unknown:
Pleasure to meet you, Tobias. It's great to be on the show. I'm Kevin. I'm the cofounder and CEO of Metaplane. We like to think of it as a Datadog for data. It's a data observability tool that continuously monitors your data stack, and alerts you when something goes wrong, and then gives you the relevant metadata to debug.
[00:02:20] Unknown:
And do you remember how you first got involved in the area of data? I do. I got into data management
[00:02:26] Unknown:
when I was an undergrad studying physics. MIT at the time had a notoriously difficult experimental lab course, where every 2 weeks you replicate a Nobel Prize winning experiment. The first week, you do the experiment; the second week, you analyze the data. And even though it's known as a killer course, I noticed that the students who had the hardest time weren't the people who didn't know the math or the physics; it was the people who didn't know how to code or analyze data. As it turned out, the friction to working with data was exceptionally high, and it still is, not just for scientists, but for people across domains.
So I spent the next 6 years doing research on lowering the friction of working with data, applying machine learning to automating data visualization and to type detection. And, yeah, that's how I got into the space. And the transition from research to founding a company was because, you know, the research was exciting, but ultimately, I think the way to have the largest impact on data
[00:03:26] Unknown:
is through building tools that people use. I'm wondering if you can describe a bit what it is that you're building at Metaplane, and some of the story behind how it came about, and why you decided that this was the problem that was worth spending your time and energy on. So when we were trying to implement some of our research at, you know, both small startups
[00:03:44] Unknown:
and very large companies, one of the largest friction points was making sure that the data was correct. And of course, data quality is a timeless problem. Papers on data quality go back to some of the original papers on databases. But after talking to my 2 co-founders, who were previously software engineers at HubSpot, it was clear that in the software world, there is a playbook, and there are tools to diagnose and fix issues: Datadog, SignalFx, you name it. Data teams don't have the playbook and don't have the tools yet. That's what we are building with Metaplane.
[00:04:22] Unknown:
Broadly speaking, you have positioned yourself in the space of data observability, which is a fairly new term that is still nascent and going through its sort of phase of self discovery. So I'm wondering if you can share your working definition of that term and what it means to have data observability.
[00:04:43] Unknown:
Great question. I think everyone has their own definition. For us, we like to go back to the roots, not just to software observability, but even to control theory, way back to when observability got started. It was James Clerk Maxwell trying to formalize how to understand the system that was used to control the speed of engines, called a centrifugal governor. It was originally created by Huygens to control windmills, and later to control steam engines. From there rose the idea of controllability: given the inputs of a system, how can you understand its internal states?
And observability is the corresponding concept, the mathematical dual: given the outputs of a system, how can you understand its internal states? A lot of this is inspired by the idea of, how can I reconstruct the state of a system at any point in time and understand how it will evolve in the future? And that was really the kernel of the idea that was borrowed by software observability vendors, where they said, okay, given metrics, traces, and logs, these three pillars, we can reconstruct the state of a software system at any point in time. And to actually answer your question, in data observability, we believe that there are four pillars, metadata, metrics, lineage, and logs, such that if you capture those four, then you can reconstruct the state of a data system over time. So we think of observability as how much you can reconstruct a data system, and correspondingly, the degree of visibility that you have into one.
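To make the control theory roots concrete: for a linear time-invariant system, observability has a precise textbook criterion. This formulation is standard control theory, not anything specific to Metaplane.

```latex
% Linear time-invariant system: internal state x, inputs u, outputs y.
\begin{aligned}
  \dot{x}(t) &= A\,x(t) + B\,u(t) \\
  y(t)       &= C\,x(t) + D\,u(t)
\end{aligned}
% The internal state x can be reconstructed from the outputs y (the system
% is observable) exactly when the observability matrix has full rank n:
\mathcal{O} =
  \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n
% Controllability is the dual condition on the inputs, via the matrix
% [B, AB, ..., A^{n-1}B].
```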
[00:06:27] Unknown:
Another interesting element of the space of data observability and data quality, and the sort of conflation of those two terms, is that they are areas, particularly in the data space, that have seen a lot of activity over the past one to two years. And I'm wondering what you see as the areas of differentiation across vendors in the space, and some of the ways that different interpretations of observability and quality can manifest?
[00:06:52] Unknown:
Just to start off, even though quality and observability are very conflated, and in some ways observability may be cynically regarded as a rebranding of quality: data quality is a use case, right? It's a problem to be solved, and a very important problem. Observability is a technology. By gathering and centralizing all of the metadata into one central system, you can, yes, identify and address data quality issues, but you can also address many other issues, for example, lineage, and spend monitoring, and usage analytics on your data warehouse. And vendors in this space, I think, do differentiate along that axis, which is: how comprehensive is the metadata that your tool collects? Is it focused just on the warehouse?
Or does it go upstream to transactional databases? Does it go downstream to reverse ETL tools or to BI tools? And the other axis, which we think a lot about, is how accessible this observability tool is. Is it designed for the Fortune 1,000, where you have to talk to the VP of sales and go through a large implementation process to use it? Or, like every other tool in the modern data stack, like dbt and Snowflake and Looker, can you just try bringing on observability in 15 minutes? If it works, that's great. If it doesn't work, no harm, no foul.
[00:08:20] Unknown:
And in terms of the axis that you were discussing, as far as, you know, do you cover the entire life cycle of the data, from the transactional systems and SaaS platforms that originate the data through to delivery and interpretation, and then closing that loop? Or do you decide, from some arbitrary point in the middle or at the end, this is the domain that I'm going to cover, and then figure out from there what are the attachment points and additional systems that you need to be able to maintain coverage of? I'm curious how you approached that question as you were starting to architect and iterate on the idea of Metaplane, to see what is the highest leverage point that we can go from that will provide the most value, versus saying, idealistically, we want to cover all of the use cases, and then figuring out what the approach is.
[00:09:17] Unknown:
You just gave us the answer right there. And it's a great question, where the highest leverage point for an observability tool is to connect to the warehouse. Right? That's the center of gravity of your data stack, so to speak. And there's only a handful of vendors that you need to integrate with to cover a large percentage of data organizations in the world. So we did start there. However, there's an ongoing debate in observability, even in software observability, of: do you monitor the causes, or do you monitor the symptoms? And there's pros and cons to both.
I think a lot of the software world has landed on the consensus that we do want to monitor the symptoms, because that is the most correlated with, you know, end user performance, and you can always trace back into potential causes. So in our world, it might be monitoring BI dashboards, or machine learning models, or go to market tools. So to answer your question, we're starting from the warehouse, but kind of going outwards to adjacent spaces.
[00:10:25] Unknown:
And as far as exploring those adjacencies, how did you think about the kind of network effects of saying, okay. We have covered the warehouse or even identifying what your coverage is for the warehouse to determine your level of completeness to say, we need to spend more energy at just covering all of the interactions and elements that exist within the confines of the data warehouse versus saying, we've got, you know, the core workflows managed, and now we need to think about branching out into these other systems so that we can get an end to end coverage of a small subset of use cases versus a complete set of coverage in a more narrowed sort of technological scope?
[00:11:06] Unknown:
Increasingly, I think more and more of the life cycle of data is within the warehouse, going from, like, a raw landing zone from ELT tools to a staging zone, to a fully modeled, ready to consume layer. So by integrating within the warehouse, I think you do cover a decent segment of the life cycle of data. We are going upstream and downstream, but not necessarily focused on 1 particular use case. To give you an example, we have integrations with both Snowflake and Postgres, and many of our customers use both of them. Not only for detecting schema changes that might be caused in upstream systems that impact downstream systems, but also replication issues.
So I think focusing on those sorts of jobs makes it so that tools like ours don't necessarily need to be focused on the use case, and yet will still address those use cases.
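As a rough illustration of the schema change detection described above, here is a minimal sketch that diffs two snapshots of a table's columns, as you might pull them from a database's information_schema. The names and logic are hypothetical, not Metaplane's implementation.

```python
def schema_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, dict]:
    """Diff two {column_name: column_type} snapshots of one table."""
    return {
        "added":   {c: t for c, t in after.items() if c not in before},
        "dropped": {c: t for c, t in before.items() if c not in after},
        "retyped": {c: (before[c], t) for c, t in after.items()
                    if c in before and before[c] != t},
    }

# Snapshots taken a day apart, e.g. from information_schema.columns:
yesterday = {"id": "bigint", "email": "text", "signup_at": "timestamptz"}
today     = {"id": "bigint", "email": "varchar", "referrer": "text"}
print(schema_diff(yesterday, today))
# {'added': {'referrer': 'text'},
#  'dropped': {'signup_at': 'timestamptz'},
#  'retyped': {'email': ('text', 'varchar')}}
```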
[00:12:03] Unknown:
And so can you dig a bit more into the actual implementation and architecture of Metaplane and how you think about the collection and analysis of the signals that people are generating from their various data systems and platforms and workflows?
[00:12:20] Unknown:
So every hour, we use Amazon's ECS to spin up some computing resources and make tens of thousands of observations across our customers' warehouses and BI tools and transformation tools. And on that hour, we retrieve the observations and train a machine learning time series model, or, depending on the type of test, an ensemble of models, and then alert you when there is an unexpected event.
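As a sketch of what one such test might look like, here is a robust z-score check on a table-level metric like row count. Metaplane's actual ensemble of models is more sophisticated, so treat this as illustrative only.

```python
import statistics

def is_anomalous(history: list[float], observation: float, z_threshold: float = 3.0) -> bool:
    """Flag an observation far outside the historical spread of a metric."""
    median = statistics.median(history)
    # Median absolute deviation resists past outliers better than stddev.
    mad = statistics.median(abs(x - median) for x in history)
    if mad == 0:
        return observation != median
    robust_z = 0.6745 * (observation - median) / mad
    return abs(robust_z) > z_threshold

hourly_row_counts = [10_200, 10_450, 10_390, 10_610, 10_550, 10_700, 10_640]
print(is_anomalous(hourly_row_counts, 10_720))  # False: normal variation
print(is_anomalous(hourly_row_counts, 2_100))   # True: the table lost most of its rows
```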
[00:12:46] Unknown:
And then as far as the periodicity there, you mentioned that you do it hourly. I'm curious what the sensitivity is to latency in being able to uncover some of these anomalous events, or being able to gain some insight about the performance of your data stack, particularly as it compares to the appetite for latency that you might see in a production software system?
[00:13:12] Unknown:
The characteristic time scale of the time series that we analyze makes it such an interesting problem. Right? In a given data warehouse, sometimes you might have data landing every minute, or even on a sub-minute or second basis. Other times, you might have data that's refreshed every week. And it's hard to say that one is necessarily more important to monitor than another. But at the end of the day, seasonality is incredibly important, especially now that we've passed the holiday season, and many of our customers are e-commerce companies. We don't wanna send everyone alerts the moment that Black Friday or Christmas comes. So we do account for seasonality, and we also account for trends. And we try and be very careful about understanding, okay, does this table change every day, or every week, and taking that into account.
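One way to fold seasonality into that decision, sketched with the STL decomposition from statsmodels: remove the trend and the weekly cycle, then alert only on what's left over, so an expected Black Friday spike doesn't page anyone. This is a stand-in for the models described above, assuming a daily metric in a pandas Series, not Metaplane's actual approach.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

def residual_outliers(series: pd.Series, period: int = 7, z: float = 3.0) -> pd.Series:
    """Return the points of a daily metric that remain anomalous after
    removing trend and weekly seasonality."""
    decomposition = STL(series, period=period, robust=True).fit()
    resid = decomposition.resid
    # Flag points whose residual is far outside the residual's own spread.
    return resid[(resid - resid.mean()).abs() > z * resid.std()]
```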
[00:14:07] Unknown:
And so one of the other things that you mentioned is the goal of being very accessible as a platform, to be able to bring some of these characteristics of observability to a broader audience. And I'm wondering who you think of as the primary consumers of the information that you're generating, and some of the ways that that has influenced your thinking about the feature sets that you need to develop, and some of the user experience and user interface patterns to lean on to be able to appropriately convey the necessary signals to the people who are interacting with your system?
[00:14:41] Unknown:
One of the most unexpected learnings throughout the course of starting this company is just how much data is changing the world. In our world, it's kind of clear: okay, startups are hiring heads of data, and large companies already have large data teams. But you'd be surprised, as we were, that even some of the smallest startups out there are starting to spin up data stacks. And the people who are creating the data stacks, and gathering and analyzing the data, don't necessarily have the title of head of data or data engineer or analytics engineer, but they are doing the work. So for a lot of growth stage companies, including many of our customers, the people who get the most value out of Metaplane at first may be the data engineers or the heads of data. But very quickly, we've noticed that the #metaplane-alerts channel can maybe start at 2 people, but quickly it grows to a dozen people, or 20 people, as, you know, something breaks, and I at-mention one of my colleagues saying, hey, this table looks weird, can you please check it out? You were recently touching this dbt job.
So the people who bring us in aren't necessarily the people who are impacted by the data, or causing some of the issues,
[00:15:58] Unknown:
but Metaplane is disseminated throughout organizations to all of them. You mentioned that you're also working with some of the lineage information and integrating with the BI systems, and I'm curious how that manifests. Is it something where somebody needs to go to the Metaplane dashboard to be able to view these different linkages and view the freshness of the report that they're querying? Or is it something where, in the BI system, you're going to incorporate a signal from Metaplane to say, you know, this report is up to date, this is the last time it was refreshed, these are the signals that we're relying on to be able to compute that fact? Or, from the dashboard, being able to say, I want to understand more about the state of this report, and then be able to jump from that into Metaplane to dig deeper into some of these aspects of observability in the data space?
[00:16:54] Unknown:
That is such a good idea, and one that we are looking at. Unfortunately, we don't have that dashboard for your dashboards yet, so to speak, or a KPI for your KPIs, if we're trying to be cute. But the way that our users consume the lineage information primarily is actually through the alerts. They're using a tool like Slack or email, and Metaplane sends them an alert saying, for example, this users table is typically updated every hour, but it's been 7 hours since it's been updated. We include downstream as well as upstream lineage. So, for example, these three Looker dashboards that have been viewed 500 times are impacted by this table being delayed, and here are the upstream dbt models.
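That freshness alert reduces to comparing the time since a table's last update against its learned update cadence; a minimal sketch, with hypothetical logic rather than Metaplane's actual model:

```python
from datetime import datetime
from statistics import median

def is_stale(update_times: list[datetime], now: datetime, tolerance: float = 3.0) -> bool:
    """'Typically updated every hour, but it's been 7 hours' => stale."""
    # Typical interval between consecutive updates, learned from history.
    gaps = [(b - a).total_seconds() for a, b in zip(update_times, update_times[1:])]
    typical_gap = median(gaps)  # e.g. ~3600 seconds for an hourly table
    return (now - update_times[-1]).total_seconds() > tolerance * typical_gap
```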
[00:17:40] Unknown:
And as far as the actual pieces of information that you're collecting, I'm wondering if you can talk to some of the sort of categories of data that you're collecting to be able to establish the kind of observability of the system. Like, what are the pieces of information that are necessary to understand the key aspects of how the data platform is operating and be able to dig into some of the problem occurrences or be able to proactively identify this is going to cause an issue with this downstream system and just how that compares to maybe some of the data quality focused tools and vendors in the market?
[00:18:21] Unknown:
So going back to the control theory roots of observability, there's this idea of a state space representation of a system, where you want to describe a system in as few variables as possible, and to make sure that the variables contain orthogonal information. And for us, we believe that there are four variables that matter for data systems, two of which describe the characteristics of the data itself. One is the internal characteristics, like the metrics of the data. Right? What is the nullness of this column? What is the mean? Does it contain PII? And one describes the external characteristics of the data.
How many rows are there? Is it being refreshed? And then there's two variables that describe interactions within the data system. Right? Lineage: is this data derived from a computation applied to previous pieces of data? And logs: how does this data interact with external people and external systems? So with those four together, all of which Metaplane collects, we centralize all of the metadata and surface potential issues. The subtle difference, I think, with a data quality focused system is that, even if an issue isn't occurring, we still are collecting this metadata.
Because down the line, you don't necessarily know when an issue will occur. And ultimately, you want to be able to look back historically to see how your data has trended and changed over time.
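As a sketch of what collecting the first two of those variables might look like for a single table, one warehouse query can pull the internal metrics and the external characteristics together. The table and column names here are invented for illustration; the other two pillars, lineage and logs, come from warehouse metadata such as query history rather than from the table itself.

```python
# Hypothetical profiling query for one table, collected on a schedule
# whether or not an issue is currently occurring.
PROFILE_SQL = """
SELECT
    COUNT(*)                                        AS row_count,       -- external: volume
    MAX(updated_at)                                 AS last_updated,    -- external: freshness
    AVG(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS amount_nullness, -- internal: metric
    AVG(amount)                                     AS amount_mean      -- internal: metric
FROM analytics.orders
"""
```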
[00:20:04] Unknown:
So for people who are coming from the software space and leaning on things like logs, metrics, and traces, and trying to work with their colleagues on the data teams and map their understanding, or people who are coming from the software space and trying to build out a data platform: what are some of the useful analogies or useful mappings that you've seen for being able to say, okay, if you're working in the software space and you're used to using these three tools to understand how to trace back the overall flow of requests in the system, and now I'm going to go from, you know, distributed systems tracing to lineage tracking, being able to map those concepts back and forth to rely on some of the existing experience that they might have?
[00:20:47] Unknown:
To kind of get the point of observability across to someone who might have a software background, I ask them, you know, remember what developing software used to be like in the early 2010s? You might write a Rails app, push it to an EC2 box, put on a heartbeat check, and kinda call it a day. Today, if you start a software project and you don't install an observability tool, it is a rough start. And I would just ask, can you imagine a world in which you're developing a software system, and the only way that you know that you have a slow API request is when your users tell you, or when your app breaks?
Because, frankly, that is the case in a lot of data teams today. Unfortunately, I just don't think that the technology has, you know, come about yet to make this possible. So we take a lot of inspiration from Datadog. To give you an example, when you bring on Datadog for the first time, there's a whole mountain of integrations. It's not even a question whether or not Datadog integrates with the majority of your systems. And the moment you integrate, the time to value is extremely quick, and you can cover a large swath of use cases. As an engineer, I use Datadog. It's not a question that they have me covered, not only for software quality issues,
[00:22:12] Unknown:
but for any other sorts of tests I might wanna run. In terms of the kind of lessons that you've learned from Datadog as this example of 1 of the biggest players in the monitoring and observability space for software systems. What are some of the negative signals that you've been able to learn from to understand where to diverge from their example?
[00:22:38] Unknown:
Excellent question. I think data and software are different. Right? You know, it's very en vogue for data vendors to talk about how we're adopting many practices, like CI/CD and test-driven development, from the software world and applying them to the data world. But there are significant differences. One huge difference is that concept of lineage. Infrastructure mapping and request mapping are probably a decent analogy, but not quite on the nose. The idea that a piece of data within a BI dashboard comes from a computation applied to a warehouse, which comes from Shopify via an ELT tool, is kind of foreign. I would say it's novel to our space, and it's so critical that it needs to be plumbed throughout the application.
[00:23:31] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Going back to your focus on the data warehouse as the initial target, and then focusing on some of the downstream systems that are consuming from it: what are some of the blind spots that you have identified from using that as your focal point, where some of your customers maybe have questions that they're unable to answer because they don't have information from the prior stages of the data life cycle? Or some of the cases where you're trying to close the feedback loop: I have information about the state of this data as it exists in the data warehouse, it's now being consumed by this downstream tool, and maybe it's going out through a Census or a Hightouch back into these various SaaS platforms that are being re-ingested into the data warehouse. And just some of the types of questions about the system state and the overall holistic view of the platform that you're unable to satisfactorily answer given your current viewpoint?
[00:25:09] Unknown:
It's funny that the warehouse is both the single place, or aspirationally the single source of truth, and yet the source of none of the truth. Right? Snowflake does not create data. The data comes from another place. It either comes from a SaaS tool, or a transactional database, or some other source. So I would say that many of the users who use Metaplane are happy with the coverage provided by us monitoring their data warehouse, but it's not the complete picture. Right? The complete picture goes downstream, all the way down to the consumer.
Sometimes it loops all the way back with the reverse ETL tools like you mentioned, and all the way upstream to where the data was generated.
[00:26:02] Unknown:
Another interesting aspect of systems like this is that there are kind of two camps of users: there are the people who know that they need the system and will ask somebody to build it for them, or build it on their own. And then there are the people who will argue that they don't need the system because they already have unit tests or CI/CD, etcetera, or, you know, they have a small enough team that they're able to maintain all of the context. And I'm wondering what are some of the elements of customer education that you have had to learn or develop, to be able to help people either identify when the solution that you have built is the answer to the problem that they have, or understand when the problems that they're facing are related to the fact that they don't have a complete view of the data as it transits their various systems.
[00:26:59] Unknown:
We like to describe the status quo as primarily EDT, which is executive-driven testing, such that when you have many, many tables in your warehouse, and even more dashboards, who's the first to know about data issues? Right? Is it the end user, like a CFO checking out a financial reporting dashboard? If so, that might be a use case that Metaplane can help you address. We can help you be the first to know, then help you identify the issue, and then help you resolve it faster. One of the best parts about working in the data space is working with awesome data teams. And in many ways, our users could build Metaplane; they know how to build Metaplane. It's like you're saying, right? Building things in house is often an option, but just because you can build it in house doesn't mean it's the best use of your time. Let us take care of some of the plumbing and orchestration and modeling for you.
[00:28:08] Unknown:
In terms of the kind of workflows for people who are relying on Metaplane to answer their questions, I'm curious how you're able to hook into the various interaction points of the systems, where maybe somebody's coming from the data warehouse and they want to understand the lineage of a table; or they're coming in from their dbt models and they want to know what the performance characteristics are of the transformations that they're building; or they're trying to figure out: I've got this data that's landing in the source table, I have this report that is querying this data, and the latency from the delivery of the source data to the refresh of the BI table is, you know, 6 hours, and I'm trying to figure out how I can cut that down, what are the performance bottlenecks?
Like, how are you able to establish some of these deep links into: I have this problem, I'm at this point of the system, to jumping into Metaplane to say, this is the piece of information that I need, versus having to start at the 30,000-foot view and then dig down every time?
[00:29:23] Unknown:
Well, the majority of our users got started with Metaplane without talking with us. You can just go to the website and try it out, almost a la carte style. You can connect your Snowflake, if you want, for the use case that you mentioned. You can connect your dbt instance or a BI tool to address the other use cases you mentioned. And we do have a free plan, where you can run schema change tests for free forever if you want, and then it scales up based on the number of tests you have. But once everything is connected, and this typically takes less than 15 minutes, it's up to you what kinds of tests you wanna add on top. We can automatically add tests for you if they're based on data warehouse metadata, like freshness or row counts. You might also wanna tailor it a little bit, if you want specific tests like tracking the distribution of a numeric metric or the uniqueness of a primary key. But the idea is that once you attach those tests, then let us do our thing. You'll have schema change alerts immediately, and after a training period that depends on the natural variation of your data, you'll start receiving alerts on freshness and volume, and those will include the metadata that you need. And from there, you can provide feedback to our models.
For one thing, alert fatigue is very real. If you're listening to this podcast, your Slack is probably going off like crazy, and we do not wanna contribute to that. We only wanna send you the best alerts possible. So the workflow of many of our users is to provide us annotations that we take into account.
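To make those two tiers of tests concrete, a hypothetical sketch of how they might be expressed as warehouse queries: the automatic ones come from warehouse metadata, while the tailored ones scan the data itself. All table and column names are invented; in Metaplane these are configured in the product, not in code.

```python
# Tests that can be added automatically from warehouse metadata.
AUTOMATIC_TESTS = {
    "orders_freshness": "SELECT MAX(updated_at) FROM analytics.orders",
    "orders_row_count": "SELECT COUNT(*) FROM analytics.orders",
}

# Opt-in tests tailored to specific columns.
TAILORED_TESTS = {
    # Uniqueness of a primary key: this should always return 0.
    "orders_pk_unique": "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM analytics.orders",
    # Distribution of a numeric metric, fed into the anomaly models.
    "orders_amount_mean": "SELECT AVG(amount) FROM analytics.orders",
}
```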
[00:31:02] Unknown:
To the point of alert fatigue, and being able to identify what are the useful pieces of information, and understanding when it's important enough to actually raise an alert versus just providing a sort of informational note when somebody logs into the platform: how are you exploring that continuum? Because it's different for every team, and it's always a difficult balance to strike, no matter how hard you try.
[00:31:30] Unknown:
It is challenging. Part of it is understanding that, ultimately, the impact is what matters. Right? If this table is not fresh, or the number of rows has tanked, and it's not being used by anyone, perhaps that's not a very high priority issue. It's not a P0; you don't have to throw away your afternoon to solve it. But there are a lot of nuances when it comes to modeling the data, specifically in the data warehousing setting, which is a different setting than a lot of the off-the-shelf time series analysis tools assume, where you have an extremely high sampling rate of data that varies quite a bit. In a data warehouse, there are additional constraints on top of that. Your data might not vary every second; it probably does not.
I'm gonna give you one small example. If you take the number of rows in a table, oftentimes the number of rows is purely additive, right, with many incremental models, if you are tracking the number of events or a clickstream. So we wouldn't want to alert you on standard increases, even if the increase is statistically significant. But in this sort of table, we would want to alert you on a decrease. And it's a combination of those sorts of intuitions codified into models, plus the knowledge that the data, again, is interrelated, it has a lineage, so that if 20 tables go down but many of them have a single root cause, we wouldn't want to send you 20 alerts; we'd want to send you maybe one, or a handful of alerts. And it's crafting this system around, at the end of the day, the only metric that matters for us, which is the signal-to-noise ratio.
That's kind of where the complexity comes in.
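A sketch of that append-only intuition codified as a rule, illustrative only:

```python
def should_alert_on_row_count(history: list[int], current: int, additive: bool) -> bool:
    previous = history[-1]
    if additive:
        # Incremental models only ever append, so never alert on increases,
        # however large; rows disappearing is the real red flag.
        return current < previous
    # Non-additive tables get a symmetric test against typical variation.
    deltas = [abs(b - a) for a, b in zip(history, history[1:])]
    typical_delta = sorted(deltas)[len(deltas) // 2]  # median step size
    return abs(current - previous) > 5 * typical_delta

clicks = [1_000, 1_250, 1_600, 2_100, 2_900]           # append-only event table
print(should_alert_on_row_count(clicks, 4_000, True))  # False: big jump, but additive
print(should_alert_on_row_count(clicks, 2_400, True))  # True: rows were deleted
```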
[00:33:25] Unknown:
In terms of the workflow of onboarding to Metaplane and starting to explore it on their own, I'm curious what you have seen as some of the largest motivating factors for people saying, this is a pain point that I have, and this is the problem that I'm trying to solve, when they signed up for it. And then maybe some of the ways that they can incorporate Metaplane into their development and maintenance routines for their data systems?
[00:34:05] Unknown:
I would say that it's about one third, one third, one third, roughly, in terms of users who bring on Metaplane. One third are people who work on data systems and have come from a software background, like you mentioned before, who say, I cannot live like this. Right? I cannot go on not knowing whether my data is correct, whether the end result that I produce from my warehouse, which by the way is a production system, is up. The second group of users have come from an organization where observability is available, probably a large organization where they built a tool in house, and then they go into a new setting, or start a new team, and say, I kind of need this. Right? Once you see it, there's no going back. And then the third, honestly, are data leaders where crap hit the fan.
Right? Something went terribly wrong, and they were the people who were held responsible. I'll describe it as kind of those three buckets. And the overarching theme is that observability is a new category. It's in very, very early days, but it is moving quickly, and oftentimes, there's no going back. Without observability, you could make the argument that it's not necessary, but if you've already tried it, you know how valuable it can be, and how little overhead cost is required to both bring it on and maintain it.
[00:35:37] Unknown:
In terms of the applications of Metaplane, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used, or some of the problem solutions that people have settled on that you didn't anticipate when you were initially designing the platform?
[00:35:53] Unknown:
To go back to the question about the warehouse as the focal point, we thought that we could just integrate with the warehouse, and that would carry us for a long time. But very, very quickly, our users pulled us both upstream, to say, please integrate with Segment, with Postgres, with a production data store, and they pulled us downstream, to integrate with feature stores, right, inputs into machine learning models, and into BI tools. So that was unexpected, but it happened pretty quickly, from the technological standpoint.
From the kind of social relationship standpoint, we were very surprised to see Metaplane turn into almost a nucleus for team interactions. Whether it's on the data team, where we send an alert, and you at-mention someone, or forward it to someone else on your team, or you include someone who is directly impacted by the issue after we start sending alerts. And that was a bit of a surprise to us. We came in thinking that Metaplane is infrastructure, and infrastructure lives in the background. Well, honestly, sometimes infrastructure lives in the foreground.
[00:37:14] Unknown:
Another interesting element of the observability space for data, and where it sits in some of these larger trends that we've been seeing in the evolution of the data ecosystem, is that there's a decent amount of overlap from a number of different directions, where, you know, you are overlapping with some of the data quality tools, because you're able to surface some of the debugging information or raise alerts on anomalous data situations. You are overlapping a little bit with some of the data catalog and data lineage systems, because you have this lineage tracking to be able to understand, from a debugging perspective, this downstream problem occurred, so now I need to trace it back to its root, or multiple downstream problems occurred because of an error at this focal point. And there's also the overlap with some of these software observability systems that people are repurposing to work with their data platforms. I'm just curious how you think about inhabiting that kind of Venn diagram of problem domains, and some of the ways that you think about both differentiating as well as coexisting with these other systems that have overlapping, but occasionally orthogonal, purposes?
[00:38:27] Unknown:
For us, it comes down to solving a problem for data teams. That's our number 1 focus. And for us, we think observability is a problem that data teams inevitably run into, as data is being used across more and more use cases, beyond the main historical use case of business reporting, to machine learning and artificial intelligence and data science, to go-to-market operations. As more and more data is collected and modeled and used, the stakes only go up and up and up, and it's a matter of time until teams need a way to be confident in the data.
So within the narrow observability space, we think that every team should be able to bring on a data observability tool if they want to, and not make a whole thing of it. Right? It doesn't need to be a quarterly initiative. Bringing on a software observability tool is more of an afternoon kind of project, just to get started, and it's the same way for Metaplane: 15 minutes, and then you have your initial suite of tests laid out. And in terms of overlap with adjacent metadata driven tools, data quality tools, and cataloging tools,
I think one note is that the data space is still very small and quite early, where in the future, I think that a lot of these tools will differentiate and expand into separate niches, in the same way that, in the software market, you have observability tools, and you have unit testing, and CI/CD, and build tools. There are overlapping use cases, but they've differentiated and become interoperable. To the point that asking if Datadog can replace a unit testing tool is kind of like: oh, you do need both at the end of the day. And the reverse is also true. Right? Just having unit tests, when you're building a complex software project, does not replace the need for an observability tool. So I think that there is a core set of metadata that is shared between these tools, going back to the metrics, metadata, lineage, and logs, but it's being surfaced in many different ways to solve different use cases for different people.
[00:40:44] Unknown:
In terms of your experience of building the Metaplane platform and working with your customers and working with practitioners in the space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:57] Unknown:
The main interesting lesson is the misconception that observability takes a long time to implement. Going back to the software world, observability is not viewed as a huge initiative just to get started. Right? It's almost like: my focus is on building a scalable, usable software system that meets the specs, and I can add observability on top. It won't be an enormous lift, either in terms of time, or energy, or cost. And I think a lot of data teams, for good reason, view observability as a major effort to get started, because many other initiatives are huge efforts, like setting up a data warehouse or setting up a BI tool.
We don't think that should be the case for observability. Observability should be 10 or 30 minutes, where you don't need dedicated resources to get started.
[00:41:55] Unknown:
And for people who are trying to understand more about the behavior of their data systems, what are some of the cases where Metaplane is the wrong choice, and they might be better served with a more narrowly scoped data quality tool or a data catalog tool, or with building out their own internal tooling or internal systems?
[00:42:21] Unknown:
For one thing, if your primary data asset is unstructured data, Metaplane is probably not the right place to get started, either Metaplane or any data observability tool. Once you introduce unstructured data, you have a whole different set of concerns and analyses. And also, if there are not many downstream users or systems depending on the data that you're responsible for, you probably don't need Metaplane for now. Right? The moment that there is a critical use case attached to the data, that's the right time to bring on an observability tool, hopefully before then. But at the beginning stage, where the stakes are still relatively low, me personally, I would focus on making sure that the data is tied to your use case, and then bring on observability.
[00:43:11] Unknown:
And in terms of the work that you're doing on Metaplane now and into the future, what are some of the things you have planned for the near to medium term, or any projects or problem spaces that you're excited to dig into?
[00:43:24] Unknown:
Well, the observability space is still in its early days. Right? Every company was started in the past two or three years and has less than 100 customers. And our goal is to be the observability tool that every team can use, and that means being the first tool with 100, and then 1,000, customers. To that end, we are working in two big directions. One is deepening our integrations, and some exciting ones in the pipeline include Databricks and ClickHouse. And we're also shipping new features that use the metadata that we collect to augment the primary use case of being the first to know about data issues.
Some examples include tracking the lineage upstream of the warehouse, like we mentioned, or tracking credit spend and usage analytics of the tables within your warehouse. You know, ultimately, we only exist to save data teams time and money, help increase awareness, and help them make better decisions. So observability is just a technology; there are many use cases on top of the technology that can be created in service of those overarching goals.
[00:44:36] Unknown:
Are there any other aspects of the observability space, or some of its future evolution, or the work that you're doing on Metaplane, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:44:48] Unknown:
I think data teams are so busy and resource constrained that if anything takes effort or time, it should really, really be critically evaluated. And that includes bringing on an observability tool or implementing testing of any sort. We think that tests are hard to add; you will not have tests unless you rule with an iron fist. Same thing with observability: it's not easy to bring on observability unless it's a priority, and it'll probably happen a little bit late. So that's why we focus on automating as much as possible, not only automating the collection of observations of freshness and row counts from your data warehouse, but also automating the modeling, automating the lineage extraction, and the feedback mechanisms.
Our focus is to have users only give us information when it's needed to improve the models over time, not to do rote work that a tool should take care of for you.
[00:45:52] Unknown:
And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, from your perspective, what is the biggest gap in the tooling or technology for data management today? And, as a corollary to that, what are some of the areas of data observability that you think are still yet to be defined or properly understood?
[00:46:22] Unknown:
The biggest gap in our world today is probably hiring, where every team that we talk to is trying to hire, but there is simply not enough talent to meet the demand. And while I don't think there's a purely technical solution here, there must be ways that better tooling and technology can help train the talent, help people differentiate themselves, and help them connect to great companies to start or grow their careers. At the end of the day, the goal is to bring data to more and more companies, and more and more use cases, and the biggest bottleneck is the amount of people who are skilled and able to do that. Right? Tools are just tools. Right? We love Metaplane, but it's people who are using the tools and creating the data that are making the difference.
There's one other interesting gap that Sylvain, who leads growth at Census, brought up, which is, you know, integrating a tool like a reverse ETL tool or an observability tool with your warehouse today. It's kind of a painful and insecure process, where you provision this role, and you create a connection string and paste it in. Where is OAuth for data warehouses? I think that is a pretty big gap that any of the big vendors could figure out, and then the warehouse providers themselves, vendors like us, and users on data teams, everyone wins. And it's not like OAuth is a new technology.
[00:47:54] Unknown:
Yes. But then you don't get to have the joy of configuring your ODBC drivers.
[00:48:00] Unknown:
It's fun the first time. That's for sure. It's never fun. Maybe
[00:48:04] Unknown:
Not even the first time.
[00:48:06] Unknown:
You're right about that. That's true. That's true. Maybe to unlock OAuth, you have to do it one time just to... yeah.
[00:48:13] Unknown:
Yeah. And to your point about hiring, I think there's a great talk from the early days of the DevOps, I don't know if revolution or evolution or emergence is the right term, from Jez Humble, where he gave a presentation called Stop Hiring DevOps Experts (and Start Growing Them). And I think that we're in a similar situation in the data space, where we as organizations and engineers need to stop thinking that the solution to the capacity of our data teams is to go out and hire somebody who's already an expert, and start giving internal people the opportunities to grow into the role, and give them the education they need to be effective in that space.
[00:48:52] Unknown:
I think that is 100% right, and for one, I have to go read that article. Now, there are many people trying to break into data. Right? There's two truths that we have to hold in our minds at once: one is that talent does not meet demand, and the other is that data is a hot space and many people are trying to break into it. It's kind of a disconnect between junior and senior, and the way to bridge that gap, in lieu of external training, is, like you mentioned, being able to foster that talent yourselves.
[00:49:22] Unknown:
Well, I appreciate you taking the time today to join me and share the work that you've been doing on Metaplane and your perspective on the data observability space. It's definitely a very interesting ecosystem, and one that, as we've discussed, is still very nascent. So it's nice to see a lot of people who are starting to explore that problem domain and helping to flesh it out and understand the utilities and constraints of that overall solution space. So I definitely appreciate all the time and energy you put into that, and I hope you enjoy the rest of your day. Yeah. You too. Thanks a lot, Tobias. Take care. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Kevin Hu: Introduction to Metaplane
Kevin Hu's Journey into Data Management
Defining Data Observability
Data Quality vs. Data Observability
Architecting Metaplane: Key Decisions and Challenges
Implementation and Architecture of Metaplane
Primary Consumers and User Experience
Integrating Lineage Information
Collecting and Analyzing Data Signals
Mapping Software Observability to Data Observability
Blind Spots and Challenges in Data Observability
Customer Education and Adoption
Onboarding and User Workflows
Balancing Alerts and Information
Motivating Factors for Adopting Metaplane
Unexpected Uses and Team Interactions
Overlapping Domains in Data Observability
Lessons Learned and Challenges
Future Plans and Integrations
Closing Thoughts and Contact Information