Summary
Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Transform is and the story behind it?
- How do you define the concept of a "metric" in the context of the data platform?
- What are the general strategies in the industry for creating, managing, and consuming metrics?
- How has that been changing in the past couple of years?
- What is driving that shift?
- What are the main goals that you have for the Transform platform?
- Who are the target users? How does that focus influence your approach to the design of the platform?
- How is the Transform platform architected?
- What are the core capabilities that are required for a metrics service?
- What are the integration points for a metrics service?
- Can you talk through the workflow of defining and consuming metrics with Transform?
- What are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?
- What are the lifecycle stages that need to be factored into the long-term maintenance of a metric definition?
- What are some of the capabilities or projects that are made possible by having a metrics layer in the data platform?
- What are the capabilities in downstream tools that are currently missing or underdeveloped to support the metrics store as a core layer of the platform?
- What are the most interesting, innovative, or unexpected ways that you have seen Transform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Transform?
- When is Transform the wrong choice?
- What do you have planned for the future of Transform?
Contact Info
- @nick_handel on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Transform
- Transform’s Metrics Framework
- Transform’s Metrics Catalog
- Transform’s Metrics API
- Nick’s experiences using Airbnb’s Metrics Store
- Get Transform
- BlackRock
- Airbnb
- Airflow
- Superset
- Airbnb Knowledge Repo
- Airbnb Minerva Metric Store
- OLAP Cube
- Semantic Layer
- Master Data Management
- Data Normalization
- OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy.
Your host is Tobias Macey. And today, I'm interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack. So Nick, can you start by introducing yourself? Yeah. Thanks for having me, Tobias. I'm a big fan of the show. So I am the cofounder and CEO of Transform. And do you remember how you first got involved in data management?
[00:02:41] Unknown:
Yeah. So for me, I originally studied math and then joined BlackRock out of college. And so I was, you know, working on a bunch of different technologies that I think now would be considered legacy tooling, but learned a lot about just, you know, how BlackRock was using various macroeconomic datasets to build models and do analysis on some of their portfolios. And so from there, it kind of progressed towards wanting to do things in kind of, I'd say, more modern tooling. And so started exploring different opportunities and moved over to Airbnb in 2014, originally as a data scientist.
And this was kind of a golden era of Airbnb's data team. There was a bunch of investment in tooling like Airflow and then Superset, the experimentation platform, the knowledge repo, just a bunch of great tools. And so, you know, kind of progressed from there.
[00:03:42] Unknown:
My understanding is that your work at Airbnb and experiencing the work that they were doing with their metrics layer was some of the inspiration for what you're building at Transform now. So I'm wondering if you can just give a bit of the backstory of how you ended up where you are now and what you're building at Transform.
[00:04:00] Unknown:
I actually joined a few weeks before Airbnb released the very first version of its metric store. It was called metrics repo, and it was actually within the experimentation tool the company was building. So Airbnb was going through this shift of kind of being a very design led company to being a both design plus data led company. And as a part of that, it was really investing in tooling around product experimentation. And so I had joined the growth team and the, you know, primary job that I had as a data scientist was to run experiments. And when I first joined, I was, you know, really just a bottleneck to the product team that I was on because it took me so long to run analysis on each of these individual experiments. And this tool came around that basically just made it easy to define the various metrics that I wanted to use and do analysis on my experiments with and just built out the pipelines to then serve those metrics to this kind of experimentation readout.
And so, you know, very quickly went from running an individual experiment a week to running tens of experiments at the same time and actually getting to dive a lot deeper into the interesting parts of them because all of the metrics and all of the different basic analysis, the kind of stats testing and whatnot was served to me in this nice clean readout. And so over time, Airbnb invested more and more in that framework. And, you know, originally, it really kind of served the use case of experimentation, but data scientists started to see that there were different applications. And so my, you know, very naive approach was to start running fake experiments and generating metrics out of this, kind of automated data pipelining tool to then pull into analysis. And then later on, it evolved into the tool that is now Minerva, which Airbnb talks a lot about. In the context of
[00:06:04] Unknown:
analytics and data platforms, I'm wondering if you can just share your definition of what a metric actually is and some of the ways that they manifest throughout the data life cycle.
[00:06:16] Unknown:
Yeah. So a metric is a bit of an abstract concept. And to make it a bit more concrete, I might dive into a specific example from Airbnb. So one of our key metrics was nights booked. It was kind of the North Star metric for the whole company. Every team tracked it. Every experiment run at the company was either trying to impact it or make sure that it didn't impact it and get something else done. And so that metric actually makes sense in a bunch of different contexts. So, you know, it makes sense as how many nights booked were there by country, by listing type, by super host status, by whether it was a tree house or not. These things are called dimensions, and they bring context to numerical data. And so being able to aggregate that metric to many different dimensions is really powerful, and there's a clear relation here to OLAP cubes.
And data engineers and, you know, today kind of more and more analytics engineers are responsible for building these nice clean interfaces into the data warehouse for broader business consumption. And so by, you know, capturing these definitions for metrics in a somewhat abstract way and then being able to flexibly build them to various different dimensional levels, we can, you know, serve these nice clean datasets to the company that then allow less technical users to consume them. You know, we've seen a bunch of different solutions here around kind of summary tables or just, you know, queries existing in a bunch of different downstream tools from BI tools to really, really a wide range of different places where people want to consume metrics.
And so the point of this definition of a metric in our framework is to then be able to both build those datasets in the warehouse and also build them in downstream tools consistently.
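As a rough sketch of the idea Nick describes here, the same metric can be aggregated to many different dimensions. The records and field names below are invented for illustration, not Airbnb's actual schema:

```python
from collections import defaultdict

# Hypothetical booking records; field names are illustrative only.
bookings = [
    {"country": "US", "listing_type": "apartment", "nights": 3},
    {"country": "US", "listing_type": "treehouse", "nights": 2},
    {"country": "FR", "listing_type": "apartment", "nights": 5},
]

def aggregate_metric(rows, measure, dimension):
    """Aggregate one measure to one dimension: the same metric, many cuts."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

print(aggregate_metric(bookings, "nights", "country"))
# {'US': 5, 'FR': 5}
print(aggregate_metric(bookings, "nights", "listing_type"))
# {'apartment': 8, 'treehouse': 2}
```

An OLAP cube is effectively this aggregation precomputed for every combination of dimensions; a metrics layer keeps the single definition and produces whichever cut is requested.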
[00:08:05] Unknown:
One of the other pieces of terminology that I've encountered that is reminiscent of what we're discussing here with the idea of metrics is the concept of master data management, where you have this one golden table that says, if you need to be able to query against nights booked, to use the example that you gave, then you query against this table because we did the calculation ahead of time for you. And I'm wondering if you can just draw some parallels between some of the ways that master data management has been done historically and some of the challenges that it poses and what you are working towards with Transform to enable this more sort of flexible category of metrics that can be calculated sort of at query time.
[00:08:50] Unknown:
Yeah. So there's this really interesting history of, you know, semantic layers in general. And there are, you know, a wide range of takes historically, whether they existed inside of business intelligence tools or they existed, you know, within kind of data warehousing type solutions. And the point of this tool is really to kind of pull that out and separate it from the various pieces of infrastructure that are either storing or applying compute to data, and then all of the different places where people want to consume metrics.
[00:09:24] Unknown:
As you said, there have been a few different generational shifts, with the idea of the metric store being the most recent one, and one that's been gaining a lot of attention at least in the past few months that I've been seeing it popping up. And I'm wondering if you can just talk through some of the ways that those different semantic layers have been managed and some of the challenges and complexities that teams face when trying to create and manage the context and the semantic meaning around data
[00:09:54] Unknown:
and sort of what you see as driving the shift towards this dedicated metrics layer? Yeah. So it might help to kind of back up and define what a metric store is and then kind of dive into the various takes. And so I see a metric store as really these four pieces. The first is the semantics for how you capture the information. And it seems relatively simple. It's, you know, various tables in the data warehouse, and they have connections or relationships to each other. But actually, it's probably one of the most important pieces, and it's something that Airbnb iterated on for years.
And it's also quite hard to change once you start capturing information, so kind of moving between different ways of capturing the semantic information is a challenging evolution. The second piece is really around performance. And that's kind of getting at this question that I think you're asking around, you know, static versus dynamic. Are you building the datasets in the data warehouse, building them to some kind of location that can serve them really quickly? Or are you asking on the fly for some kind of metric denormalized dataset to get constructed?
The next two are really kind of how you are exposing that data to the rest of the company. And so the third piece is governance. How do you apply life cycle management? How are you managing the definitions of these metrics? And the last one is interfaces. How are you exposing these metrics to all of the different places where they're getting consumed? And so, you know, when I look across various tools that exist, I think the techniques that they're applying can largely be bucketed into those four categories, and there's varying levels of investment in each of those. And so, you know, I think that there are quite a number of tools out there that solve problems in each of those spaces.
But the kind of metric store in my mind is a holistic solution to
[00:11:57] Unknown:
how am I consuming data off of the data warehouse to how am I making sure it's right and getting it into the various tools where it needs to get consumed from. As you're saying, historically, there have been a few different approaches to solving different pieces of the problem where a lot of it will live maybe in the business intelligence tool where there's a way to add context to a particular calculation. But then if you need to be able to use that same calculation in a Spark job, for instance, then there's no clean way to be able to access that because it doesn't live in a place that Spark can easily get to without reaching into the metadata database for the business intelligence tool.
And I'm wondering, what are some of the potential negative impacts of having slight differences or inconsistencies in how these metrics are calculated and maintained, and differences in life cycle that can come about if you think that you have sort of replicated a metric, you know, accurately in two different places, but then later find out that maybe you, you know, flipped an operation or changed an order of operations somewhere, and all of a sudden you're wildly divergent.
[00:13:07] Unknown:
Yeah. Yep. Exactly. And so the challenge here is, how do you in this process of doing denormalization, once you have, you know, these nice clean normalized models sitting in your data warehouse, how are you then going and kind of consistently building the datasets that you wanna consume in all of the different places that you wanna consume them. And so, you know, there are a lot of different negative consequences, but I think that it kind of all boils down to lost trust in data and a lack of productivity amongst the kind of data consumers.
Can they easily access the metric that they're trying to consume, and do they trust others when they say they have some kind of insight? When I joined Airbnb, there were three definitions of the company's North Star metric, bookings. And so, you know, the big challenge there was that different teams would come to a meeting and say they saw this thing happening in the business, and then there would be some disagreement. And ultimately, what it would boil down to is two data analysts staying after that meeting and just hashing out, you know, specific nuances of the SQL that they had written. And so it was an incredibly inefficient process, but worse, it, you know, led to the higher ups coming to those meetings to just say, I'm just gonna use, you know, intuition here. Let these data analysts figure this out, and then, you know, next time, we'll come back and look at the data.
[00:14:31] Unknown:
And in terms of what you're building at Transform, what are sort of the primary goals that you have for the platform and the target users that you have in mind as you're building out the overall system and the user experience design and the integration points?
[00:14:49] Unknown:
So the company's mission is to make data accessible. And the philosophy, you know, behind how we're going to do that is that there needs to be better interfaces for data producers and data consumers, broadly bucketed, to communicate with each other. And so, you know, our hypothesis is that a metric is a really great interface, because in some ways, it's the language that nontechnical users, you know, use to then communicate around data. And so, you know, this all starts with establishing a definition in our metrics framework, and then exposing that broadly to be, you know, both computed, but also to kind of share that definition and that metadata with a wide range of tools. And so that's where our APIs and our metrics catalog kind of come in. And then on top of that, there are kind of a bunch of different ideas for how we can use those metric datasets to do interesting things.
So there are, you know, ideas around forecasting and anomaly detection and, you know, applying annotations to metrics and building datasets for experimentation and, you know, really just kind of pushing metrics into the various places where people can then make use of them. So the two users of the tool in my mind are kind of these, you know, broad buckets of data producers and consumers. And I think to get a little bit more granular on, you know, data producers, it's some combination of data engineers, analytics engineers, data analysts. They're the people who build the normalized datasets in the warehouse and have a hypothesis around how they should be consumed by the broader company. On the consumer side, you know, probably about 97% of most companies are not, you know, data workers. They're not data analysts or data engineers.
And so really, I think, you know, these metrics should be consumable much more broadly. And so that means building nice interfaces that allow them to then consume those datasets or to pull them into the interfaces that they know and like to consume datasets from. And so in order to accomplish that, the metrics framework, which is really aimed at that data producer, is, you know, a framework that's built around YAML and SQL. It's committed to Git in order to do version control, and then that publishes these metrics through our catalog, which is kind of the first demonstration of the power of some of our APIs.
And, you know, hopefully, that catalog makes it easy for this data consumer to then kind of ask basic questions. Show me this metric sliced by this dimension. And then beyond that, you know, there are a bunch of different interfaces that data producers also want to expose their datasets in. And so we publish a number of different APIs that can then connect to anything from business intelligence tools and Jupyter Notebooks to, you know, GraphQL and React, which allow front end developers to build on top of Transform.
[00:17:58] Unknown:
And can you dig a bit deeper into the way that the platform is architected and some of the system design considerations that you had to deal with as you were building out the initial versions of the platform and some of the ways that it has grown and evolved since those initial prototypes?
[00:18:15] Unknown:
The core of this platform is really this semantic layer, where the data producer is defining these YAML files. And these YAML files have some amount of kind of SQL expressions in them. But really, the most important part are the abstractions that we've chosen for how to capture this information and what those abstractions enable. And so those files then get parsed by the semantic layer. And we have a server which then basically builds SQL against the customer's underlying data warehouse. So everything that we do is built on top of the customer's data warehouse. We use their existing storage and compute, but we can kind of do two deployments because of that infrastructure.
So one is where we're actually deploying on their virtual premise. That means that they are connecting their data warehouse to Transform. It's all staying in their ecosystem, and they kind of get all of the security guarantees that they want. The other option is a hosted version where we're basically just building SQL to their data warehouse and not actually passing any data back to our ecosystem. So in terms of the specifics of what's built out, the metrics framework that we use is written in Python. The front end is TypeScript, GraphQL, React, and then the APIs are written around a GraphQL core.
But, really, there are, you know, any number of interfaces that we can build on top of that, and that would be in whatever language that's being consumed by. So our command line interface and our Python client are both built in Python. Our JDBC is built in Java, and then our front end is built using the same React components and GraphQL interface that we are then exposing to our customers so they can build on top of those same APIs.
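To make the compile-to-SQL step concrete, here is a minimal sketch of a semantic layer that turns a stored metric definition into warehouse SQL on request. The definition format, table, and column names are invented for this example and are not Transform's actual MQL syntax:

```python
# Toy semantic layer: one stored metric definition, compiled into
# warehouse SQL on demand. All names here are hypothetical.
METRICS = {
    "nights_booked": {
        "table": "fct_bookings",
        "expr": "SUM(nights)",
    },
}

def compile_metric_sql(metric_name, dimension):
    """Build a GROUP BY query for a metric sliced by one dimension."""
    m = METRICS[metric_name]
    return (
        f"SELECT {dimension}, {m['expr']} AS {metric_name} "
        f"FROM {m['table']} GROUP BY {dimension}"
    )

print(compile_metric_sql("nights_booked", "country"))
# SELECT country, SUM(nights) AS nights_booked FROM fct_bookings GROUP BY country
```

Because every interface goes through this one compile step, changing the definition in one place changes the generated SQL for every consumer.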
[00:20:05] Unknown:
In terms of the actual workflow of building a set of metrics and then consuming it downstream, what's involved in actually defining a metric, populating that into Transform, validating that, you know, in terms of any sort of organizational discussion that needs to happen around that, and then being able to consume that from a downstream system, whether that's business intelligence or a Jupyter Notebook or a Spark pipeline, for instance?
[00:20:33] Unknown:
The actual definition workflow is, you know, typically done locally, and we have a command line interface that makes it easy to iterate on these config files, test them, you know, run variations of metrics that already exist or define new metrics. Then it, you know, follows kind of the standard code commit practices that the company is using. So those files will get committed to Git. Those, you know, once merged, would go to our MQL server, get parsed into the current active semantic layer. And then any API requests coming in would be made against that current semantic layer.
And so that means that, you know, our front end is then building on top of these current definitions. But another really cool thing about this is that if a metric definition changes in that semantic layer, then all of the different places that the company is referencing that metric, so through our JDBC over SQL, or through some notebook, you know, really any of the kinds of interfaces that they're consuming it would then be consistent because they're getting the current definition of that metric. The nice thing about this is that we're really building on top of the same interface that we're exposing to our customers, which means that once a metric is defined in this framework, it should be consistent across all of the different places that they're consuming it. And as far as the integration
[00:21:59] Unknown:
with the customer's data systems and data platform, you mentioned that Transform sits on top of the data warehouse layer. And I'm wondering what types of validation and introspection you need to do to be able to provide useful feedback to the engineers who are building the metrics definitions as they iterate on defining them and creating the code representation that they're then going to commit and populate into Transform.
[00:22:26] Unknown:
Yeah. So the core of this dev workflow is to basically be able to run this semantic layer against whatever set of configs you're using. And so, you know, the objective here is to really be able to iterate off of the current version of this kind of semantic mapping of the data warehouse and then to be able to use those configs in the same way that you would use the configs that are currently in production. And so it effectively gives the end user the same experience as if they were just querying the production MQL server.
[00:23:00] Unknown:
Because of the fact that you are targeting the data warehouse, I'm wondering if there are any challenges in being able to extend this layer or if it even makes sense to try and extend this layer to account for more semi structured or unstructured data storage locations or if it's purely something that only really makes sense on a
[00:23:23] Unknown:
data warehouse that already has some measure of structure applied to it? Yeah. Right now, you know, we're really focused on kind of the data analytics use case. And because of that, we're primarily building on top of the data warehouse as it exists and the structured datasets that are already there. I think that that probably satisfies the large majority of applications for metrics. And so I think, you know, it'd be probably good to understand what kinds of metrics are getting built off of unstructured or semi structured datasets to really be able to answer that question.
[00:24:00] Unknown:
In terms of the actual life cycle of a metrics definition, I'm wondering what are some of the interesting stages that it progresses through from when it's first instantiated and somebody determines that they need to create this calculation through to, you know, many years down the road where the business shifts and maybe the underlying meaning of the metrics change, or you need to be able to incorporate additional factors into how the metric is calculated or what the overall value should be? So this is a really interesting and important evolution
[00:24:34] Unknown:
of the framework that we saw at Airbnb. And, you know, for the first two years of this framework, there was really very little governance aside from the fact that it was being committed to Git. There was very little kind of oversight of what these metrics were and who was consuming them and how they were consuming them and which ones were old. And so there was a big push to basically think through what are the stages of a metric life cycle. And I think that, you know, there's been a lot of iteration, and Airbnb published some great blog posts about this, but we have our own definition. Our take is that there are four stages. So it starts with definition.
I have an idea of how I want to measure something. How do I define this? Is it different than the other metrics that exist in this framework? How do I compare it to existing metrics? How do I test it? Who do I want to consume this, and how do I want them to consume it? And that kind of leads into the second stage, consumption. If I want to consume this, am I using this right? Does it mean what I think it means? Is it up to date? Is it still accurate? Is the data good? And, you know, generally, am I able to pull it into the tools that I want to consume this from? The third stage is iteration.
So I think that this metric needs to change. Who needs to know? Why is it changing? How is it different than before? What's actually changed about this metric? How do I compare it to the old version? And how do I then, you know, in the UI or in kind of these APIs, generate the old version if for some reason I still need to do that? The final stage is archival. So if this metric is old, how can I stop others from consuming it? Where does it go? You know, do I still want to maybe calculate it at some point in the future, but I wanna make sure that nobody else is calculating it? And how do I retain the knowledge that's been built around this? So I, you know, don't want people necessarily to consume this, but I still probably learned some valuable things around this metric over time, and we used it to make decisions.
And so there's some kind of lasting institutional knowledge that's been created that needs to be tracked over time.
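The four stages can be sketched as a small state machine. The transition rules below are one plausible reading of the life cycle described here, not Transform's actual implementation:

```python
from enum import Enum

class MetricStage(Enum):
    DEFINITION = "definition"
    CONSUMPTION = "consumption"
    ITERATION = "iteration"
    ARCHIVAL = "archival"

# Hypothetical transition rules: a metric moves forward through the
# stages, and iteration loops back into consumption once a change ships.
TRANSITIONS = {
    MetricStage.DEFINITION: {MetricStage.CONSUMPTION},
    MetricStage.CONSUMPTION: {MetricStage.ITERATION, MetricStage.ARCHIVAL},
    MetricStage.ITERATION: {MetricStage.CONSUMPTION, MetricStage.ARCHIVAL},
    MetricStage.ARCHIVAL: set(),  # archived metrics stay archived
}

def can_transition(current, target):
    """Check whether a metric may move from one stage to another."""
    return target in TRANSITIONS[current]

print(can_transition(MetricStage.CONSUMPTION, MetricStage.ARCHIVAL))  # True
print(can_transition(MetricStage.ARCHIVAL, MetricStage.CONSUMPTION))  # False
```

Making archival a terminal state captures the point that retired metrics should stop being consumed while their definitions and history remain on record.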
[00:26:47] Unknown:
There are a few interesting points from that that are largely based on the organizational aspects of the metric, particularly in terms of who needs to know about this metric changing, who needs to be brought in to help with the definition of the metric or validate that the way that I'm calculating it is accurate. And I'm wondering if you can just talk through some of the collaboration aspects, what you're building with Transform and how you think about enabling these organizational workflows beyond just the technical implementation?
[00:27:19] Unknown:
So in our minds, the biggest challenge around, you know, helping an organization to define these metrics is really kind of creating that interface between the data consumer and the data producer. So I said this previously, but we really do believe that the metric is the ideal interface because it is currently the language that data consumers around a company are using to describe data and to understand it. And so by enabling the data producer to then go out and define these metrics and kind of follow some process, at the very least it establishes a standard for, you know, where it's located and how it connects to these various systems.
And so that enables an organization, I think, to build some of their own process around metric definition. And hopefully, you know, on the other side of this, there is a product that can then support the process that they're trying to build. And so I think that that is probably 1 of the biggest things that we will be working on in the future as we continue to expand our customer base is just understanding all of the differences between how these organizations are consuming metrics. And, you know, what that means for the actual process that they want to follow to make sure that those metrics are agreed upon and trusted across the organization.
[00:28:44] Unknown:
Your point about the metrics layer being the interface between data producers and data consumers puts me in mind of the feature store, which is another layer that's been gaining a lot of ground recently that acts as that same kind of interface point with the difference being that that's primarily for the machine learning workflow versus the analytics workflow that the metric store empowers. And I'm wondering if you have any thoughts on sort of the juxtaposition of the metrics store versus the feature store and the relative utility of metrics versus features and maybe some of the overlap that might exist where you might want to have some level of communication between your metrics and your feature stores and how those different calculations are defined and performed?
[00:29:27] Unknown:
That's a really great question and something that I kind of glossed over in my background was, for a while, I was working as a product manager at Airbnb, and the team I was working on was building out Airbnb's feature store, Zipline. And so at the core, I think these 2 things are very similar. But there are some really significant differences that I think make it a long way off of being a kind of similar piece of infrastructure that is gonna get built out. But at the core, you know, what they're doing is creating derived data and then serving that derived data to a specific application.
The really hard part here around the feature store is that there are much stricter requirements around the way that a feature is defined, and it tends to be a lot more granular. And that means that it doesn't necessarily serve the analytical application nearly as well where you want to be able to slice and dice and ask different questions. There are some other complicated ones around timeliness, you know, feature stores require some kind of melding of real time and batch data construction. Machine learning models tend to require something called point in time correctness or time travel.
And it's a complicated subject, but it's also something that, you know, is fairly different between analysis and feature construction. And then the last really big difference between the 2 is consumption and reuse. And so there are really strong forces within organizations that push metric consumption to be consistent. At the core, a metric is, you know, really just a way of kind of compressing a bunch of information that a company is collecting into something that's useful for decision making or analysis. And what that means is that broadly, you have companies that are trying to push for a consistent definition across teams, across individual data analysts.
It just makes the world simpler if everything is clean and consistent. And that's a really big difference compared to features because features can perform better in certain models, and sometimes you want many different iterations of the same features. And so, you know, the ways that I saw feature stores being adopted was primarily taking a feature, iterating on it, and then, you know, ending up with another variant to that feature. And that's not really something that you do with a metric or if you, you know, do that kind of analysis, it is through some dimensions, and it's not actually changing the core definition of the metric.
You're just kind of aggregating it to some different granularity.
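The point-in-time correctness requirement mentioned above can be illustrated with a small as-of lookup: when a feature store builds a training example for a given event time, it must return the feature value as it was known at that time, never a later one. This is a simplified stdlib-only sketch; the function name and data shapes are assumptions for illustration.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """history: list of (timestamp, value) pairs sorted by timestamp.
    Return the latest value known at or before `as_of`, which avoids
    leaking information from the future into a training example."""
    times = [t for t, _ in history]
    i = bisect_right(times, as_of)
    if i == 0:
        return None  # feature not yet observed at that time
    return history[i - 1][1]


# A toy feature history: a user's lifetime booking count over time.
bookings = [(1, 0), (5, 1), (9, 3)]
```

An analytical metric query, by contrast, typically just aggregates over a window as of now, which is part of why the two systems diverge.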
[00:32:12] Unknown:
In terms of the granularity and dimensionality of the metric, I'm wondering if you can dig a bit deeper into some of the complexities that come up and some of the ways that somebody who's trying to build a metric definition can shoot themselves in the foot when they're trying to figure out how do I calculate this metric and then be able to actually explore it at sort of different levels of granularity and dimensionality and just some of the sort of technical and cognitive complexity that arises from that. When I think about the
[00:33:16] Unknown:
most, you know, complicated part of this tool, the most complicated technical challenge, I really think that it's denormalization. And so, you know, to kind of back up and just quickly define normalization and then denormalization. So, you know, normalization is defined as reducing data redundancy and improving data integrity. And so the goal there is to basically define these nice clean datasets that don't replicate data around the warehouse because then they're much easier to manage. There are a bunch of great tools that have come out, you know, more recently that have enabled companies to build better cleaned normalized datasets.
And there's been a ton of research in this space and a ton of discussions of different techniques of normalization, like Kimball and Inmon, etcetera. And so, you know, when I think about what do you do with the data from there? Well, that's really great that that data is clean, but then you need to go and make it useful. And in order to make it useful, you need to start merging datasets. You need to start, you know, applying filters and doing all of the different things that happen in SQL or in Python to kind of transform data. That's really where this framework is aimed at supporting our end users technically.
And so the input into our framework is typically these nice clean normalized datasets. And you can put in raw datasets and partially denormalized datasets. But really, you get a lot more out of this framework if you've gone through the work of building these nice, clean, and normalized datasets. And so from there, you know, denormalization is happening across so many different tools today. It's happening in the data warehouse where we're building summary tables. It's happening in the BI tool where we're asking some question. It's happening in dashboards where we've, you know, asked a bunch of specific questions.
And so what we really wanna be able to do is build those metrics to a wide variety of granularities consistently across all of those tools. And, you know, 1 of the biggest challenges there is what are you doing ahead of time and what are you doing on the fly? And so, you know, ideally, you want those datasets to be really snappy. Right? You want your BI dashboards to load quickly. But the more you've kind of baked into your tables in the data warehouse, the fewer questions you can ask. And so the power of these data modeling frameworks is that they enable you to ask a wide variety of questions while also consuming those datasets in all of the different places where you want to consume them, and hopefully it's making them much faster.
And so, you know, the kind of core technical challenge of our framework then is enabling denormalization to happen in all of these different places efficiently and consistently. And so in order to solve for that, we've worked on a bunch of different approaches to caching datasets and trying to make that end result, whatever the question is, whether it's something that that the end user has, you know, pre specified as a question that they ask frequently, or if it's a question that is new, trying to make that dataset as fast as possible.
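As a toy illustration of the denormalization step described here, a metrics layer joins a normalized fact table to its dimension tables and pre-aggregates a measure to a requested granularity. Everything below is an assumption for illustration; a real planner would emit SQL against the warehouse rather than aggregate in Python.

```python
from collections import defaultdict


def rollup(fact_rows, dim_lookup, measure, dims):
    """Denormalize and pre-aggregate: join each fact row to its dimension
    attributes, then sum `measure` grouped by the requested `dims`.
    fact_rows: list of dicts; dim_lookup: {dim_id: dict of attributes}."""
    totals = defaultdict(float)
    for row in fact_rows:
        attrs = {**row, **dim_lookup.get(row["dim_id"], {})}
        key = tuple(attrs[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical normalized inputs: a fact table plus a country dimension.
facts = [
    {"dim_id": 1, "amount": 10.0},
    {"dim_id": 2, "amount": 5.0},
    {"dim_id": 1, "amount": 2.5},
]
countries = {1: {"country": "US"}, 2: {"country": "FR"}}
```

The ahead-of-time versus on-the-fly tradeoff in the passage is about whether results like this rollup are materialized as summary tables or computed at query time.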
[00:36:32] Unknown:
And then as far as the actual platform integration, as far as the data source, it's fairly obvious that you connect up to the customer's data warehouse and use either, you know, ODBC or JDBC for that connection. And then on the other side, you know, you mentioned that you have these JDBC interfaces or you have GraphQL APIs. But for somebody who maybe connects it up to their business intelligence dashboard and then wants to run a query that uses data from their data warehouse and also factors in this metric, is that something where they would just pass everything through the transform layer, and then you would pull in your metrics definition and then also push down a query into the data warehouse and then join those 2 on their return flight back? Or what's the story of being able to query against the existing database tables and the calculated metrics?
[00:37:25] Unknown:
We basically have an API, and we call it MQL, Metrics Query Language. And it allows the end user to ask questions in the format of metric by dimension. And so you're asking for, you know, some metric aggregated to some dimension. And you can also apply filters and, you know, ordering and whatnot. But that API request can basically be expressed within a SQL query. So I can say from MQL, you know, metric by dimensions, and that will return to me some dataset that kind of comes in as metric and then the various dimensions that I've aggregated that metric to. And so I can then express that API request within some broader SQL query where I'm using the full power of the customer's underlying data warehouse.
So, really, what this is doing is it's just building a denormalized dataset on the fly and then querying that dataset and joining it or applying aggregations or transformations in whatever SQL the end user has expressed.
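A rough sketch of what that expansion could look like: a metric-by-dimensions request gets compiled into the GROUP BY query that is pushed down to the warehouse, and the resulting dataset can then be joined or further transformed inside the user's own SQL. The function and its MQL-like arguments are illustrative assumptions; Transform's real MQL syntax and query planner are not shown here.

```python
def mql_to_sql(metric_expr, table, dims, where=None):
    """Expand a metric-by-dimensions request into the SQL a metrics layer
    might push down to the warehouse. All names are illustrative."""
    select = ", ".join(dims + [f"{metric_expr} AS metric"])
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql
```

For example, a "revenue by country" question becomes a single aggregation query whose output is a denormalized metric-plus-dimensions dataset.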
[00:38:36] Unknown:
So in some ways, it's kind of the inverse of a, you know, stored procedure or user defined function in that instead of you pushing a function definition into the database, you're pushing the database into the function definition.
[00:38:48] Unknown:
Yep. That's exactly right.
[00:38:51] Unknown:
You mentioned that the interface for the data producers is this code first YAML and SQL sort of combined format. And I'm wondering what your process was for deciding whether to go with a code first and code native approach versus more of a sort of low code or no code, UI driven framework for somebody who's maybe coming from the business side who wants to be able to define these dimensions and just what you see as the trade offs of having this sort of text based flat file definition versus a more UI driven approach?
[00:39:28] Unknown:
I think that there's probably a future where those files get pushed into a UI or an IDE kind of experience. I think we just wanted to start with an interface that gave us the maximum flexibility and ability to iterate. And so, you know, in the early days, when we kind of thought about that, what are the tools that our end users are using right now? Well, SQL and YAML are pretty widely adopted in kind of the data
[00:39:55] Unknown:
engineering and analytics engineering world. And so we wanted to kind of meet them where they were. Another interesting element of this emerging space is how much support there is in downstream tools, thinking particularly around things like business intelligence dashboards and, you know, other analytics frameworks for being able to introspect and understand the additional context that can be defined and exposed by the metrics layer as far as having a, you know, prose definition of, you know, this is what this metric is for. This is, you know, how you might want to use it, and this is, you know, some of the metadata about who owns it and who created it kind of thing. And what are some of the missing pieces of the overall data ecosystem that you hope to see filled in in the coming months and years as the metrics layer becomes a more established architectural sort of quantum?
[00:40:50] Unknown:
The challenge here for us is that the entire data ecosystem is really built around tables today. And it's not necessarily a significant challenge, but it is a missed opportunity. And so, you know, we can build tables off of our API requests. And by exposing this JDBC, you know, we can build datasets that make sense and share the metadata that we want over kind of whatever connection is coming in. But really what's kind of missing here is that you're not necessarily getting that rich experience that you get when you connect to an underlying database where you can browse the various tables and you can, you know, look at all of the different columns and kind of get some kind of summary information around it.
And so, you know, ultimately, I think that it's not necessarily a challenge for our end users to get that information because we can expose it as tables to them. And so if they want to look at a metric and look at the various dimensions that they could aggregate that metric to, we can share that with them. But it's coming in the form of a table and obviously to kind of conform to the world as it exists today. It's more about a missed opportunity to share that information and the kind of interesting information that can come with a semantic layer.
[00:42:12] Unknown:
In terms of having this semantic layer and this more sort of holistically defined and uniformly exposed method of creating and managing these metrics, what are some of the capabilities or projects or organizational capacities that are unlocked by adding this to the data platform that are either impractical or intractable otherwise?
[00:42:37] Unknown:
Just to start with the core value proposition, just consistent consumption of metrics in various tools. I think that it sounds obvious, but it really just doesn't exist at the majority of companies that we've talked to. It seems like it's 1 of the most universal challenges in the data stack right now. And then, you know, looking out to the future, I think that there are a number of different applications that are enabled if you have this information. So just thinking about the first 1 that really got me hooked on this type of tooling, product experimentation, when I was at Airbnb, I ran 150 experiments in something like 2 years.
And, you know, I was looking at 100-plus metrics on every single 1 of those. That is just not possible today. People don't have that kind of tooling broadly. You know, this is 1 of the core things that this enables. Beyond that, I have a lot of ideas for our product around this connection between forecasting and anomaly detection, annotations, and then notifications in context that can be pushed out to a company more broadly. A forecast is, you know, where do you think the metric is gonna go? An anomaly is when it's outside of, you know, wherever you think it's going to go. And then an annotation is kind of the addition of some context for whenever that metric moved outside of what you expected.
And then, you know, that's an important piece of structured information that can then be pushed out to an organization. And so I think that that is a very significant paradigm shift, where today, we're creating a lot of data objects where we expect data consumers, so business users, to come to a dashboard and pull some insight out of it. And that's a really, really tall ask. It's not just a tall ask because it's hard to get the data. It's a tall ask because, you know, having all of the context that's necessary to pull some interesting and valuable insight out of that data typically takes somebody who has kind of seen the data go end to end, to that place.
And so I think that we can create these really interesting interfaces beyond just the APIs and pushing the data out to actually add context to these metrics. And that kind of takes me to this last point, which is that a metric is an incredible vehicle for information. They're 1 of the most consistent objects in a company over time. They don't switch teams. They don't quit. You know, they are consistent and long lasting, especially if they're well managed. And so by actually tying knowledge to them over time, you have the potential to add a lot of context that, you know, I think people don't have in many of the organizations that they're working in. So just to kind of tie that down to a concrete example, it just happens so frequently in just about every organization that I was in that, you know, somebody asked me what happened on this specific date. I know you were at this company 3 years ago.
Help me kind of understand that. And, you know, oftentimes that information just gets lost. And I think that a metric is a really interesting unit to kind of carry that information forward.
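The forecast / anomaly / annotation chain described in this answer could look roughly like this, with a deliberately naive mean forecast standing in for a real forecasting model; the function name and tolerance band are invented for illustration.

```python
def detect_and_annotate(history, latest, tolerance=0.2):
    """Sketch of the chain: forecast the next value as the mean of recent
    history, flag an anomaly when the observed value falls outside a
    +/- tolerance band, and return an annotation string that could be
    attached to the metric and pushed out to the organization."""
    forecast = sum(history) / len(history)
    lo, hi = forecast * (1 - tolerance), forecast * (1 + tolerance)
    if lo <= latest <= hi:
        return forecast, None  # within expectations, nothing to annotate
    direction = "above" if latest > hi else "below"
    note = f"metric moved {direction} the expected range [{lo:.1f}, {hi:.1f}]"
    return forecast, note
```

In the paradigm shift the answer describes, the annotation (not a raw dashboard) is the structured piece of context that travels with the metric.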
[00:45:47] Unknown:
Yeah. There's definitely a huge risk of loss of context and loss of value in an organization when somebody who has that useful understanding and experience either changes roles or responsibilities or leaves the company entirely and doesn't actively document it. And so being able to have this as the long term artifact of somebody's experience, I can definitely see a lot of potential value from that. Yep. In terms of the users of the platform and customers who are starting to onboard with Transform, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
[00:46:26] Unknown:
I think that probably the most interesting thing is defining interfaces between teams. I think that I took this for granted when I was at Airbnb. I kind of just assumed that this was normal, but we've seen a lot of teams adopt this tool and then define various metrics in different parts of the company that historically have not been kind of consumed or kind of crossed the boundaries of various teams. And, you know, we've gotten some really fun feedback from our customers around, hey, I've just never sliced this metric by this dimension before because, you know, this 1 existed in a dataset that this team relied on, and this 1 existed in a dataset that my team relied on. And so that's really exciting, and I think it demonstrates a lot of the potential of this framework. And, you know, I kind of think back now to my time at Airbnb where I was on the growth team. Right? And so I consumed metrics from a wide variety of teams because oftentimes the growth team's work impacts some other team. And so I was consuming, you know, the customer service contact rate or the, you know, account takeovers related to sign up and log in flow work that I was doing. And I, to this day, don't know the definitions of those metrics. I could not have written the SQL to calculate them, but I know that I consumed them. And I know that the teams that reviewed my analysis trusted the analysis because they had defined the SQL.
And so it's this kind of incredible unlock to basically just be able to communicate with another team reliably. I think that this actually touches on 1 of the kind of core principles of data mesh. You know, that's an exciting future that we are moving towards.
[00:48:09] Unknown:
Yeah. The data mesh aspect is definitely an interesting element to pull out because it's been gaining a lot of ground over the past couple of years and has a lot of sort of utility in terms of how you think about building out the technical underlayment of the organizational capacity for data. And I can definitely see the metric as being a useful sort of exposed artifact for a given data team to be able to propagate and let other teams consume and combine them without necessarily having to understand the underlying calculations and computation that happens. That's an interesting point worth noting. Mhmm. And then in terms of your experience creating the transform product and building the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:58] Unknown:
You know, I think that the majority of these come from generalization. So we saw this tool work within 1 company, and we went out and talked to, you know, maybe the 10 or 15 companies that have gone out and built similar tooling. But that's a very narrow picture of how people build and consume metrics. And there are a lot of really complicated factors in there that, you know, require us to then generalize the way that the tool is built such that it'll be more useful broadly. And so, you know, some of these include just different data modeling techniques.
You know, Airbnb had a good mixture, I think, of nice clean normalized datasets, semi denormalized datasets, and then raw datasets that were finding their ways into metrics. But it wasn't even close to representative of all of the different, you know, data modeling and data engineering techniques that companies are using. And so a lot of lessons there. I think that also different scale puts different requirements on this framework. So when I think about this, I think about that denormalization challenge of what are we building statically? So, you know, what are we building to the data warehouse ahead of time?
And what are the kinds of questions that we're making it so that even if there are 100 billion rows in this fact table, we can still answer the question of, you know, how many rows were there per dimension that I'm trying to aggregate it to. And what that takes is basically pre aggregating datasets. And that's something that Airbnb got really good at because it had large amounts of data. But a lot of the companies that we're working with really just want to be able to do these things dynamically and on the fly, and they still don't wanna wait that much time. And so it's, you know, some combination of building datasets and then storing intermediate representations of them such that incremental questions can be answered quickly.
But they don't necessarily have the time to go out and build a bunch of, you know, nice clean, denormalized summary tables that they can expose to their organization. You know, that's been a really big challenge, but also a really big learning. And I think that it pushed us towards making our APIs dynamic so that you can ask for any metric and dimension combination. But there's a bunch of interesting work that we're doing around caching to make it so that those results can get returned quickly. You know, the last 1 I think is just organizational challenges associated with metric definition and the whole life cycle management process that I mentioned.
It's tough and just about every company has a different idea of how this works. And so there is a big challenge around kind of productizing that. You know, what that means is that there needs to be a lot of configurability because this catalog really needs to work in the ways that companies expect it to work for that process that they want to run.
[00:52:00] Unknown:
And your point about precalculating summary tables is interesting because I've had a lot of conversations with people where the sort of general guidance is that you should have 1 or a small set of tables that can answer 80% of the questions in your business. And with the introduction of metrics and the amount of information that you have about what data is being used, how, and by whom, there's the potential for an interesting feature where you can recommend a set of summary tables that would be useful to precompute to increase the speed at which you're able to generate these other sort of metrics views of the underlying data. Yeah. That's right. And that's kind of why we have really 2 primary layers of caching.
[00:52:46] Unknown:
The first 1 is 1 where the company can say, I know that I wanna compute, you know, this metric and this dimension together, and I want it to be really, really quick. I want the queries to be really quick on top of that. And so, you know, that's something where they know ahead of time, and we can get that query down to, you know, in a really fast data warehouse under 1 second. And on the other end of that, there are times when people just ask new questions, but if they find something interesting, they're gonna keep asking it. And so we have this layer that we call dynamic caching, which basically allows you to ask questions. And then if you go and ask that same question again, it's gonna be really fast because we're saving that dataset in a similar way to the way that we're saving that materializations dataset.
And this really enables people to ask these metric questions really quickly, but also enables them to ask a wide variety of them. And so I've definitely heard that 80% of questions can be answered by core summary tables. And I think I would push back on that, and I would say that it might be that the people who are consuming data at your company have just given up, and so you're not discovering the rest of the data questions that they have because you're kind of just seeing the ones where they ask a question and it's not answerable, and then they give up. And so I think that, you know, what we're seeing is that as more and more people are adopting this tool and there are more combinations of metrics and dimensions that people can ask questions about, they will just ask more questions, and hopefully, that leads to, you know, more interesting and valuable insights getting pulled out of the data.
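The dynamic caching behavior described in this exchange can be sketched as a keyed memo over metric questions: the first ask of a (metric, dimensions, filters) combination runs the query, and repeat asks are served from the saved result. This is an invented sketch under those assumptions, not Transform's implementation.

```python
import hashlib


class DynamicCache:
    """First ask of a (metric, dimensions, filters) question computes it;
    repeat asks hit the saved result, much like materializing on demand."""

    def __init__(self, compute):
        self._compute = compute  # callable that actually runs the warehouse query
        self._store = {}
        self.hits = 0

    def _key(self, metric, dims, where):
        # Sort dimensions so equivalent questions share a cache entry.
        raw = repr((metric, tuple(sorted(dims)), where))
        return hashlib.sha256(raw.encode()).hexdigest()

    def ask(self, metric, dims, where=None):
        key = self._key(metric, dims, where)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._compute(metric, dims, where)
        return self._store[key]
```

The pre-specified materializations mentioned earlier would be the same idea run ahead of time, so the first ask is already warm.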
[00:54:27] Unknown:
For people who are interested in the idea of a metrics layer and they want to be able to add some uniformity to how the metrics are defined across their different tools, and they want to be able to enable their business users to explore more of the dimensionality of their data, what are the cases where Transform is the wrong choice and they might be better suited with some in house tool or something purpose built for their particular use case? I mean, we've talked to a lot of fairly small companies because
[00:54:58] Unknown:
I think that they have productivity challenges, but they don't yet have the trust challenges that our framework and the rest of our product is really aimed at solving. And the reason there is just that if you have 1 or 2 data analysts on a team, you already have metrics consistency. Right? It's already in the heads of those data analysts. They know the definitions and, you know, they are kind of the interface to data for the rest of the company. And, you know, there are some productivity challenges associated with that because if it's a data hungry organization, there's gonna be a lot of consumption of metrics, and that's a significant thing to support.
But then, you know, what inevitably happens is they add more data analysts to that team, and then you start to have some of those trust challenges. And so I would say that, you know, fairly small companies should probably just kind of focus on the core of getting good clean data into their warehouse and normalized and ready for consumption. And then they need to start thinking about, you know, what are the different applications where I wanna consume metrics? Because Transform is really valuable once you have, you know, more than 1 application.
You know, just because if you're consuming in multiple places, that is where, you know, this framework adds a lot of value. The second 1 I would say is, you know, there's a whole kind of set of companies that consumes metrics off of, you know, Salesforce or Zendesk or any number of other tools. And because we're built on top of the company's centralized data warehouse, we, you know, just can't serve those customers yet. But, you know, generally, I would say that just about every medium to large company has metrics problems. And that's kind of the, you know, set of companies that we're working with in the early days.
[00:56:43] Unknown:
And as you continue to build out the product and build out the business, what are some of the things you have planned for the near to medium term that you're excited
[00:56:51] Unknown:
for? There's just so much foundational work. And, you know, the reason there is that if you are going to define a single source of truth for metrics, there's kind of a core product philosophy that I think you have to have. 1 is that you have to be able to consume metrics from, you know, wherever it's located, and you need to be able to build whatever metric types a customer wants, and we're still working on that. There are a lot of different types of metrics that companies wanna consume, and, you know, I would put us in the kind of 90% at this point. We can support all of the kind of core types of metrics, but still working to support some of the kind of edge cases that specific companies are interested in tracking.
And then on the other side of this, you have to be able to connect to every single tool that a company wants to consume those metrics in. Because in order for this truly to be a single source of truth, it has to be consumable in all of them. The moment it's not consumable in 1 of them, they will go around this tool, and it is no longer a single source of truth. So there's just a lot of foundational work to enable that vision. But, you know, beyond that, I think that once you have consistent metric definitions, there are a bunch of really interesting applications. And these are the things that I already called out around forecasting and anomaly detection, you know, interesting correlation analysis between them, building metrics for different applications like experimentation, you know, executive reporting.
There are just so many different applications, and I think some of those are well served today. You know, BI is an example of something that there are many different takes on how BI should work, and there are many people who are kind of building the future there. But I think that there are a lot of different applications for metrics where people are still just kind of starting from home base and trying to figure out how am I going to build this application. And it all starts with building metrics out in the data warehouse and then figuring out how to then kind of productionize that. And so I think we can help with some of those, I call them long tail applications of metrics.
[00:58:56] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available in the data ecosystem today.
[00:59:16] Unknown:
just making metrics a first class citizen of the data ecosystem and generally making data more accessible. But maybe more broadly, 1 that I'm passionate about is, I think, in order for data to really truly be accessible, we need to make a lot of progress with the data tools that we've built out. And I think in order to do that, there needs to be much broader cooperation between the various companies working in this industry. And so, you know, I'm excited about projects like OpenLineage. I'm excited that we are kind of pushing the specs of how our semantic layer works out into the open. And I think that this is something that will hopefully allow more companies to build on top of Transform.
[01:00:02] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Transform. It's a very interesting product and an interesting problem space. I'm definitely excited to see more energy behind it and the wider availability of metrics across the overall data ecosystem. So thank you for all of the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thanks for having me, Tobias. This was a lot of fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Transform with Nick Handel
Nick Handel's Journey in Data Management
Defining Metrics and Their Importance
Challenges and Evolution of Metric Stores
Transform's Mission and User Experience
Platform Architecture and System Design
Handling Semi-Structured Data and Metric Life Cycle
Metrics Layer vs. Feature Store
Complexities in Metric Definition and Denormalization
Unlocking Capabilities with a Metrics Layer
Customer Use Cases and Organizational Impact
Lessons Learned and Future Plans for Transform