Summary
Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. To eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in cooperating on open standards.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
- Your host is Tobias Macey and today I’m interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the Open Lineage project is and the story behind it?
- What is the current state of the ecosystem for generating and sharing metadata between systems?
- What are your goals for the OpenLineage effort?
- What are the biggest conceptual or consistency challenges that you are facing in defining a metadata model that is broad and flexible enough to be widely used while still being prescriptive enough to be useful?
- What is the current state of the project? (e.g. code available, maturity of the specification, etc.)
- What are some of the ideas or assumptions that you had at the beginning of this project that have had to be revisited as you iterate on the definition and implementation?
- What are some of the projects/organizations/etc. that have committed to supporting or adopting OpenLineage?
- What problem domain(s) are best suited to adopting OpenLineage?
- What are some of the problems or use cases that you are explicitly not including in scope for OpenLineage?
- For someone who already has a lineage and/or metadata catalog, what is involved in evolving that system to work well with OpenLineage?
- What are some of the downstream/long-term impacts that you anticipate or hope that this standardization effort will generate?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on the OpenLineage effort?
- What do you have planned for the future of the project?
Contact Info
- @J_ on Twitter
- julienledem on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- OpenLineage
- Marquez
- Hadoop
- Pig
- Apache Parquet
- Doug Cutting
- Avro
- Apache Arrow
- Service Oriented Architecture
- Data Lineage
- Apache Atlas
- DataHub
- Amundsen
- Egeria
- Pandas
- Apache Spark
- EXIF
- JSON Schema
- OpenTelemetry
- OpenTracing
- Superset
- Iceberg
- Great Expectations
- dbt
- Data Mesh
- The map is not the territory
- Kafka
- Apache Flink
- Apache Storm
- Kafka Streams
- Stone Soup
- Apache Beam
- Linux Foundation AI & Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you're flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance, whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn't disrupt legacy systems.
High-growth startups use Molecula's feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to preprocess data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale real time data, visit dataengineeringpodcast.com/molecula. That's M-O-L-E-C-U-L-A, and request a demo. Mention that you're a Data Engineering Podcast listener, and they'll send you a free t-shirt. Your host is Tobias Macey. And today, I'm interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools. So, Julien, can you start by introducing yourself? Hello.
[00:02:20] Unknown:
I'm Julien. I guess I've been working in the big data space for the past 14 years. I started at Yahoo, building platforms on top of Hadoop. And then I started contributing to open source projects like Pig, and that's how I joined the Twitter data platform team. There I started the Apache Parquet project, and that led to contributing to the launch of the Apache Arrow project later on. And more recently, I was the architect for the data platform at WeWork.
[00:02:52] Unknown:
After that, I started Datakin, where I'm the CTO now. And so you've actually been on the show. This is your 3rd time now. So you were on to talk about your work with Apache Parquet, and you were on with Doug Cutting, who was the creator of Avro. So that was a good conversation. And then you were also on to talk about your work with Marquez, which is a natural transition to where you are now with Datakin, which is building on top of that platform. So for folks who listened to the Marquez episode, I don't know if you wanna just give a quick recap about where that project has gone and maybe what you're building on top of it with Datakin before we dig into OpenLineage.
[00:03:25] Unknown:
Yeah. So, you know, that came from when you build a data platform, it quickly becomes evident that you need an equivalent of service oriented architecture, but for data pipelines. Right? Like, people consume data and they produce data. And by default, there's very little visibility into who consumes what we produce and how they're impacted by the changes we may be doing. And we need to understand where the data is coming from that we're consuming and how it's being maintained and delivered. Right? So that led us to start the Marquez project at WeWork.
So in building the data platform at WeWork, there was this real need for, how do we understand an organization where there are many teams that consume and produce data? How do they understand how they depend on each other and how things change? That was a missing piece in the open source ecosystem. Right? So that was the reason for starting Marquez, and that was the beginning point for starting Datakin. And, like, how do we build data observability for data pipelines?
[00:04:31] Unknown:
So I know that OpenLineage is also an outgrowth of the work that you've been doing at Datakin and the prior work with Marquez. Can you start by just giving a bit of an overview about what the OpenLineage project is and maybe some of the story behind starting this effort and where it's taken you so far?
[00:04:46] Unknown:
Yeah. So while building Marquez, it became obvious that there's a need for standardization of lineage collection. Right? Like, there are many projects that care about collecting lineage, and there are many projects that produce lineage, right, whether it's through SQL or Spark, Python, Pandas DataFrames. There are a lot of projects that produce lineage, and there are a lot of projects like Atlas, Egeria, Amundsen, DataHub, Marquez, and other proprietary data catalogs, you know, who are interested in collecting this lineage. So there's lots of duplication of effort on how we extract the lineage for each of those projects.
And it creates, first, a lot of duplication. And second, it's very brittle, because by extracting the lineage from the outside of those projects, whenever something changes in their internals or in how they're exposing it, it breaks this extraction. So there's lots of duplication, a lot of wasted effort, a lot of complexity. And so the idea was, let's join forces. Right? So we reached out to a large group of contributors and maintainers of those projects, whether in the space of people who are more interested in consuming, like the DataHub, Egeria, and Amundsen projects, or people producing this data. Right? Like, in the Spark community, in the Pandas community, and, like, all the SQL-related projects.
And, really, we kind of got together and said, like, hey. How do we build this standard in open source? And how do we agree on having a standard way of representing lineage? And so, you know, as soon as those conversations started, it was pretty clear that everybody was, like, yes, why don't we already have a standard like this? That's where we started the project. Like, we did a kickoff with an initial group of contributors, and we got the first version of the standard, really having a core model of inputs, outputs, and runs of a job, capturing the lineage, and having a framework for adding metadata around this core model. And it's really an invitation for all the people who care about lineage to help define this metadata and improve that project.
[00:07:02] Unknown:
As far as the state of the ecosystem, you mentioned that each of these different systems is generating metadata, but there isn't really a way for them to be able to communicate that effectively. And then another aspect worth digging into is, you know, so we're talking about lineage specifically. And I'm wondering, before we dig into the ecosystem piece, if you can differentiate between data lineage and then the broader category of metadata management, how those 2 are interrelated, and the scope of what OpenLineage is trying to accomplish in terms of standardizing at least the lineage aspect of that metadata format?
[00:07:40] Unknown:
Lineage is something we talk a lot about, but I think there are several use cases. So when people talk about lineage, they actually have different conceptions of it, because there are fairly different use cases related to what we do with lineage. So there's one aspect which is more about governance, like understanding how a particular dataset, or a particular column in a dataset, is derived from a canonical data source that is the source of truth for defining something. And then there's lineage in the sense of, I want to know exactly how this dataset was derived from this other one, what was the version of the job, and it's more like operational lineage of understanding what happened and how things were transformed.
And that led to building products. The tools that care about lineage focus on one of those use cases. So they may be focusing on, oh, I want to understand how this particular metric is derived from core datasets. Right? So that's one aspect. And others may care more about, actually, I want to make sure that those transformations are reliable and happening at the right time. And those lead to collecting a slightly different version of the lineage and building different use cases on top. And so OpenLineage has been focusing on how do we model those running jobs. Right? Because that's the core underlying data on which you can build a lot of use cases. Like, whether you care about privacy or governance or compliance or operations, you need to be able to understand how the running jobs are happening. So I think these different points of view on the use cases led to different solutions and make it difficult to reuse if we don't focus on standardizing and modeling those running jobs and really collecting this metadata. In some ways, OpenLineage is a little bit like EXIF for data pipelines. You know, EXIF is this standard for adding metadata to pictures. So when you take a picture with your digital camera or smartphone, you have your GPS coordinates in it. You'll have the time it was taken, and this is all standardized.
And this is a bit the same. Right? The best way to capture this metadata about the job and the transformation it is doing is when it's running and this information is available.
[00:10:07] Unknown:
And so this is what OpenLineage is focusing on. In terms of the actual goals of the OpenLineage effort, there's obviously the need to be able to standardize on the format of some of these core aspects of generating data and processing data. So how are you approaching the actual standardization effort, identifying what those core elements and building blocks are for being able to generate useful information, and the overall scope of what you're trying to build with this standard? Is it just, you know, some markdown and other people have to go and implement it, or do you also have reference implementations and code that people can take off the shelf to be able to actually get up and running with it? The approach to this standardization, I guess, like, we talk about a standard, but it's very different from what a standards body would be doing. Right? Like, one of the goals of OpenLineage
[00:10:55] Unknown:
has been to not have one big monolithic spec. So the way it's defined is there's a core spec. The way it's encoded is using JSON Schema. So there's kind of a formal representation of what the metadata looks like. And there's a core model that captures, you know, just a very simple lineage notion of, like, there's a job, and it's uniquely identified by its name. There are input and output datasets that are also uniquely identified by their name, and there's this notion of a run. Right? So we're expecting there are recurring jobs, and there's a run, and we know when it started, when it finished, what datasets it read from, and what datasets it wrote to. And then around that, there's this notion of facets.
And a facet is an atomic piece of metadata. Basically, it's its own mini spec. So each facet has its own JSON schema. And what that helps is, as a community, instead of having one big discussion that moves very slowly and one monolithic spec, there's a core spec that's very small and just focusing on lineage. And then we can have different, very focused conversations on how do we define the schema of a dataset? How do we define data quality metrics, you know, data profiling? How do we define column level lineage? How do we define a representation of a query profile, for example? And how do we define the version of a job? So in this model, you know, the nice thing about open source is the first goal is to agree on the common direction for the project.
And then you can have different conversations moving at different speeds. So for things that are controversial, we're going to make sure there's one conversation happening independently where people can argue and find a consensus. And things that are less controversial can move fast because they'll have their own facet to move on. So it's a bit of controlled chaos, with emerging facets, really empowering people to drive the efforts they care about. Like, there are people who care about data quality, people who care about governance and column level lineage. And these may be different subgroups in the community, and that helps make each of these conversations move faster. So that's kind of one of the goals. And so, you know, the naming of OpenLineage is also drawing the parallel with OpenTelemetry and OpenTracing, right, which are providing a similar effort for the services world. There's a nice parallel because it's really about making a spec, standardizing how the lineage and quality metadata is exposed.
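The core model described here (a named job, a run with a start and finish, and the input and output datasets it touches) can be sketched as a plain JSON event. This is an illustrative sketch of the event shape rather than a canonical example from the spec; the namespaces, dataset names, and producer URL are made-up placeholders.

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(event_type, job_name, run_id, inputs=(), outputs=()):
    """Build a minimal lineage run event as a plain dict.

    Follows the core model described above: a job identified by a
    namespace and a name, a run identified by a UUID, and the input
    and output datasets it reads and writes. All namespaces and the
    producer URL are illustrative placeholders.
    """
    return {
        "eventType": event_type,  # e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-pipeline", "name": job_name},
        "inputs": [{"namespace": "my-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "my-warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-scheduler",  # who emitted this
    }

run_id = str(uuid.uuid4())
start = run_event("START", "daily_sales_rollup", run_id,
                  inputs=["raw.sales"])
complete = run_event("COMPLETE", "daily_sales_rollup", run_id,
                     outputs=["analytics.daily_sales"])
print(json.dumps(start, indent=2))
```

Note that the START and COMPLETE events for the same run share a runId, which is what lets a consumer stitch them back together into one run of the job.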
But then, of course, as you mentioned, there's a reference implementation. So OpenLineage provides clients to make it easier to produce OpenLineage events. And there's a reference implementation, which is the Marquez project, that consumes and stores and indexes lineage events. And, of course, there are several other projects looking into consuming OpenLineage. So today, if you want to use OpenLineage out of the box and you don't have anything, you can start by deploying Marquez, and you can start consuming. And there are several integrations that produce OpenLineage events for Spark, for BigQuery, for Redshift, for Snowflake, so Spark, Airflow, and the SQL warehouses, to start collecting lineage in Marquez. But, of course, OpenLineage is a mechanism that lets you share this information. Right? It can easily be used in different tools or different products that may have a different point of view. Right? When we talk about the more governance aspects or the more operational aspects, the fact that there's a standard for exchanging metadata makes it easier for various tools to give you those different points of view on the lineage data. In terms of the actual
[00:14:49] Unknown:
effort of getting this off the ground, I know that when you first published the post announcing this effort, there were already a number of other projects and organizations that had signed on to it. Some notable ones that come to mind are Superset, DataHub, and obviously Marquez and Datakin. You mentioned Airflow. And I'm just wondering how much upfront effort there was in terms of discussing with people what these core building blocks needed to be and agreeing on that core structure before you started writing the specification, or if it was flipped, where you started writing the specification and then you got people excited and signed on, and the process involved in identifying which core elements are universal enough to be worth keeping in the core specification versus adding into these facets that are, you know, optional plug-ins to the metadata information?
[00:15:39] Unknown:
Yeah. So we took an approach that was similar to what we did for Apache Arrow, some years back, contributing to the launch of the Apache Arrow project. And we did a similar approach, which was, let's reach out to all the people we think are interested in that space or, like, need this kind of standardization. So, you know, it's a little bit through the Apache Parquet community and then the Apache Arrow community. So I reached out to projects like DataHub, as I mentioned, you know, the Pandas project, people in the Spark community, Iceberg, the data quality world, like Great Expectations or dbt as well.
We started by reaching out to people, and, like, most people I reached out to were agreeing, like, yes, we need a standard. Why doesn't that exist already? How do we make it happen? So what we did, we started a GitHub project, a mailing list, a Slack channel. And so we started with no spec. So I didn't come with a spec pre-written, but we started by discussing and agreeing on, like, the core model. And, really, the initial spec is really minimalistic. So it's on purpose, narrowing it down to just the lineage information, which is the connection between the job names and the dataset names they're reading and writing in a run. Right? And so it focused on that, which helped getting to a consensus pretty quickly with this initial group of people. And then a few other people kind of heard about it, like, we tweeted about it, we presented about it, and they said, oh, well, we were not part of the initial conversation, but we'd like to join.
And in particular, I'd say, like, the projects that have been the biggest drivers in this effort have been, like, the Egeria project, which is a project about metadata and governance, like, a much wider scope of metadata and governance. The dbt folks were very helpful. The people around the data mesh efforts have given a lot of feedback. You know, there are people from the Great Expectations project who are really interested in the data quality aspect of this. And I'm sorry, I apologize in advance for the people I'm forgetting. You know, it seems like in every open source project, there are people who are kind of more the drivers, like, pushing the project, and people who are more interested in following and making sure they're on top of it and giving their feedback. You know, they'll adopt as it gets more steam as well. So that's kind of where we're at.
[00:18:10] Unknown:
In terms of the actual core model, you mentioned that there's this concept of the entity, and then there are the jobs and the runs. So you had this group of people who were, you know, working together to figure out what are the elements that we need to include in the specification. And I'm just wondering what you've run into as far as any sort of differences in opinion or complexities in terms of figuring out how do you fit this model that needs to be so broadly applicable, but also opinionated enough to be worthwhile, just figuring out what that balance is, particularly because you have so many people involved who I'm sure have their own opinions as to what's most important and how to represent it and how they might need to consume it and just the overall process of going from this is a problem that needs to be solved to we have a standard that people are iterating on and actually using somewhere.
[00:18:56] Unknown:
Yeah. So one of the design goals of the core model is to really remove impediments to people being able to use it without having to go through, like, a thorough approval process. So there's also this notion of a custom facet. And so there are kind of trade-offs between being very precise and modeling precisely, like, you know, the query profile from Snowflake or the query profile from BigQuery or the query profile from Presto. Right? So there's one aspect of being very precise, and one aspect of having a model that is generic enough that it can represent any of those things in a way that lets you understand a query profile in a generic way. So this notion of a custom facet lets each contributor define their own opinionated version of something. Right? Like, when you define a query profile, each query engine may have, like, their own opinionated version of it. And so they'll have the ability to define, this is how we define a query profile. This is how we define column level lineage, or this is how we define, you know, the table schema.
And so that allows them to make their own specific definition in their own namespace, and they own those specs. Right? They own that definition. They don't need to get approval from the central project or get consensus to build them. And then, on the other hand, there are going to be core facets that are part of the project spec itself and that define a generic version of this. And so when you publish metadata, it's okay to have some, you know, duplication in the metadata, and to have the same information presented both in the opinionated version of one of the contributors to the spec and in the generic version, which might be a bit lossy. Right? Like, if you're trying to make a representation of schema that works across all the possible things that exist, it might be a bit lossy, but there's value in being able to say, hey, we have a way of representing schema that makes sense, you know, across everything, and that will deliver value. So there's kind of this trade-off between being very flexible and extensible, where people can add their own facets,
and also having core facets that aim to be a more generic representation of things. And those will require more consensus and more conversations. But the good thing about having both is, like, people are not blocked. Right? They can start publishing metadata right away from their own point of view and under their own namespace. And then, as we collect more of those, it makes it easier as well to build a consensus. Right? Because people will already have exposed things in a very explicit way. Right? They'll have defined a JSON schema for those things. And so there will be a clear definition of what they're doing, and it will make it easier to arrive at a consensus, combine those things, and, you know, promote some of those facets to the core spec.
So it's kind of the general idea of how we build this in the open source.
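As a sketch of what a producer-owned, opinionated facet might look like, here is a hypothetical query-profile facet attached to a dataset. The facet name, its fields, and the URLs are invented for illustration; what follows the discussion above is the convention that each facet carries a pointer back to the schema that defines it.

```python
# A dataset carrying one custom facet, sketched as a plain dict.
# The facet lives under a producer-chosen key and points at its own
# JSON Schema via _schemaURL; names and URLs here are hypothetical.
dataset = {
    "namespace": "my-warehouse",
    "name": "analytics.daily_sales",
    "facets": {
        "myQueryProfile": {  # opinionated, producer-defined facet
            "_producer": "https://example.com/my-engine",
            "_schemaURL": "https://example.com/schemas/MyQueryProfileFacet.json",
            "elapsedMs": 5230,            # engine-specific fields the
            "bytesScanned": 1_874_003_214,  # producer alone gets to define
        }
    },
}
```

Because the facet is namespaced under the producer's own key, it can evolve without any central approval, exactly as described above.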
[00:22:06] Unknown:
And in terms of the actual structure or infrastructure around this, so, you know, there's the repository that has the core specification. I know that there's a Python and a Java client for being able to build your own integrations. And then for the actual namespaced facets, I'm curious what the discovery mechanism looks like for being able to propagate that information. Is it something where there's a certain producer, they add that into one of these facets, and then it just kind of exists as part of that data element? And then once you're looking at it on the other side, you know, it's been generated, it's been propagated through your system, it's stored in your, you know, system of record for lineage metadata. How do you then discover what that facet model is trying to convey? Like, what are the contextual elements? How do you propagate that? Is that just entirely self-contained as part of the specification?
Or do you have some sort of central mechanism for being able to look up those namespaced elements and then have, you know, maybe some prose to explain what these different attributes are trying to describe?
[00:23:07] Unknown:
When you define a facet, you have to provide a JSON schema, and the facet itself has to contain the URL to your JSON schema. So that will let us build a registry of all the facets that exist. So when you build a custom facet, your requirement is to define a JSON schema for it. And as part of the JSON schema, you can add, like, documentation of the semantics of each of the fields. So, of course, you can put constraints on the names and the types, whether they're required or not, any level of nesting, documentation, examples.
And so every custom facet will contain a link to the schema that represents it. And that's how we'll be able to, you know, keep track of all the facets that exist, all the custom facets, who defined them, and have a registry as part of OpenLineage that keeps track of all of those. Right? So you'll have the core facets, whose schema will be part of the core spec. And then the custom facets must provide a reachable URL with the JSON schema.
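A minimal sketch of what such a published schema might look like, paired with a toy required-field check standing in for a real JSON Schema validator. The facet name, fields, and URLs are hypothetical; only the idea that a facet must satisfy the schema it links to comes from the discussion above.

```python
# Sketch of the JSON Schema a custom facet would publish at its
# _schemaURL. A real consumer would run a full JSON Schema validator;
# here we only check the "required" keyword to illustrate the idea.
facet_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "_producer": {"type": "string", "format": "uri"},
        "_schemaURL": {"type": "string", "format": "uri"},
        "elapsedMs": {
            "type": "integer",
            "description": "Wall-clock duration of the query in ms.",
        },
    },
    "required": ["_producer", "_schemaURL"],
}

def missing_required(facet: dict, schema: dict) -> list:
    """Return the required schema fields absent from the facet."""
    return [f for f in schema.get("required", []) if f not in facet]

ok = missing_required(
    {"_producer": "https://example.com/my-engine",
     "_schemaURL": "https://example.com/schemas/MyQueryProfileFacet.json"},
    facet_schema,
)
```

The `description` entries are where the prose explaining each attribute would live, which is what makes a central registry of these schemas self-documenting.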
[00:24:09] Unknown:
In terms of the actual concrete implementations of this, particularly the storage system for this lineage information, what kinds of challenges are posed by having these arbitrarily structured, arbitrarily nested records and then being able to actually analyze them and expose them in a meaningful fashion, to be able to build some sort of automation or visualization or intelligence on top of these pieces of metadata? And are there any best practices or recommended constraints in terms of, you know, the number of levels of nesting or the specific structuring of attribute names, or anything along those lines, to try and build in some sort of commonality and be able to build tooling that can take advantage of a number of different facet implementations?
[00:24:59] Unknown:
The core of OpenLineage is focused on modeling the running jobs. Right? So that makes it very reusable, but that also makes it not geared toward any specific use case. Like, you know, there are people who care about user privacy. Right? So they want to be able to track how private user data is consumed and used across pipelines. There are people who care about data quality, and they want to make sure the data is correct and see how it evolves over time, and different things like that. So, really, what this is about is what happens when you receive those events. Right? OpenLineage defines how we're going to model metadata coming from the running jobs, and it's up to the consumer to build those indices.
And, like, store the data and index the data in a way that's going to make it easy to use for their particular use cases. And so that's where it's important that the models represent the data, and they're not necessarily optimized for their storage. But there's a spec that defines exactly what it looks like. Right? So you can pick which fields you need to index and how you need to store things to be able to enable various use cases. So Marquez offers a generic representation and storage of those facets, and it does index in a specific way some of those fields in particular that it picks, and then it gives generic access to the list of facets for datasets or for jobs. And so you can also keep track of how the metadata changes over time. So there's a generic way of storing those facets in JSON, and they're defined as JSON.
And then for the facets you care about, that's where you're going to extract this information using those schemas. They'll have a fixed representation, and you'll be able to index them for your use case. And I think this decoupling is what makes OpenLineage very reusable. Right? When you model the data around the way you're going to be using it at the time you collect it, that's where you're actually losing a lot of information and making this metadata you're collecting very specific to your use case. Right? When you care about dataset-to-dataset lineage, and that's how you collect the information, and you drop all the information about the job itself, then you lose the ability to use the same metadata you collected for other use cases. So it's really important to be able to decouple and have a clean model of the running jobs.
That may not be the exact representation you need for, you know, your indexing use case, but that's where you'll have a transformation in between that indexes it
[00:27:34] Unknown:
accordingly. In terms of your experience of going through this process: you're somebody who, as you mentioned, has been working in open source for quite a while, so you're very familiar with that overall process, and you've been working on Marquez and in the data ecosystem. So I'm sure you have a lot of your own opinions and understandings of the various useful aspects of lineage metadata, and metadata in general. But what are some of the ideas or assumptions that you had at the outset of this project that have been revisited or revised as you've iterated through and worked with other people to build out this definition and tried to grow the community around it?
[00:28:09] Unknown:
You know, initially, there was this strong need to standardize things. But, actually, as we grew, this idea of being flexible, letting people customize things, and enabling them to move fast and add their own systems and add metadata to them without friction was very important. Also, I think when we think about data processing, there are some nice properties of it that are best practices. Like, oh, jobs should be atomic, meaning that either they succeed and they update the output, or they fail and they don't touch any of the data. And this atomicity, for example, is not always real. Right? It kind of depends how things are done. And we also define those lineage graphs as DAGs. Right? Like, a directed acyclic graph, because you only consume from source datasets and write to new datasets. Right? And so there are no cycles. Which is not true either. Right? Like, there are jobs that will read the previous version of the data, merge it with new data coming in, and write back to the same dataset, creating cycles in the graph. And this is fine. And, you know, some of those things need to be part of the model, because especially when you have non-atomic jobs or you have cycles, those are things that can create problems in production. So you want to make sure that you are able to observe those things and have an accurate representation of them. So some of the assumptions were about a perfect, you know, spherical data pipeline in a vacuum, so to speak. We had to make sure that the model would actually be able to reflect reality and keep track of those imperfections, because those are the ones you want to be able to observe when you're debugging something.
[00:29:57] Unknown:
Along that same line, you know, there's a quote, I believe it's from Marshall McLuhan, but I could be wrong, saying, you know, the map is not the terrain. And so in this case, OpenLineage and the lineage metadata is the map. And as you were saying, it's not necessarily a perfect representation of the terrain and the reality of the processing system. And I'm curious if there are any other sort of edge cases or difficulties that you've run into in terms of figuring out, you know, okay, well, in the real world, this is what's happening, which makes no sense.
I don't know how this ever worked, but now I need to figure out how to be able to actually represent this so that I can show it to people, and so that they can then debug it and fix it.
[00:30:32] Unknown:
OpenLineage, somewhat, builds a map, but it builds the map automatically. Right? Like, often, you know, you will have diagrams. People will have pictures of how pipelines depend on each other. Right? We read from here and we write to there. And the good thing about instrumenting, you know, your schedulers and your pipelines with something like OpenLineage is that it's going to automatically draw a map of what's actually running. And so from that perspective, it's going to be more accurate, and it's also going to show you over time how that map has changed. Right? And so it's going to be more accurate than, you know, a user-drawn map. But, of course, there are always limits to this modeling. Right? So, in particular, at the moment, one of the things we care about is adding more precision on partition-level dependencies, for example.
Because, initially, OpenLineage at the core level just knows that, you know, this job wrote to a table, and this other job read from that table. And, therefore, we're kind of inferring that there's a dependency between the previous update and the next update. Right? Someone modified it and someone consumed it, and there's a dependency. However, this is not always true. Very often, processing in the data world is incremental and, you know, it's updating one partition at a time. And, therefore, a job that updates a partition and another job that reads from a different partition don't depend on each other. Right? So it's about having more precision like this, having more precise tracking of dependencies.
You know, depending on the use cases, people care about those levels of precision. And that's part of the model. Right? The model is: if you don't have this level of precision, we'll be able to capture coarse-grained dependencies, like table to job to table. But if you have more precision, like, you know that this column actually depends on that column, you can add a facet that captures more precise dependencies at the column level, if you're not reading all the columns from a table, for example. Or if you care more about incremental processing and capturing the dependency at the partition level, that can also become a facet. Like, actually, when we were reading the input, we were filtering and reading only a subset of the data. And when we were writing to the dataset, we were appending a partition. Right? And this is the subset of the data that was produced by that run. So, depending on the integrations and their level of precision and how much they can introspect the job, they'll be able to extract more information. So typically, a SQL job is very explicit about everything it's doing, and the engine knows exactly how each column is derived from something.
And something that's more like pandas or Spark might have more opaque logic that would be harder to introspect. You know, not all jobs may have the same level of precision. But that's part of the flexibility of the model: to be able to get the precise metadata where you can.
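The column-level dependencies described above can ride along as an extra facet when the integration can introspect the job that precisely. The structure below is a hypothetical sketch of the idea, with made-up facet, dataset, and column names, not the exact spec:

```python
# Hedged sketch of a column-level lineage facet: each output column lists
# the input columns it was derived from. All names are illustrative.
output_dataset = {
    "name": "analytics.revenue_by_country",
    "facets": {
        "columnLineage": {
            "fields": {
                "revenue": {
                    "inputFields": [
                        {"dataset": "public.orders", "field": "amount"},
                    ]
                },
                "country": {
                    "inputFields": [
                        {"dataset": "public.customers", "field": "country_code"},
                    ]
                },
            }
        }
    },
}

# With this facet present, a consumer can answer "what feeds column X?"
# without re-parsing the SQL; without it, only table-level lineage is known.
deps = {
    col: [f["dataset"] + "." + f["field"] for f in info["inputFields"]]
    for col, info in output_dataset["facets"]["columnLineage"]["fields"].items()
}
print(deps)
```

A SQL integration could fill this in fully; a more opaque pandas or Spark job might omit the facet and fall back to coarse table-to-job-to-table edges, which is exactly the graceful degradation described above.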
[00:33:41] Unknown:
And then another interesting potential complication is you're mentioning, you know, everything is based around this job and the execution of the job. But what do you do when you're working in a streaming system? You know, how do you determine what is the start and end of that job?
[00:34:29] Unknown:
Yes. That's a very good point. So with OpenLineage, of course, people have streaming and batch processing jobs, and the goal is to cover all of those things. So far, we've made more progress on the batch processing side of things, but, definitely, streaming jobs are covered by this model. And there are different aspects in which you can cover that. So first, this notion of facets lets you add precise metadata. Right? Like, there are different notions of a dataset. The dataset could be a table in a warehouse. It could be a folder in an S3 bucket or in the HDFS distributed file system, or it could be a topic in a Kafka broker.
Each of those will have slightly different metadata. And same thing about the job. Right? A job could be a batch process, a Spark job, or a SQL query. Or it could be, you know, a streaming job: Flink or Kafka Streams or Storm or those different things. And for streaming jobs, you know, I know we like to think of streaming jobs as continuously running. But streaming jobs still get deployed at some point, started at some point, and then they're stopped, and then they're upgraded, and then they're started again. Right? So you still have this life cycle of a run. And so you'll have a streaming job, and it will have a version of the code in the metadata, and it will have a run starting at a certain point using some version of the code.
And then when you think of what version of the dataset it's starting from, right, like, if you read from a Kafka topic, well, for where you started reading from, the metadata will be the list of offsets: the partitions and the offsets in those partitions you started reading from. So you can very much model where you start reading for this particular run of that particular version of the streaming job. And same thing at the point where it stops: you can capture the metadata of what offset it stopped reading at, and, you know, capture this information of where it stops. So this generic model of the notion of jobs and runs and datasets is pretty stable, and it applies to streaming as well.
But, of course, the metadata you collect looks different. And there's this notion of facets. You know, like, in the Scala world people are used to defining traits, or in Java, having different interfaces. Right? It's about having the ability to add different facets of metadata to those different entities. And, of course, a Kafka topic has different metadata from a warehouse table, but the overall lineage notion very much applies. Right? And you'll be able to track, you know, a service writing to a Kafka topic, and then this being archived into an S3 bucket, and then being consumed, you know, by your Spark job, and maybe exported into a data warehouse, and then a SQL query runs on it. And it's really important to be able to track lineage across all those layers and be able to understand the metadata that's specific to each of those when understanding what's happening. So I know I've been using a lot of the batch processing use cases, you know, to explain how this works.
But, of course, the streaming ecosystem is also covered. And I guess there's more need in the batch world at the moment, which is why there's been more focus in this area. But there are very much people in the community who are asking those same questions and, like, figuring out how best to model those things for streaming pipelines and capture the entire lineage graph.
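The streaming case described above can be sketched the same way as batch: a deploy of a streaming job is a run, and the Kafka offsets it starts and stops at become version metadata on the input dataset. The `kafkaOffsets` facet name and all job and topic names below are assumptions for illustration, not the spec:

```python
import uuid

# Hedged sketch: one "run" of a streaming job modeled as a pair of events.
# Where the job started and stopped reading is captured as per-partition
# offsets in a made-up "kafkaOffsets" facet.
run_id = str(uuid.uuid4())

start_event = {
    "eventType": "START",
    "run": {"runId": run_id},
    "job": {"namespace": "streaming", "name": "enrich_clicks"},
    "inputs": [{
        "name": "kafka://broker/clicks",
        "facets": {"kafkaOffsets": {"0": 1000, "1": 980}},  # offsets at start
    }],
}

# Later the job is stopped (for example, to upgrade the code); a COMPLETE
# event records how far each partition was read during this run.
complete_event = {
    "eventType": "COMPLETE",
    "run": {"runId": run_id},
    "job": {"namespace": "streaming", "name": "enrich_clicks"},
    "inputs": [{
        "name": "kafka://broker/clicks",
        "facets": {"kafkaOffsets": {"0": 5400, "1": 5275}},  # offsets at stop
    }],
}

# Records consumed per partition during this version of the job:
start = start_event["inputs"][0]["facets"]["kafkaOffsets"]
stop = complete_event["inputs"][0]["facets"]["kafkaOffsets"]
consumed = {p: stop[p] - start[p] for p in start}
print(consumed)
```

The stable part is the job/run/dataset model; only the facet contents differ between a warehouse table and a Kafka topic.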
[00:37:59] Unknown:
In terms of that, what are some of the problem domains or, you know, types of data or specific industry verticals that are best suited to using OpenLineage and capturing this type of metadata? And what are some of the problems or use cases that you are explicitly leaving out of scope for OpenLineage?
[00:38:19] Unknown:
OpenLineage is specifically focusing on capturing lineage and metadata of running jobs. Right? There's something running. It's starting, it's ending, it reads from this and writes to that. So there are a lot of use cases. Operations, you know, making sure those jobs are reliable, show up on time, and the data is correct, is one use case. Another use case is compliance, you know, like, for example, privacy with GDPR and CCPA, and making sure that you know where your user private data is flowing. There's a bit of discovery: you know, what datasets exist, and how they're being used, or where they're coming from. There's some governance applicable as well, like making sure people use canonical datasets to derive information from, and they don't, like, go to the wrong version of the data.
And then for things that are not covered, I would say that OpenLineage is focusing on these running jobs. Right? There's also metadata that is external to those running jobs. Right? Like, people would care about defining somewhere what's the canonical source of truth for the country codes, right, or for the user IDs. Or, you know, who are the stakeholders for a given dataset? This typically doesn't quite fit in the OpenLineage integration, because OpenLineage focuses on collecting information from the automation, the automated jobs and all of those things, or ad hoc jobs as well, but not metadata that's external to that. However, I think this facet model applies very well. We actually have a lot of questions on the Slack around this. It's like, hey, how do I add to the model,
you know, who are the stakeholders for a dataset? And this is less applicable to OpenLineage, but it's more applicable to something like Marquez, for example. In Marquez, we care about adding all the metadata that is not directly linked to the job itself, but is more, like, external to that. So that's kind of the distinction.
[00:40:22] Unknown:
For people who already have maybe a data catalog or a metadata management system, or they've already got some set of jobs that they want to integrate to start generating the OpenLineage metadata, what's involved in actually evolving their systems to work well with the OpenLineage specification, either as producers of that metadata, or storing it and consuming it and integrating it with the rest of their ecosystem?
[00:40:46] Unknown:
Yeah. So working with the spec is fairly straightforward. Right? This is a JSON schema. So, basically, it's about producing JSON objects that represent your lineage, or consuming them, depending on which side you are on. And we provide built-in Java and Python clients, but it's a JSON schema spec, so you can use JSON schema or related OpenAPI projects to generate clients in other languages. Most languages are supported by JSON schema in general. And so adding integrations and exposing lineage is pretty straightforward. The core model is very simple on purpose. Right? And so there are two aspects. One is, if you already have a catalog and you've collected lineage and you're interested in exposing that to other tools that understand OpenLineage, you can use one of those clients to produce the metadata and forward this metadata to those systems.
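On the producer side, what an integration emits is essentially a pair of JSON run events. The sketch below shows the general shape under stated assumptions: the job, namespace, and dataset names are made up, and a real client would POST these to a lineage backend rather than print them:

```python
import json
import uuid
from datetime import datetime, timezone

# Hedged sketch of producer-side output: START and COMPLETE run events
# sharing a runId. Field layout follows the general shape described in
# the conversation; all concrete names are illustrative.
def run_event(event_type, run_id, inputs=(), outputs=()):
    return {
        "eventType": event_type,  # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my_pipeline", "name": "daily_orders"},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

run_id = str(uuid.uuid4())
start = run_event("START", run_id, inputs=["raw.orders"])
complete = run_event("COMPLETE", run_id, outputs=["public.orders"])

# A real integration would send these over HTTP; here we just serialize.
for event in (start, complete):
    print(json.dumps(event)[:80], "...")
```

Because the payload is plain JSON validated by a schema, any language that can serialize JSON can act as a producer, which is the point being made about generated clients.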
On the other hand, if you're interested in consuming OpenLineage, because you already have a data catalog but you want more lineage coverage, you can easily consume those events that will tell you, you know, this job is reading from this dataset and writing to this dataset, and fold that into your model by just consuming those events. Right? And so this is about understanding JSON events that contain this metadata. One of the advantages of this is to enable federation of tools. Right? So you can take care of this integration and collect lineage once, and really push that into your ecosystem. And then you can have various tools that use this metadata to focus on different use cases, whether it's privacy or operations, governance, and compliance, and things like that, which I think is very powerful. For example, talking to Egeria: the Egeria project is some kind of metadata hub. Right? So one feature they're working on is being able to consume OpenLineage as well as produce OpenLineage, so making it easy to synchronize metadata systems, to make sure, like, the view of lineage is the same in various systems that focus on different use cases.
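On the consuming side, a tool only needs to understand the JSON events to rebuild a lineage graph, regardless of which producers emitted them. A minimal sketch of that fold, with made-up job and dataset names and a simplified event shape:

```python
# Hedged sketch: fold run events from any producer into lineage edges
# (dataset -> job -> dataset). The event shape is simplified; names are
# illustrative.
events = [
    {"job": "ingest", "inputs": ["s3://raw/orders"], "outputs": ["wh.orders"]},
    {"job": "report", "inputs": ["wh.orders"], "outputs": ["wh.revenue"]},
]

edges = set()
for ev in events:
    for src in ev["inputs"]:
        for dst in ev["outputs"]:
            edges.add((src, ev["job"], dst))

# Walk upstream from a dataset to find everything it depends on.
def upstream(dataset, edges):
    deps = set()
    frontier = {dataset}
    while frontier:
        current = frontier.pop()
        for src, _job, dst in edges:
            if dst == current and src not in deps:
                deps.add(src)
                frontier.add(src)
    return deps

print(upstream("wh.revenue", edges))
```

This is the federation benefit in miniature: a privacy tool, an operations tool, and a catalog could each run a fold like this over the same stream of events without any producer-specific integration work.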
[00:42:58] Unknown:
Digging more into this federation aspect and the sort of downstream effects, I'm wondering how you see this actually playing out as the broader data community starts to get on board and adopt OpenLineage, and some of the powerful network effects that this will have.
[00:43:19] Unknown:
Initially, OpenLineage has a set of integrations. You know, there's a Spark integration that collects lineage from Spark. There's an Airflow integration that will also collect from the various SQL warehouse offerings. But the way this is happening, in talking to various projects, the goal is to push OpenLineage into each of the projects. Right? And this is the kind of thing that's slowly happening. The goal is not to have a central project with all the integrations to everything. The goal is for each project to be able to expose its own lineage as OpenLineage. And OpenLineage is just a spec. You know, it's like a Java interface if you're in the JVM.
As just a spec, it doesn't pull any dependencies into the project. Right? It's just having the ability to expose your lineage in a standard representation. And so then, at some point, it would become just turnkey in the entire data ecosystem. People will not have to care about, oh, how do I extract the lineage from everything? Right? Like, everything is going to expose lineage in a standard way, already built in. Right? Whatever project, whether they use Spark or Flink or a SQL warehouse, they'll have this lineage and metadata exposed in a standard way. And so we'll have better tools. Right? This is really an opportunity for the data ecosystem to have better tools that give you visibility across everything.
So, of course, you know, we are building tools for better observability of data pipelines. And other people care more about governance, you know, and privacy. And I know of people caring a lot about privacy by design. Right? How do we make sure that, by design, user private data never leaks into places it's not supposed to be, or isn't used for purposes people haven't consented to? Right? And really building that into the system and making it more observable will enable a lot of this. So, for the end vision, from a data operations perspective for me: I have this Maslow hierarchy of needs. Right? You know, this hierarchy of needs is, like, before you reach happiness, you need to have shelter, you need to have food, you need to be safe. And when you have those things, then you can care about reaching happiness and all those things. It's a bit the same with data. Right? Like, first, the data needs to show up, needs to show up on time. It needs to be correct.
And once you have all those things, then you can actually get value out of it. Right? So people joke about how much time is spent cleaning up the data or, like, you know, debugging broken pipelines, or making sure we can trust the data. It's really about building this trust in data: knowing that you just know, and it's observable, and it can be proved that everything is working accordingly. Whether it's from an operations perspective, being able to trust the data, or whether it's from, like, a privacy-by-design perspective, making sure you do the right thing with your user private data, or being able to prove that you're using the correct source of truth of data to produce certain metrics.
A lot of that. Right? It's really making everything work better and be reliable. And then you can prove it. Right? There's not this nagging question of, like, is this dashboard reflecting what we think it's reflecting, or
[00:46:40] Unknown:
something like that? The dreaded sentence of "this doesn't look right."
[00:46:44] Unknown:
Yes. And maybe it is right, and something else is the problem. And maybe it's not right. But if you don't know, it's problematic. Right? You can't make a decision.
[00:46:54] Unknown:
And so in terms of your experience of working through this OpenLineage spec and working with all these other projects and organizations, and integrating this into the work you're doing at Datakin and Marquez, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:10] Unknown:
I'm an engineer. I'm a software engineer. Right? So I relate to this. But I think OpenLineage is more of a human problem than a technical one. Right? When you look at the way people have been building lineage solutions, there are a lot of software engineers, like myself, who would rather reverse engineer what others have done than, you know, have this conversation about how we would standardize and, like, make it easy to consume for everyone. And so there are a lot of systems that are very complex, because people look at it like, oh, I can look at the internals of that tool, and I can understand how it works, and I can extract something without having to ask anybody to change anything. Right?
And that works. Right? And there are a bunch of solutions that exist that solve this particular problem: we will solve the complexity of all the things that exist for you by understanding the intricate details of every one of those things. And the problem with that is it's very brittle, it's very costly, it's very time consuming, and it gets obsoleted, right, as technologies evolve. And, like, whoever became the lineage solution for Hadoop has now invested a lot of time in something that's not used as much anymore. Right? And so I think it's a lot of a human problem of, like, how do we make those conversations happen? Right? And the way the OpenLineage spec is designed is to facilitate those conversations. Right? Let's avoid the giant monolithic spec, where it's a conversation that will never converge. Let's have a lot of small, focused conversations on the things we care about.
And so the people who care about a particular aspect, let's say it's table schemas or column-level lineage, can help that particular area of the problem move forward. Right? And so I think one of the learnings is, like, you know, when people talk, good things happen. And a lot of OpenLineage is just facilitating, making those discussions happen. And, you know, when I initially reached out to this group of people, a lot of them were, like, hey, how come this doesn't exist yet? All we needed was to plant the seed, and then it would happen. You know, there's this interesting children's story, which is the stone soup.
You know, someone comes to a village with a pot, and they stir hot water. You know? They just heat water, and they put a large stone in it. And they say, look, I'm making a stone soup, but everybody's welcome to add their own ingredient to it and make it special. Right? So basically, they're making a soup from nothing, but they're creating this focal point. It's like, look, someone is driving the collaboration. And like the stone soup, the more people come and put their contribution to it by adding an ingredient, the more valuable the thing becomes for everyone. Right? So it's really, when you think about open source, about how we create good incentives.
And this is this kind of perfect project where every contribution, no matter how small, is going to make the project better for everyone who's contributing to it. So it's a really good virtuous circle of, like, how do we gain momentum in this project? And everybody's like, yes, this is great, we want to contribute to this, because we are part of this effort now, it's valuable to us, and it's more valuable for everyone. And I think this reflects some of the learnings from open source, you know, over the past 15 years. I guess it reflects a bit some of the projects I've been pushing or driving. It's kind of, hey, how do we standardize this? How do we make this reusable and enable the ecosystem?
[00:50:53] Unknown:
Absolutely. And so as you continue to go down this road of building and growing the OpenLineage project, what are some of the things that you have planned for the near to medium term future? And for anybody who's listening, what are some of the ways that they can be most helpful in making this effort a success?
[00:51:09] Unknown:
There are a couple of directions. Right? From a technical perspective, there's evolving the spec. Right? Like, contributing to the facets. People can just use it and make their own custom facets, or contribute them. There's a proposal mechanism on GitHub that lets you submit a proposal and say, like, hey, I think this particular use case is not covered, we need to formalize this. You can see there are several conversations happening in GitHub issues on the project. And so one aspect is helping standardize more facets of the metadata. Another aspect is increasing coverage, like, how do we adapt OpenLineage? So, you know, there's an existing effort in Spark, but people are welcome to join.
And how do we increase the coverage with other projects like Flink, like Beam, like the streaming efforts? The other area is increasing coverage in systems that consume OpenLineage and are able to understand it, so that we can help with this federation aspect. So that's how people can contribute and join the conversation. And it's really a community driven spec. Right? The goal is really to enable people to contribute and drive the areas they care about. And as part of that, we're in the process of submitting it officially to the LF AI & Data Foundation. You know, it's a sub-foundation of the Linux Foundation that cares about data projects.
Marquez is part of this foundation already. And OpenLineage, being fairly recent, is not officially part of it yet, so it's being submitted soon. And the goal of being part of a foundation is really this testament that, look, this project is not owned by anyone in particular. Right? This is a community driven project. There's a clear governance to it. And the goal is to standardize lineage and make exchanging lineage and metadata seamless across ecosystems. Like, we, as a data community, are making this happen. So part of it, in the immediate future, is having a clear governance.
And when you contribute to a project, I think I'm very opinionated about this: if you have an open source project, you know, there's a license aspect. And, of course, OpenLineage is Apache 2 licensed, so it means you can do what you want with it. There's no restriction. The only restriction is you have to give credit. Right? You can't claim you did something if you're reusing an Apache project. But then the other aspect is to have a clear governance, and, you know, being under a foundation means the license is not going to change. Right? It's always going to be a community driven project. And, you know, there's some discussion that you see happening in some projects of, like, hey, should we relicense this and protect it from someone, like, using this project?
This is not the goal here. The goal is, yes, this is a project that's open source. It's for the community. It's always going to be usable by anyone, and that's why contributing it to a foundation is really important.
[00:54:13] Unknown:
Well, for anybody who wants to get involved and get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:29] Unknown:
So I'm a little biased in that space, because I'm building a tool that I think is sorely missed. And, you know, I care a lot about data pipelines, observability, and how to make sure those pipelines are reliable. Right? They're running on time, they're producing correct data, and we can very quickly identify what the problems are. Whether it's data quality, or someone making a change and not understanding the downstream impact of it. Or, you know, as part of this, being able to analyze this lineage information and really being able to quickly understand the root cause when there's a problem.
But even more than that, it's preventing problems altogether by being able to identify the consequences of a change before it gets applied. Right? Like, oh, if you're changing the schema of this dataset, it's going to impact such and such jobs downstream. Right? Or this thing is slower than before, the dataset is not showing up on time anymore. How do we quickly troubleshoot and figure out what changed upstream to make things slower? Or maybe it's just a change in the data or volume, something like that. And so I think this is very lacking. Right? A lot of time is spent, like, figuring out why things are not working quite right. And I think that's a big gap. Yeah, like I said, I'm a bit biased, because that's kind of what we're building at Datakin.
I think this is extremely important, and I expect that in the future, no data engineer or data scientist will accept a job where they don't have proper tooling to understand how their particular contribution
[00:56:07] Unknown:
works and is impacted by the wide ecosystem of consumed and produced datasets around them. Well, thank you very much for taking the time today to join me and share the work that you're doing on OpenLineage, and to touch a little bit on the work you're doing at Datakin. It's a very interesting and valuable project, and I appreciate you getting the ball rolling and helping us all by getting this effort out and getting it working. So I'm definitely going to be keeping a close eye on OpenLineage and, you know, hopefully adding my own contributions to making it more a part of the ecosystem. So thank you again for your time and energy on that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. Always a pleasure, and I really enjoy this podcast. Thank you. Bye bye. Thank you. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Julien Le Dem: Background and Experience
Marquez Project and Data Observability
Introduction to Open Lineage
Data Lineage vs. Metadata Management
Standardization Efforts and Core Elements
Community Involvement and Contributions
Core Model and Custom Facets
Challenges in Storing and Analyzing Lineage Data
Real-World Applications and Edge Cases
Handling Streaming Data and Job Lifecycle
Use Cases and Scope of Open Lineage
Integrating Open Lineage with Existing Systems
Future Directions and Federation of Tools
Lessons Learned and Community Collaboration
How to Get Involved and Contribute
Biggest Gaps in Data Management Tooling
Closing Remarks and Thank You