Summary
Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. To eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in cooperating on open standards.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
- Your host is Tobias Macey and today I’m interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the Open Lineage project is and the story behind it?
- What is the current state of the ecosystem for generating and sharing metadata between systems?
- What are your goals for the OpenLineage effort?
- What are the biggest conceptual or consistency challenges that you are facing in defining a metadata model that is broad and flexible enough to be widely used while still being prescriptive enough to be useful?
- What is the current state of the project? (e.g. code available, maturity of the specification, etc.)
- What are some of the ideas or assumptions that you had at the beginning of this project that have had to be revisited as you iterate on the definition and implementation?
- What are some of the projects/organizations/etc. that have committed to supporting or adopting OpenLineage?
- What problem domain(s) are best suited to adopting OpenLineage?
- What are some of the problems or use cases that you are explicitly not including in scope for OpenLineage?
- For someone who already has a lineage and/or metadata catalog, what is involved in evolving that system to work well with OpenLineage?
- What are some of the downstream/long-term impacts that you anticipate or hope that this standardization effort will generate?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on the OpenLineage effort?
- What do you have planned for the future of the project?
Contact Info
- @J_ on Twitter
- julienledem on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- OpenLineage
- Marquez
- Hadoop
- Pig
- Apache Parquet
- Doug Cutting
- Avro
- Apache Arrow
- Service Oriented Architecture
- Data Lineage
- Apache Atlas
- DataHub
- Amundsen
- Egeria
- Pandas
- Apache Spark
- EXIF
- JSON Schema
- OpenTelemetry
- OpenTracing
- Superset
- Iceberg
- Great Expectations
- dbt
- Data Mesh
- The map is not the territory
- Kafka
- Apache Flink
- Apache Storm
- Kafka Streams
- Stone Soup
- Apache Beam
- Linux Foundation AI & Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you're flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance, whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn't disrupt legacy systems.
High-growth startups use Molecula's feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to preprocess data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale real time data, visit dataengineeringpodcast.com/molecula. That's M-O-L-E-C-U-L-A, and request a demo. Mention that you're a Data Engineering Podcast listener, and they'll send you a free t-shirt. Your host is Tobias Macey. And today, I'm interviewing Julien Le Dem about OpenLineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools. So, Julien, can you start by introducing yourself? Hello.
[00:02:20] Unknown:
I'm Julien. I guess I've been working in the big data space for the past 14 years. I started at Yahoo, building platforms on top of Hadoop. And then I started contributing to open source projects like Pig, and that's how I joined the Twitter data platform team. There I started the Apache Parquet project, and that led to contributing to the launch of the Apache Arrow project later on. And more recently, I was the architect for the data platform at WeWork.
[00:02:52] Unknown:
After that, I started Datakin, where I'm the CTO now. And so you've actually been on the show. This is your 3rd time now. So you were on to talk about your work with Apache Parquet, and you were on with Doug Cutting, who was the creator of Avro. So that was a good conversation. And then you were also on to talk about your work with Marquez, which is a natural transition to where you are now with Datakin, which is building on top of that platform. So for folks who listened to the Marquez episode, I don't know if you wanna just give a quick recap about where that project has gone and maybe what you're building on top of it with Datakin before we dig into OpenLineage.
[00:03:25] Unknown:
Yeah. So, you know, that came from when you build a data platform, it quickly becomes evident that you need an equivalent of service oriented architecture, but for data pipelines. Right? Like, people consume data and they produce data. And by default, there's very little visibility into who consumes what we produce and how they're impacted by the changes we may be doing. And we need to understand where the data is coming from that we're consuming and how it's being maintained and delivered. Right? So that led us to start the Marquez project at WeWork.
So in building the data platform at WeWork, there was this real need for, how do we understand an organization where there are many teams that consume and produce data? How do they understand how they depend on each other and how things change? That was a missing piece in the open source ecosystem. Right? So that was the reason for starting Marquez, and that was the beginning point for starting Datakin. And, like, how do we build data observability for data pipelines?
[00:04:31] Unknown:
So I know that OpenLineage is also an outgrowth of the work that you've been doing at Datakin and the prior work with Marquez. Can you start by just giving a bit of an overview about what the OpenLineage project is and maybe some of the story behind starting this effort and where it's taken you so far?
[00:04:46] Unknown:
Yeah. So while building Marquez, it became obvious that there's a need for standardization of lineage collection. Right? Like, there are many projects that care about collecting lineage, and there are many projects that produce lineage, right, whether it's through SQL or Spark, Python, Pandas DataFrames. There are a lot of projects that produce lineage, and there are a lot of projects like Atlas, Egeria, Amundsen, DataHub, Marquez, and other proprietary data catalogs, you know, who are interested in collecting this lineage. So there's lots of duplication of effort on how we extract the lineage for each of those projects.
And it creates, first, a lot of duplication. And second, it's very brittle, because by extracting the lineage from the outside of those projects, whenever something changes in their internals or in how they're exposing it, it breaks this extraction. So there's lots of duplication, a lot of wasted effort, a lot of complexity. And so the idea was, let's join forces. Right? So we reached out to a large group of contributors and maintainers of those projects, whether in the space of people who are more interested in consuming, like the DataHub, Egeria, and Amundsen projects, or people producing this data. Right? Like, in the Spark community, in the Pandas community, and, like, all the SQL-related projects.
And, really, we kind of got together and said, like, hey. How do we build this standard in open source? And how do we agree on having a standard way of representing lineage? And so, you know, as soon as those conversations started, it was pretty clear that everybody was, like, yes, why don't we already have a standard like this? That's where we started the project. Like, we did a kickoff with an initial group of contributors, and we got the first version of the standard, really having a core model of inputs, outputs, and runs of a job, capturing the lineage, and having a framework for adding metadata around this core model. And it's really an invitation for all the people who care about lineage to help define this metadata and improve that project.
[00:07:02] Unknown:
As far as the state of the ecosystem, you mentioned that each of these different systems is generating metadata, but there isn't really a way for them to be able to communicate that effectively. And then another aspect worth digging into is, you know, so we're talking about lineage specifically. And I'm wondering, before we dig into the ecosystem piece, if you can differentiate between data lineage and then the broader category of metadata management, how those 2 are interrelated, and the scope of what OpenLineage is trying to accomplish in terms of standardizing at least the lineage aspect of that metadata format?
[00:07:40] Unknown:
Lineage is something we talk a lot about, but I think there are several use cases. So when people talk about lineage, they actually have different conceptions of it, because there are fairly different use cases related to what we do with lineage. So there's one aspect which is more about governance, like understanding how a particular dataset, or a particular column in a dataset, is derived from a canonical data source that is the source of truth for defining something. And then there's lineage in the sense of, I want to know exactly how this dataset was derived from this other one, what was the version of the job, and it's more like operational lineage of understanding what happened and how things were transformed.
And that led to building products. The tools that care about lineage focus on one of those use cases. So they may be focusing on, oh, I want to understand how this particular metric is derived from core datasets. Right? So that's one aspect. And others may care more about, actually, I want to make sure that those transformations are reliable and happening at the right time. And those lead to collecting a slightly different version of the lineage and building different use cases on top. And so OpenLineage has been focusing on how do we model those running jobs. Right? Because that's the core underlying data on which you can build a lot of use cases. Like, whether you care about privacy or governance or compliance or operations, you need to be able to understand how the running jobs are happening. So I think these different points of view on the use cases led to different solutions and make it difficult to reuse if we don't focus on standardizing and modeling those running jobs and really collecting this metadata. In some ways, OpenLineage is a little bit like EXIF for data pipelines. You know, EXIF is this standard for adding metadata to pictures. So when you take a picture with your digital camera or smartphone, you have your GPS coordinates in it. You'll have the time it was taken, and this is all standardized.
And this is a bit the same. Right? The best way to capture this metadata about the job and the transformation it is doing is when it's running and this information is available.
[00:10:07] Unknown:
And so this is what OpenLineage is focusing on. In terms of the actual goals of the OpenLineage effort, there's obviously the need to be able to standardize on the format of some of these core aspects of generating data and processing data. So how are you approaching the actual standardization effort, identifying what those core elements and building blocks are for being able to generate useful information, and the overall scope of what you're trying to build with this standard? Is it just, you know, some markdown and other people have to go and implement it, or do you also have reference implementations and code that people can take off the shelf to be able to actually get up and running with it? The approach to this standardization, I guess, like, we talk about a standard, but it's very different from what a standards body would be doing. Right? Like, one of the goals of OpenLineage
[00:10:55] Unknown:
has been to not have one big monolithic spec. So the way it's defined is there's a core spec. The way it's encoded is using JSON Schema. So there's kind of a formal representation of what the metadata looks like. And there's a core model that captures, you know, just a very simple lineage notion of, like, there's a job, and it's uniquely identified by its name. There are input and output datasets that are also uniquely identified by their name, and there's this notion of a run. Right? So we're expecting there are recurring jobs, and there's a run, and we know when it started, when it finished, what datasets it read from, and what datasets it wrote to. And then around that, there's this notion of facets.
And a facet is an atomic piece of metadata. Basically, it's its own mini spec. So each facet has its own JSON schema. And what that helps is, as a community, instead of having one big discussion that moves very slowly and one monolithic spec, there's a core spec that's very small and just focusing on lineage. And then we can have different, very focused conversations on how do we define the schema of a dataset? How do we define data quality metrics, you know, data profiling? How do we define column level lineage? How do we define a representation of a query profile, for example? And how do we define the version of a job? So in this model, you know, the nice thing about open source is the first goal is to agree on the common direction for the project.
And then you can have different conversations moving at different speeds. So for things that are controversial, we're going to make sure there's one conversation happening independently where people can argue and find a consensus. And things that are less controversial can move fast because they'll have their own facet to move on. So it's a bit of controlled chaos, with emerging facets, really empowering people to drive the efforts they care about. Like, there are people who care about data quality, people who care about governance and column level lineage. And these may be different subgroups in the community, and that helps make each of these conversations move faster. So that's kind of one of the goals. And so, you know, the naming of OpenLineage is also drawing the parallel with OpenTelemetry and OpenTracing, right, which are providing a similar effort for the services world. There's a nice parallel because it's really about making a spec, standardizing how the lineage and quality metadata is exposed.
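The core model described here (a named job, a run with a start and finish, and the input and output datasets it touches) can be sketched as a plain JSON event. This is an illustrative sketch of the event shape rather than a canonical example from the spec; the namespaces, dataset names, and producer URL are made-up placeholders.

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(event_type, job_name, run_id, inputs=(), outputs=()):
    """Build a minimal lineage run event as a plain dict.

    Follows the core model described above: a job identified by a
    namespace and a name, a run identified by a UUID, and the input
    and output datasets it reads and writes. All namespaces and the
    producer URL are illustrative placeholders.
    """
    return {
        "eventType": event_type,  # e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-pipeline", "name": job_name},
        "inputs": [{"namespace": "my-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "my-warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-scheduler",  # who emitted this
    }

run_id = str(uuid.uuid4())
start = run_event("START", "daily_sales_rollup", run_id,
                  inputs=["raw.sales"])
complete = run_event("COMPLETE", "daily_sales_rollup", run_id,
                     outputs=["analytics.daily_sales"])
print(json.dumps(start, indent=2))
```

Note that the START and COMPLETE events for the same run share a runId, which is what lets a consumer stitch them back together into one run of the job.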
But then, of course, as you mentioned, there's a reference implementation. So OpenLineage provides clients to make it easier to produce OpenLineage events. And there's a reference implementation, which is the Marquez project, that consumes and stores and indexes lineage events. And, of course, there are several other projects looking into consuming OpenLineage. So today, if you want to use OpenLineage out of the box and you don't have anything, you can start by deploying Marquez, and you can start consuming. And there are several integrations that produce OpenLineage events for Spark, for BigQuery, for Redshift, for Snowflake, so Spark, Airflow, and the SQL warehouses, to start collecting lineage in Marquez. But, of course, OpenLineage is a mechanism that lets you share this information. Right? It can easily be used in different tools or different products that may have a different point of view. Right? When we talk about the more governance aspects or the more operational aspects, the fact that there's a standard for exchanging metadata makes it easier for various tools to give you those different points of view on the lineage data. In terms of the actual
[00:14:49] Unknown:
effort of getting this off the ground, I know that when you first published the post announcing this effort, there were already a number of other projects and organizations that had signed on to it. Some notable ones that come to mind are Superset, DataHub, and obviously Marquez and Datakin. You mentioned Airflow. And I'm just wondering how much upfront effort there was in terms of discussing with people what these core building blocks needed to be and agreeing on that core structure before you started writing the specification, or if it was flipped, where you started writing the specification and then you got people excited and signed on, and the process involved in identifying which core elements are universal enough to be worth keeping in the core specification versus adding into these facets that are, you know, optional plug-ins to the metadata information?
[00:15:39] Unknown:
Yeah. So we took an approach that was similar to what we did for Apache Arrow, some years back, contributing to the launch of the Apache Arrow project. And we did a similar approach, which was, let's reach out to all the people we think are interested in that space or, like, need this kind of standardization. So, you know, it's a little bit through the Apache Parquet community and then the Apache Arrow community. So I reached out to projects like DataHub, as I mentioned, you know, the Pandas project, people in the Spark community, Iceberg, the data quality world, like Great Expectations or dbt as well.
We started by reaching out to people, and, like, most people I reached out to were agreeing, like, yes, we need a standard. Why doesn't that exist already? How do we make it happen? So what we did, we started a GitHub project, a mailing list, a Slack channel. And so we started with no spec. So I didn't come with a spec pre-written, but we started by discussing and agreeing on, like, the core model. And, really, the initial spec is really minimalistic. So it's on purpose, narrowing it down to just the lineage information, which is the connection between the job names and the dataset names they're reading and writing in a run. Right? And so it focused on that, which helped getting to a consensus pretty quickly with this initial group of people. And then a few other people kind of heard about it, like, we tweeted about it, we presented about it, and they said, oh, well, we were not part of the initial conversation, but we'd like to join.
And in particular, I'd say, like, the projects that have been the biggest drivers in this effort have been, like, the Egeria project, which is a project about metadata and governance, like, a much wider scope of metadata and governance. The dbt folks were very helpful. The people around the data mesh efforts have given a lot of feedback. You know, there are people from the Great Expectations project who are really interested in the data quality aspect of this. And I'm sorry, I apologize in advance for the people I'm forgetting. You know, it seems like in every open source project, there are people who are kind of more the drivers, like, pushing the project, and people who are more interested in following and making sure they're on top of it and giving their feedback. You know, they'll adopt as it gets more steam as well. So that's kind of where we're at.
[00:18:10] Unknown:
In terms of the actual core model, you mentioned that there's this concept of the entity, and then there are the jobs and the runs. So you had this group of people who were, you know, working together to figure out what are the elements that we need to include in the specification. And I'm just wondering what you've run into as far as any sort of differences in opinion or complexities in terms of figuring out how do you fit this model that needs to be so broadly applicable, but also opinionated enough to be worthwhile, just figuring out what that balance is, particularly because you have so many people involved who I'm sure have their own opinions as to what's most important and how to represent it and how they might need to consume it and just the overall process of going from this is a problem that needs to be solved to we have a standard that people are iterating on and actually using somewhere.
[00:18:56] Unknown:
Yeah. So one of the design goals of the core model is to really remove impediments to people being able to use it without having to go through, like, a thorough approval process. So there's also this notion of a custom facet. And so there are kind of trade-offs between being very precise and modeling precisely, like, you know, the query profile from Snowflake or the query profile from BigQuery or the query profile from Presto. Right? So there's one aspect of being very precise, and one aspect of having a model that is generic enough that it can represent any of those things in a way that lets you understand a query profile in a generic way. So this notion of a custom facet lets each contributor define their own opinionated version of something. Right? Like, when you define a query profile, each query engine may have, like, their own opinionated version of it. And so they'll have the ability to define, this is how we define a query profile. This is how we define column level lineage, or this is how we define, you know, the table schema.
And so that allows them to make their own specific definition in their own namespace, and they own those specs. Right? They own that definition. They don't need to get approval from the central project or get consensus to build them. And then, on the other hand, there are going to be core facets that are part of the project spec itself and that define a generic version of this. And so when you publish metadata, it's okay to have some, you know, duplication in the metadata, and to have the same information presented both in the opinionated version of one of the contributors to the spec and in the generic version, which might be a bit lossy. Right? Like, if you're trying to make a representation of schema that works across all the possible things that exist, it might be a bit lossy, but there's value in being able to say, hey, we have a way of representing schema that makes sense, you know, across everything, and that will deliver value. So there's kind of this trade-off between being very flexible and extensible, where people can add their own facets,
and also having core facets that aim to be a more generic representation of things. And those will require more consensus and more conversations. But the good thing about having both is, like, people are not blocked. Right? They can start publishing metadata right away from their own point of view and under their own namespace. And then, as we collect more of those, it makes it easier as well to build a consensus. Right? Because people will already have exposed things in a very explicit way. Right? They'll have defined a JSON schema for those things. And so there will be a clear definition of what they're doing, and it will make it easier to arrive at a consensus, combine those things, and, you know, promote some of those facets to the core spec.
So it's kind of the general idea of how we build this in the open source.
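As a sketch of what a producer-owned, opinionated facet might look like, here is a hypothetical query-profile facet attached to a dataset. The facet name, its fields, and the URLs are invented for illustration; what follows the discussion above is the convention that each facet carries a pointer back to the schema that defines it.

```python
# A dataset carrying one custom facet, sketched as a plain dict.
# The facet lives under a producer-chosen key and points at its own
# JSON Schema via _schemaURL; names and URLs here are hypothetical.
dataset = {
    "namespace": "my-warehouse",
    "name": "analytics.daily_sales",
    "facets": {
        "myQueryProfile": {  # opinionated, producer-defined facet
            "_producer": "https://example.com/my-engine",
            "_schemaURL": "https://example.com/schemas/MyQueryProfileFacet.json",
            "elapsedMs": 5230,            # engine-specific fields the
            "bytesScanned": 1_874_003_214,  # producer alone gets to define
        }
    },
}
```

Because the facet is namespaced under the producer's own key, it can evolve without any central approval, exactly as described above.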
[00:22:06] Unknown:
And in terms of the actual structure or infrastructure around this, so, you know, there's the repository that has the core specification. I know that there's a Python and a Java client for being able to build your own integrations. And then for the actual namespaced facets, I'm curious what the discovery mechanism looks like for being able to propagate that information. Is it something where there's a certain producer, they add that into one of these facets, and then it just kind of exists as part of that data element? And then once you're looking at it on the other side, you know, it's been generated, it's been propagated through your system, it's stored in your, you know, system of record for lineage metadata. How do you then discover what that facet model is trying to convey? Like, what are the contextual elements? How do you propagate that? Is that just entirely self-contained as part of the specification?
Or do you have some sort of central mechanism for being able to look up those namespaced elements and then have, you know, maybe some prose to explain what these different attributes are trying to describe?
[00:23:07] Unknown:
When you define a facet, you have to provide a JSON schema, and the facet itself has to contain the URL to your JSON schema. So that will let us build a registry of all the facets that exist. So when you build a custom facet, your requirement is to define a JSON schema for it. And as part of the JSON schema, you can add, like, documentation of the semantics of each of the fields. So, of course, you can put constraints on the names and the types, whether they're required or not, any level of nesting, documentation, examples.
And so every custom facet will contain a link to the schema that represents it. And that's how we'll be able to, you know, keep track of all the facets that exist, all the custom facets, who defined them, and have a registry as part of OpenLineage that keeps track of all of those. Right? So you'll have the core facets, whose schema will be part of the core spec. And then the custom facets must provide a reachable URL with the JSON schema.
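A minimal sketch of what such a published schema might look like, paired with a toy required-field check standing in for a real JSON Schema validator. The facet name, fields, and URLs are hypothetical; only the idea that a facet must satisfy the schema it links to comes from the discussion above.

```python
# Sketch of the JSON Schema a custom facet would publish at its
# _schemaURL. A real consumer would run a full JSON Schema validator;
# here we only check the "required" keyword to illustrate the idea.
facet_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "_producer": {"type": "string", "format": "uri"},
        "_schemaURL": {"type": "string", "format": "uri"},
        "elapsedMs": {
            "type": "integer",
            "description": "Wall-clock duration of the query in ms.",
        },
    },
    "required": ["_producer", "_schemaURL"],
}

def missing_required(facet: dict, schema: dict) -> list:
    """Return the required schema fields absent from the facet."""
    return [f for f in schema.get("required", []) if f not in facet]

ok = missing_required(
    {"_producer": "https://example.com/my-engine",
     "_schemaURL": "https://example.com/schemas/MyQueryProfileFacet.json"},
    facet_schema,
)
```

The `description` entries are where the prose explaining each attribute would live, which is what makes a central registry of these schemas self-documenting.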
[00:24:09] Unknown:
In terms of the actual concrete implementations of this, particularly the storage system for this lineage information, what kinds of challenges are posed by having these arbitrarily structured, arbitrarily nested records and then being able to actually analyze them and expose them in a meaningful fashion, to be able to build some sort of automation or visualization or intelligence on top of these pieces of metadata? And are there any best practices or recommended constraints in terms of, you know, the number of levels of nesting or the specific structuring of attribute names, or anything along those lines, to try and build in some sort of commonality and be able to build tooling that can take advantage of a number of different facet implementations?
[00:24:59] Unknown:
The core of OpenLineage is focused on modeling the running jobs. Right? So that makes it very reusable, but that also makes it not geared toward any specific use case. Like, you know, there are people who care about user privacy. Right? So they want to be able to track how private user data is consumed and used across pipelines. There are people who care about data quality, and they want to make sure the data is correct and see how it evolves over time, and different things like that. So, really, what this is about is what happens when you receive those events. Right? OpenLineage defines how we're going to model metadata coming from the running jobs, and it's up to the consumer to build those indices.
And, like, store the data and index the data in a way that's going to make it easy to use for their particular use cases. And so that's where it's important that the models represent the data, and they're not necessarily optimized for their storage. But there's a spec that defines exactly what it looks like. Right? So you can pick which fields you need to index and how you need to store things to be able to enable various use cases. So Marquez offers a generic representation and storage of those facets, and it does index in a specific way some of those fields in particular that it picks, and then it gives generic access to the list of facets for datasets or for jobs. And so you can also keep track of how the metadata changes over time. So there's a generic way of storing those facets in JSON, and they're defined as JSON.
And then for the facets you care about, that's where you're going to extract this information using those schemas. They'll have a fixed representation, and you'll be able to index them for your use case. And I think this decoupling is what makes OpenLineage very reusable. Right? When you model the data around the way you're going to be using it at the time you collect it, that's where you're actually losing a lot of information and making this metadata you're collecting very specific to your use case. Right? When you care about dataset-to-dataset lineage, and that's how you collect the information, and you drop all the information about the job itself, then you lose the ability to use the same metadata you collected for other use cases. So it's really important to be able to decouple and have a clean model of the running jobs.
That may not be the exact representation you need for, you know, your indexing use case, but that's where you'll have a transformation in between that indexes it
[00:27:34] Unknown:
accordingly. In terms of your experience of going through this process: you're somebody who, as you mentioned, has been working in open source for quite a while, so you're very familiar with that overall process, and you've been working on Marquez and in the data ecosystem. So I'm sure you have a lot of your own opinions and understandings of the various useful aspects of lineage metadata, and metadata in general. But what are some of the ideas or assumptions that you had at the outset of this project that have been revisited or revised as you've iterated through and worked with other people to build out this definition and tried to grow the community around it?
[00:28:09] Unknown:
You know, initially, there was this strong need to standardize things. But, actually, as we grew, this idea of being flexible, letting people customize things, and enabling them to move fast and add their own systems and add metadata to them without friction was very important. Also, I think when we think about data processing, there are some nice properties of it that are best practices. Like, oh, jobs should be atomic, meaning that either they succeed and they update the output, or they fail and they don't touch any of the data. And this atomicity, for example, is not always real. Right? It kind of depends how things are done. And we also define those lineage graphs as DAGs. Right? Like, a directed acyclic graph, because you only consume from source datasets and write to new datasets. Right? And so there are no cycles. Which is not true either. Right? Like, there are jobs that will read the previous version of the data, merge it with new data coming in, and write back to the same dataset, creating cycles in the graph. And this is fine. And, you know, some of those things need to be part of the model, because especially when you have non-atomic jobs or you have cycles, those are things that can create problems in production. So you want to make sure that you are able to observe those things and have an accurate representation of them. So some of the assumptions were about a perfect, you know, spherical data pipeline in a vacuum, so to speak. We had to make sure that the model would actually be able to reflect reality and keep track of those imperfections, because those are the ones you want to be able to observe when you're debugging something.
[00:29:57] Unknown:
Along that same line, you know, there's a quote, I believe it's from Marshall McLuhan, but I could be wrong, saying, you know, the map is not the terrain. And so in this case, OpenLineage and the lineage metadata is the map. And as you were saying, it's not necessarily a perfect representation of the terrain and the reality of the processing system. And I'm curious if there are any other sort of edge cases or difficulties that you've run into in terms of figuring out, you know, okay, well, in the real world, this is what's happening, which makes no sense.
I don't know how this ever worked, but now I need to figure out how to be able to actually represent this so that I can show it to people, and so that they can then debug it and fix it.
[00:30:32] Unknown:
OpenLineage, somewhat, builds a map, but it builds the map automatically. Right? Like, often, you know, you will have diagrams. People will have pictures of how pipelines depend on each other. Right? We read from here and we write to there. And the good thing about instrumenting, you know, your schedulers and your pipelines with something like OpenLineage is that it's going to automatically draw a map of what's actually running. And so from that perspective, it's going to be more accurate, and it's also going to show you over time how that map has changed. Right? And so it's going to be more accurate than, you know, a user-drawn map. But, of course, there are always limits to this modeling. Right? So, in particular, at the moment, one of the things we care about is adding more precision on partition-level dependencies, for example.
Because, initially, OpenLineage at the core level just knows that, you know, this job wrote to a table, and this other job read from that table. And, therefore, we're kind of inferring that there's a dependency between the previous update and the next update. Right? Someone modified it and someone consumed it, and there's a dependency. However, this is not always true. Very often, processing in the data world is incremental and, you know, it's updating one partition at a time. And, therefore, a job that updates a partition and another job that reads from a different partition don't depend on each other. Right? So it's about having more precision like this, having more precise tracking of dependencies.
You know, depending on the use cases, people care about those levels of precision. And that's part of the model. Right? The model is: if you don't have this level of precision, we'll be able to capture coarse-grained dependencies, like table to job to table. But if you have more precision, like, you know that this column actually depends on that column, you can add a facet that captures more precise dependencies at the column level, if you're not reading all the columns from a table, for example. Or if you care more about incremental processing and capturing the dependency at the partition level, that can also become a facet. Like, actually, when we were reading the input, we were filtering and reading only a subset of the data. And when we were writing to the dataset, we were appending a partition. Right? And this is the subset of the data that was produced by that run. So, depending on the integrations and their level of precision and how much they can introspect the job, they'll be able to extract more information. So typically, a SQL job is very explicit about everything it's doing, and the engine knows exactly how each column is derived from something.
And something that's more like pandas or Spark might have more opaque logic that would be harder to introspect. You know, not all jobs may have the same level of precision. But that's part of the flexibility of the model: to be able to get the precise metadata where you can.
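The column-level dependencies described above can ride along as an extra facet when the integration can introspect the job that precisely. The structure below is a hypothetical sketch of the idea, with made-up facet, dataset, and column names, not the exact spec:

```python
# Hedged sketch of a column-level lineage facet: each output column lists
# the input columns it was derived from. All names are illustrative.
output_dataset = {
    "name": "analytics.revenue_by_country",
    "facets": {
        "columnLineage": {
            "fields": {
                "revenue": {
                    "inputFields": [
                        {"dataset": "public.orders", "field": "amount"},
                    ]
                },
                "country": {
                    "inputFields": [
                        {"dataset": "public.customers", "field": "country_code"},
                    ]
                },
            }
        }
    },
}

# With this facet present, a consumer can answer "what feeds column X?"
# without re-parsing the SQL; without it, only table-level lineage is known.
deps = {
    col: [f["dataset"] + "." + f["field"] for f in info["inputFields"]]
    for col, info in output_dataset["facets"]["columnLineage"]["fields"].items()
}
print(deps)
```

A SQL integration could fill this in fully; a more opaque pandas or Spark job might omit the facet and fall back to coarse table-to-job-to-table edges, which is exactly the graceful degradation described above.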
[00:33:41] Unknown:
And then another interesting potential complication is you're mentioning, you know, everything is based around this job and the execution of the job. But what do you do when you're working in a streaming system? You know, how do you determine what is the start and end of that job?
[00:34:29] Unknown:
Yes. That's a very good point. So with OpenLineage, of course, people have streaming and batch processing jobs, and the goal is to cover all of those things. So far, we've made more progress on the batch processing side of things, but, definitely, streaming jobs are covered by this model. And there are different aspects in which you can cover that. So first, this notion of facets lets you add precise metadata. Right? Like, there are different notions of a dataset. The dataset could be a table in a warehouse. It could be a folder in an S3 bucket or in the HDFS distributed file system, or it could be a topic in a Kafka broker.
Each of those will have slightly different metadata. And same thing about the job. Right? A job could be a batch process, a Spark job, or a SQL query. Or it could be, you know, a streaming job: Flink or Kafka Streams or Storm or those different things. And for streaming jobs, you know, I know we like to think of streaming jobs as continuously running. But streaming jobs still get deployed at some point, started at some point, and then they're stopped, and then they're upgraded, and then they're started again. Right? So you still have this life cycle of a run. And so you'll have a streaming job, and it will have a version of the code in the metadata, and it will have a run starting at a certain point using some version of the code.
And then when you think of what version of the dataset it's starting from, right, like, if you read from a Kafka topic, well, for where you started reading from, the metadata will be the list of offsets: the partitions and the offsets in those partitions you started reading from. So you can very much model where you start reading for this particular run of that particular version of the streaming job. And same thing at the point where it stops: you can capture the metadata of what offset it stopped reading at, and, you know, capture this information of where it stops. So this generic model of the notion of jobs and runs and datasets is pretty stable, and it applies to streaming as well.
But, of course, the metadata you collect looks different. And there's this notion of facets. You know, like, in the Scala world people are used to defining traits, or in Java, having different interfaces. Right? It's about having the ability to add different facets of metadata to those different entities. And, of course, a Kafka topic has different metadata from a warehouse table, but the overall lineage notion very much applies. Right? And you'll be able to track, you know, a service writing to a Kafka topic, and then this being archived into an S3 bucket, and then being consumed, you know, by your Spark job, and maybe exported into a data warehouse, and then a SQL query runs on it. And it's really important to be able to track lineage across all those layers and be able to understand the metadata that's specific to each of those when understanding what's happening. So I know I've been using a lot of the batch processing use cases, you know, to explain how this works.
But, of course, the streaming ecosystem is also covered. And I guess there's more need in the batch world at the moment, which is why there's been more focus in this area. But there are very much people in the community who are asking those same questions and, like, figuring out how best to model those things for streaming pipelines and capture the entire lineage graph.
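The streaming case described above can be sketched the same way as batch: a deploy of a streaming job is a run, and the Kafka offsets it starts and stops at become version metadata on the input dataset. The `kafkaOffsets` facet name and all job and topic names below are assumptions for illustration, not the spec:

```python
import uuid

# Hedged sketch: one "run" of a streaming job modeled as a pair of events.
# Where the job started and stopped reading is captured as per-partition
# offsets in a made-up "kafkaOffsets" facet.
run_id = str(uuid.uuid4())

start_event = {
    "eventType": "START",
    "run": {"runId": run_id},
    "job": {"namespace": "streaming", "name": "enrich_clicks"},
    "inputs": [{
        "name": "kafka://broker/clicks",
        "facets": {"kafkaOffsets": {"0": 1000, "1": 980}},  # offsets at start
    }],
}

# Later the job is stopped (for example, to upgrade the code); a COMPLETE
# event records how far each partition was read during this run.
complete_event = {
    "eventType": "COMPLETE",
    "run": {"runId": run_id},
    "job": {"namespace": "streaming", "name": "enrich_clicks"},
    "inputs": [{
        "name": "kafka://broker/clicks",
        "facets": {"kafkaOffsets": {"0": 5400, "1": 5275}},  # offsets at stop
    }],
}

# Records consumed per partition during this version of the job:
start = start_event["inputs"][0]["facets"]["kafkaOffsets"]
stop = complete_event["inputs"][0]["facets"]["kafkaOffsets"]
consumed = {p: stop[p] - start[p] for p in start}
print(consumed)
```

The stable part is the job/run/dataset model; only the facet contents differ between a warehouse table and a Kafka topic.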
[00:37:59] Unknown:
In terms of that, what are some of the problem domains or, you know, types of data or specific industry verticals that are best suited to using OpenLineage and capturing this type of metadata? And what are some of the problems or use cases that you are explicitly leaving out of scope for OpenLineage?
[00:38:19] Unknown:
OpenLineage is specifically focusing on capturing lineage and metadata of running jobs. Right? There's something running. It's starting, it's ending, it reads from this and writes to that. So there are a lot of use cases. Operations, you know, making sure those jobs are reliable, show up on time, and the data is correct, is one use case. Another use case is compliance, you know, like, for example, privacy with GDPR and CCPA, and making sure that you know where your user private data is flowing. There's a bit of discovery: you know, what datasets exist, and how they're being used, or where they're coming from. There's some governance applicable as well, like making sure people use canonical datasets to derive information from, and they don't, like, go to the wrong version of the data.
And then for things that are not covered, I would say that OpenLineage is focusing on these running jobs. Right? There's also metadata that is external to those running jobs. Right? Like, people would care about defining somewhere what's the canonical source of truth for the country codes, right, or for the user IDs. Or, you know, who are the stakeholders for a given dataset? This typically doesn't quite fit in the OpenLineage integration, because OpenLineage focuses on collecting information from the automation, the automated jobs and all of those things, or ad hoc jobs as well, but not metadata that's external to that. However, I think this facet model applies very well. We actually have a lot of questions on the Slack around this. It's like, hey, how do I add to the model,
you know, who are the stakeholders for a dataset? And this is less applicable to OpenLineage, but it's more applicable to something like Marquez, for example. In Marquez, we care about adding all the metadata that is not directly linked to the job itself, but is more, like, external to that. So that's kind of the distinction.
[00:40:22] Unknown:
For people who already have maybe a data catalog or a metadata management system, or they've already got some set of jobs that they want to integrate to start generating the OpenLineage metadata, what's involved in actually evolving their systems to work well with the OpenLineage specification, either as producers of that metadata, or storing it and consuming it and integrating it with the rest of their ecosystem?
[00:40:46] Unknown:
Yeah. So working with the spec is fairly straightforward. Right? This is a JSON schema. So, basically, it's about producing JSON objects that represent your lineage, or consuming them, depending on which side you are on. And we provide built-in Java and Python clients, but it's a JSON schema spec, so you can use JSON schema or related OpenAPI projects to generate clients in other languages. Most languages are supported by JSON schema in general. And so adding integrations and exposing lineage is pretty straightforward. The core model is very simple on purpose. Right? And so there are two aspects. One is, if you already have a catalog and you've collected lineage and you're interested in exposing that to other tools that understand OpenLineage, you can use one of those clients to produce the metadata and forward this metadata to those systems.
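On the producer side, what an integration emits is essentially a pair of JSON run events. The sketch below shows the general shape under stated assumptions: the job, namespace, and dataset names are made up, and a real client would POST these to a lineage backend rather than print them:

```python
import json
import uuid
from datetime import datetime, timezone

# Hedged sketch of producer-side output: START and COMPLETE run events
# sharing a runId. Field layout follows the general shape described in
# the conversation; all concrete names are illustrative.
def run_event(event_type, run_id, inputs=(), outputs=()):
    return {
        "eventType": event_type,  # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my_pipeline", "name": "daily_orders"},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

run_id = str(uuid.uuid4())
start = run_event("START", run_id, inputs=["raw.orders"])
complete = run_event("COMPLETE", run_id, outputs=["public.orders"])

# A real integration would send these over HTTP; here we just serialize.
for event in (start, complete):
    print(json.dumps(event)[:80], "...")
```

Because the payload is plain JSON validated by a schema, any language that can serialize JSON can act as a producer, which is the point being made about generated clients.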
On the other hand, if you're interested in consuming OpenLineage, because you already have a data catalog but you want more lineage coverage, you can easily consume those events that will tell you, you know, this job is reading from this dataset and writing to this dataset, and fold that into your model by just consuming those events. Right? And so this is about understanding JSON events that contain this metadata. One of the advantages of this is to enable federation of tools. Right? So you can take care of this integration and collect lineage once, and really push that into your ecosystem. And then you can have various tools that use this metadata to focus on different use cases, whether it's privacy or operations, governance, and compliance, and things like that, which I think is very powerful. For example, talking to Egeria: the Egeria project is some kind of metadata hub. Right? So one feature they're working on is being able to consume OpenLineage as well as produce OpenLineage, so making it easy to synchronize metadata systems, to make sure, like, the view of lineage is the same in various systems that focus on different use cases.
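On the consuming side, a tool only needs to understand the JSON events to rebuild a lineage graph, regardless of which producers emitted them. A minimal sketch of that fold, with made-up job and dataset names and a simplified event shape:

```python
# Hedged sketch: fold run events from any producer into lineage edges
# (dataset -> job -> dataset). The event shape is simplified; names are
# illustrative.
events = [
    {"job": "ingest", "inputs": ["s3://raw/orders"], "outputs": ["wh.orders"]},
    {"job": "report", "inputs": ["wh.orders"], "outputs": ["wh.revenue"]},
]

edges = set()
for ev in events:
    for src in ev["inputs"]:
        for dst in ev["outputs"]:
            edges.add((src, ev["job"], dst))

# Walk upstream from a dataset to find everything it depends on.
def upstream(dataset, edges):
    deps = set()
    frontier = {dataset}
    while frontier:
        current = frontier.pop()
        for src, _job, dst in edges:
            if dst == current and src not in deps:
                deps.add(src)
                frontier.add(src)
    return deps

print(upstream("wh.revenue", edges))
```

This is the federation benefit in miniature: a privacy tool, an operations tool, and a catalog could each run a fold like this over the same stream of events without any producer-specific integration work.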
[00:42:58] Unknown:
Digging more into this federation aspect and the sort of downstream effects, I'm wondering how you see this actually playing out as the broader data community starts to get on board and adopt OpenLineage, and some of the powerful network effects that this will have.
[00:43:19] Unknown:
Initially, OpenLineage has a set of integrations. You know, there's a Spark integration that collects lineage from Spark. There's an Airflow integration that will also collect from the various SQL warehouse offerings. But the way this is happening, in talking to various projects, the goal is to push OpenLineage into each of the projects. Right? And this is the kind of thing that's slowly happening. The goal is not to have a central project with all the integrations to everything. The goal is for each project to be able to expose its own lineage as OpenLineage. And OpenLineage is just a spec. You know, it's like a Java interface if you're in the JVM.
As just a spec, it doesn't pull any dependencies into the project. Right? It's just having the ability to expose your lineage in a standard representation. And so then, at some point, it would become just turnkey in the entire data ecosystem. People will not have to care about, oh, how do I extract the lineage from everything? Right? Like, everything is going to expose lineage in a standard way, already built in. Right? Whatever project, whether they use Spark or Flink or a SQL warehouse, they'll have this lineage and metadata exposed in a standard way. And so we'll have better tools. Right? This is really an opportunity for the data ecosystem to have better tools that give you visibility across everything.
So, of course, you know, we are building tools for better observability of data pipelines. And other people care more about governance, you know, and privacy. And I know of people caring a lot about privacy by design. Right? How do we make sure that, by design, user private data never leaks into places it's not supposed to be, or isn't used for purposes people haven't consented to? Right? And really building that into the system and making it more observable will enable a lot of this. So, for the end vision, from a data operations perspective for me: I have this Maslow hierarchy of needs. Right? You know, this hierarchy of needs is, like, before you reach happiness, you need to have shelter, you need to have food, you need to be safe. And when you have those things, then you can care about reaching happiness and all those things. It's a bit the same with data. Right? Like, first, the data needs to show up, needs to show up on time. It needs to be correct.
And once you have all those things, then you can actually get value out of it. Right? So people joke about how much time is spent cleaning up the data or, like, you know, debugging broken pipelines, or making sure we can trust the data. It's really about building this trust in data: knowing that you just know, and it's observable, and it can be proved that everything is working accordingly. Whether it's from an operations perspective, being able to trust the data, or whether it's from, like, a privacy-by-design perspective, making sure you do the right thing with your user private data, or being able to prove that you're using the correct source of truth of data to produce certain metrics.
A lot of that. Right? It's really making everything work better and be reliable. And then you can prove it. Right? There's not this nagging question of, like, is this dashboard reflecting what we think it's reflecting, or
[00:46:40] Unknown:
something like that? The dreaded sentence of "this doesn't look right."
[00:46:44] Unknown:
Yes. And maybe it is right, and something else is the problem. And maybe it's not right. But if you don't know, it's problematic. Right? You can't make a decision.
[00:46:54] Unknown:
And so in terms of your experience of working through this OpenLineage spec and working with all these other projects and organizations, and integrating this into the work you're doing at Datakin and Marquez, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:10] Unknown:
I'm an engineer. I'm a software engineer. Right? So I relate to this. But I think OpenLineage is more of a human problem than a technical one. Right? When you look at the way people have been building lineage solutions, there are a lot of software engineers, like myself, who would rather reverse engineer what others have done than, you know, have this conversation about how we would standardize and, like, make it easy to consume for everyone. And so there are a lot of systems that are very complex, because people look at it like, oh, I can look at the internals of that tool, and I can understand how it works, and I can extract something without having to ask anybody to change anything. Right?
And that works. Right? And there are a bunch of solutions that exist that solve this particular problem: we will solve the complexity of all the things that exist for you by understanding the intricate details of every one of those things. And the problem with that is it's very brittle, it's very costly, it's very time consuming, and it gets obsoleted, right, as technologies evolve. And, like, whoever became the lineage solution for Hadoop has now invested a lot of time in something that's not used as much anymore. Right? And so I think it's a lot of a human problem of, like, how do we make those conversations happen? Right? And the way the OpenLineage spec is designed is to facilitate those conversations. Right? Let's avoid the giant monolithic spec, where it's a conversation that will never converge. Let's have a lot of small, focused conversations on the things we care about.
And so the people who care about a particular aspect, let's say it's table schemas or column-level lineage, can help that particular area of the problem move forward. Right? And so I think one of the learnings is, like, you know, when people talk, good things happen. And a lot of OpenLineage is just facilitating, making those discussions happen. And, you know, when I initially reached out to this group of people, a lot of them were, like, hey, how come this doesn't exist yet? All we needed was to plant the seed, and then it would happen. You know, there's this interesting children's story, which is the stone soup.
You know, someone comes to a village with a pot, and they stir hot water. You know? They just heat water, and they put a large stone in it. And they say, look, I'm making a stone soup, but everybody's welcome to add their own ingredient to it and make it special. Right? So basically, they're making a soup from nothing, but they're creating this focal point. It's like, look, someone is driving the collaboration. And like the stone soup, the more people come and put their contribution to it by adding an ingredient, the more valuable the thing becomes for everyone. Right? So it's really, when you think about open source, about how we create good incentives.
And this is this kind of perfect project where every contribution, no matter how small, is going to make the project better for everyone who's contributing to it. So it's a really good virtuous circle of, like, how do we gain momentum in this project? And everybody's like, yes, this is great, we want to contribute to this, because we are part of this effort now, it's valuable to us, and it's more valuable for everyone. And I think this reflects some of the learnings from open source, you know, over the past 15 years. I guess it reflects a bit some of the projects I've been pushing or driving. It's kind of, hey, how do we standardize this? How do we make this reusable and enable the ecosystem?
[00:50:53] Unknown:
Absolutely. And so as you continue to go down this road of building and growing the OpenLineage project, what are some of the things that you have planned for the near to medium term future? And for anybody who's listening, what are some of the ways that they can be most helpful in making this effort a success?
[00:51:09] Unknown:
There are a couple of directions. Right? From a technical perspective, there's evolving the spec. Right? Like, contributing to the facets. People can just use it and make their own custom facets, or contribute them. There's a proposal mechanism on GitHub that lets you submit a proposal and say, like, hey, I think this particular use case is not covered, we need to formalize this. You can see there are several conversations happening in GitHub issues on the project. And so one aspect is helping standardize more facets of the metadata. Another aspect is increasing coverage, like, how do we adapt OpenLineage? So, you know, there's an existing effort in Spark, but people are welcome to join.
And how do we increase the coverage with other projects like Flink, like Beam, like the streaming efforts? The other area is increasing coverage in systems that consume OpenLineage and are able to understand it, so that we can help with this federation aspect. So that's how people can contribute and join the conversation. And it's really a community driven spec. Right? The goal is really to enable people to contribute and drive the areas they care about. And as part of that, we're in the process of submitting it officially to the LF AI & Data Foundation. You know, it's a sub-foundation of the Linux Foundation that cares about data projects.
Marquez is part of this foundation already. And OpenLineage, being fairly recent, is not officially part of it yet, so it's being submitted soon. And the goal of being part of a foundation is really this testament that, look, this project is not owned by anyone in particular. Right? This is a community driven project. There's a clear governance to it. And the goal is to standardize lineage and make exchanging lineage and metadata seamless across ecosystems. Like, we, as a data community, are making this happen. So part of it, in the immediate future, is having a clear governance.
And when you contribute to a project, I think I'm very opinionated about this: if you have an open source project, you know, there's a license aspect. And, of course, OpenLineage is Apache 2 licensed, so it means you can do what you want with it. There's no restriction. The only restriction is you have to give credit. Right? You can't claim you did something if you're reusing an Apache project. But then the other aspect is to have a clear governance, and, you know, being under a foundation means the license is not going to change. Right? It's always going to be a community driven project. And, you know, there's some discussion that you see happening in some projects of, like, hey, should we relicense this and protect it from someone, like, using this project?
This is not the goal here. The goal is, yes, this is a project that's open source. It's for the community. It's always going to be usable by anyone, and that's why contributing it to a foundation is really important.
[00:54:13] Unknown:
Well, for anybody who wants to get involved and get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:29] Unknown:
So I'm a little biased in that space, because I'm building a tool that I think is sorely missed. And, you know, I care a lot about data pipelines, observability, and how to make sure those pipelines are reliable. Right? They're running on time, they're producing correct data, and we can very quickly identify what the problems are. Whether it's data quality, or someone making a change and not understanding the downstream impact of it. Or, you know, as part of this, being able to analyze this lineage information and really being able to quickly understand the root cause when there's a problem.
But even more than that, it's preventing problems altogether by being able to identify the consequences of a change before it gets applied. Right? Like, oh, if you're changing the schema of this dataset, it's going to impact such and such jobs downstream. Right? Or this thing is slower than before, the dataset is not showing up on time anymore. How do we quickly troubleshoot and figure out what changed upstream to make things slower? Or maybe it's just a change in the data or volume, something like that. And so I think this is very lacking. Right? A lot of time is spent, like, figuring out why things are not working quite right. And I think that's a big gap. Yeah, like I said, I'm a bit biased, because that's kind of what we're building at Datakin.
I think this is extremely important, and I expect that in the future, no data engineer or data scientist will accept a job where they don't have proper tooling to understand how their particular contribution
[00:56:07] Unknown:
works and is impacted by the wide ecosystem of consumed and produced datasets around them. Well, thank you very much for taking the time today to join me and share the work that you're doing on OpenLineage, and to touch a little bit on the work you're doing at Datakin. It's a very interesting and valuable project, and I appreciate you getting the ball rolling and helping us all by getting this effort out and getting it working. So I'm definitely going to be keeping a close eye on OpenLineage and, you know, hopefully adding my own contributions to making it more a part of the ecosystem. So thank you again for your time and energy on that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. Always a pleasure, and I really enjoy this podcast. Thank you. Bye bye. Thank you. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Julien Le Dem: Background and Experience
Marquez Project and Data Observability
Introduction to Open Lineage
Data Lineage vs. Metadata Management
Standardization Efforts and Core Elements
Community Involvement and Contributions
Core Model and Custom Facets
Challenges in Storing and Analyzing Lineage Data
Real-World Applications and Edge Cases
Handling Streaming Data and Job Lifecycle
Use Cases and Scope of Open Lineage
Integrating Open Lineage with Existing Systems
Future Directions and Federation of Tools
Lessons Learned and Community Collaboration
How to Get Involved and Contribute
Biggest Gaps in Data Management Tooling
Closing Remarks and Thank You