Addressing The Challenges Of Component Integration In Data Platform Architectures

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialise, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.

Whether it's real time dashboarding and analytics, personalization and segmentation,

or automation and alerting, Materialise gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize

today to get 2 weeks free.

Introducing RudderStack Profiles.

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team.

You specify the customer traits, then profiles runs the joints and computations for you to create complete customer profiles.

Get all of the details and try the new product today at data engineering podcast.com/rudderstack.

Your host is Tobias Macy, and today I'm going to be sharing an update on my own experience

of the journey of building a data platform from scratch.

And today I'm in particular going to be focusing on the challenges

of integrating the disparate tools that are required to build

able to define certain concerns

of that platform.

So if this is your first time listening to the show, I've been running this show for

the better part of 6 years now, maybe close to 7.

I have been working in technology

for over a decade and I have been spending the past year and a half coming up on maybe 2 years

building a data platform from scratch to get all the buzzwords out of the way using a

cloud first data lake house architecture

focusing on DBT for transformation,

air bite for extract and load, Dagster for

the full integration,

and using Trino as the query engine for data on top of s 3. So

recognizing that that puts me in the minority of most people who are building a data platform, particularly if they have small teams

that has put me in the position of needing

to figure out some of the

interfaces

for integration

For a lot of people where they're just getting started with building out the data platform, they're probably going to be going with the managed platform route or picking from a selection of different vendors. The

canonical 1 for the so called modern data stack is probably

Fivetran, Snowflake, and DBT

with maybe Looker as your business intelligence layer.

I'm not gonna spend a lot of time in this episode digging into the motivations

for why I selected those different components of the stack because I've covered it in other episodes which I will link to in the show notes.

But it has definitely

led to a certain amount of friction as I try to

manage some of the different integrations out of the box. Although the story for that particular set of technologies has been steadily getting better as time goes by.

Today, I really wanna talk about where I am in my journey currently

of I have a core set of functioning capabilities.

I can ingest data. I can transform it. I can query it.

But now I'm getting to the point of needing to be able to

onboard more people, provide a more seamless user experience,

being able to manage some of the different means of data sharing or data delivery where maybe not everybody who's going to be accessing the data

lives within the bounds of my team or my department.

And there are definitely

data sharing capabilities

that are part of some of the different platforms, most notable being probably big query and snowflake with the ways that they manage data sharing,

But there are a number of different ways of approaching that

given that I am in the world of building a lake house architecture.

I've got my data in s 3. I'm using the iceberg table format. So all of the data is already

representable

as a table.

And

so then the question is, okay, for somebody who just wants to be able to access the data, what's the best way to deliver it? Do I need to provide 1 off jobs to generate a CSV and send it to them via email? Do I need to give an s 3 bucket to be able to load things from? It all depends on what the level of sophistication is of the people who are going to be consuming the data. Maybe it's just a

dashboard format and people just need to be able to look at the data.

1 of the challenges

of that handoff

is also

when you need to be able to

be considerate of how that data is going to be used

after you present it because maybe

you want to be able to give a visual representation or give away for somebody to access the data, but you don't want them to then exfiltrate it into

another system via means of a c s v export for instance.

And that's where you start getting into questions of governance and

who has access to what being able to audit data access.

But

harking back to 1 of the episodes I did a while ago on the idea of shadow IT in the data industry,

the best way to prevent people from

taking data out of the context in which you want it to be presented

and bringing it into other tools

is to

reduce any friction

or pain that they're experiencing

accessing the data in the way that you have presented to them. Because

if

your options are the best option and the most accessible option for the people who are

viewing that data, then they're not going to want to bringing it out of that system because you were giving them the best experience.

And so that's where I am right now of figuring out what is that best experience for everybody,

what are the requirements, how do I then manage that. And

a lot of the

complexity

comes in with the elements of

interoperability platform deliberately

the overall platform.

And I'm using the term platform deliberately because I am aiming for

a holistic experience for end users versus just a number of point solutions where somebody can maybe plug something in and they do their thing and then somebody else can plug something in and do their own thing. I want to figure out what is the minimal set of interfaces

that I need to build and support to be able to address the

widest variety of needs while still being extensible for the case where somebody has a bespoke requirement

that I need to be able to fulfill? How do I make sure that

that doesn't add an undue amount of maintenance burden on myself and my team

while still being able to deliver on that request.

In general, the way that

that degree of interoperability

is managed is through adoption

of open standards that everybody has agreed upon. So SQL is probably the

longest lived 1 of those open standards. Although, to be fair, there are multiple different varieties of it, but at its core, SQL is understandable. So if you have a means of

using SQL to query data, that is going to

make it easy for a lot of different tools and people to be able to do their exploration and self

serve. DBT has

capitalized on that as a means of being able to build their products to be able to say, okay, the majority of structured data sources are going to be addressable by SQL. So we're going to build a tool

that allows people to build

a full engineering

development and delivery flow for that SQL data,

manage the transformations through our tool, and then add a lot of nicety around it in terms of the lineage and data documentation, etcetera.

Because of the fact that they have invested so much in the ease of use of the tool as well as doing a lot of advocacy in the community

to drive adoption

that has led to a number of

different other tools integrating with them. So they have become a de facto interface

for managing transformations

in a

warehouse or warehouse like context. So there are tools such

as preset that are

building integrations

into

d b t as far as the

data types and the metadata that they generate.

There are entire products that are built on top of the metadata that d b t generates,

light dash is 1 of those.

So that has given DBT

as a tool and as a company

a lot of momentum, a lot of inertia, so it makes it more difficult for

other people who maybe are targeting a similar use case, but want to add a

step change to some of the capabilities around that tool

to be able to actually break in and overtake that market. So the most notable 1 that I'm aware of so far is SQL Mesh

who are actually adding a compatibility layer with DBT for being able to

run SQL mesh on top of a DBT project where you can actually execute the transformations without having to do any code changes,

which solves for the

adoption step of I just wanna be able to try out this tool,

but then there is still the challenge that they're going to have to overcome of

building up that set of integrations with the broader ecosystem

that DBT

has benefited from.

So that makes that road to adoption and the road to being a viable competitor

a lot longer.

Another tool that has benefited from this status of being a de facto standard and a reference implementation

is airflow, where they have been around for a long enough time, and they have been adopted by a large enough user base that

if you, as a new tool vendor,

build your initial integration with airflow,

then you have a large enough addressable market that it is worth the time of building that integration,

but it's not necessarily worth it

to build that same integration for

DAXTER or Prefect

or

1 of the other orchestration systems that are out there.

So

as you are building your own platforms and doing this tool selection, it's worth considering, am I choosing something

that is going to be able to benefit from the existing weight of the community integrations that are available for it.

If not, then is the value that it is providing worth that additional complexity of me having to go out and build those integrations or work with other vendors to build those integrations.

Developing event driven pipelines is going to be a lot easier.

Meet functions.

Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events on the fly

in a serverless manner using AWS Lambda syntax without boilerplate.

It includes orchestration, error handling, and infrastructure

almost any language, including Go, Python, JavaScript, dot net, Java, SQL, and more.

Go to data engineering podcast dotcom/Memphis

today

to

get

started.

And then

another element of that integration question is

the challenge of a lot of the different tools in the stack will want to be the single source of truth for certain aspects

of your platform.

1 of the ones that I've been dealing with recently is the question of access control and role definition

and security and auditability of the data. So

because I am building with a disaggregated

database engine where the storage and the query access are separated,

that gives 2 different locations where

those definitions

can be stored. So

1 camp will say, well, the because of the fact that the storage layer

is the lowest level

and will have potentially multiple different points of access where maybe I'm using Trino to query it by SQL, but I may also

just be accessing the data in s 3 directly using some Python

tool or maybe I'm using something like Spark or Flink to do other processing approaches to it beyond just SQL.

So

all of that access information needs to live in the storage layer. So that's the approach that companies like Tabula are taking with the iceberg table format, where

as long as you have the storage layer secured, then it doesn't matter what all the other layers on top of it might have to say about access control because it's going to be enforced at the table level.

And then you go the next level up, engines like Trino,

or the Galaxy platform

from Starburst,

say, we want to own the access control because

we are going to have more visibility into the specifics of the queries, and we can have things like rollable filtering in the role definitions.

So we should be the ones to own that

role based access control information

or attribute based access control.

And then

that's definitely another viable option, but then you go another layer up the stack into business intelligence,

and maybe

there's even more granularity available because you can say, oh, well, this person has access to

this particular dataset with this bay low level filtering and they can view these

visualizations on that data, but maybe they can't write their own queries against that data.

So there's that challenge of, okay, well,

do I have to define the roles in 3 different places?

Do I have to have slightly different roles across all those 3 places? How do I align them to be able to say, well, this user in the business intelligence is the same as this user, and the query engine is the same as this user in the storage layer and make sure that they are getting a cohesive experience across those boundaries.

So

particularly when you have

tool systems that maybe don't want to or that are maybe disincentivized

to

do that

propagation of role information. So for instance, if I use tabular and I say I'm going to do define a role that will grant read access to this subset of tables,

but no other visibility

of the rest of the dataset,

how then do I reflect that information to

the query layer of who that user is to be able to enforce those

permissions, and then all the way up to the business intelligence layer to say,

from the storage layer, these are the permissions that you have in the UI. So

that brings in the need to have some manner of

single sign on and single source of identity for people across all of those

boundaries.

Beyond the question of permissions, there's another set of information

that is

disjoint and wants to be owned by different components within the stack. And that is the question of data flow,

data processing,

data lineage,

where each tool maybe has a certain view of what the lineage graph looks like or what the processing looks like.

But you don't necessarily have the complete end to end view of a piece of data from

maybe where it lands in an application database all the way through to where it's being presented in a business intelligence dashboard or incorporated as a feature for machine learning model training workflow.

So for instance, DBT has the table level lineage of the transformations that it's providing.

Tools like airflow and Dagster

have the view of the lineage of all of the tasks that they are responsible for executing,

but they don't necessarily

have information about

out of band transformations

that are happening by some analysts who are running ad hoc queries in the data warehouse, or they don't they maybe don't have visibility into

the data as it's landing in a in an application database, and they only see the data once it lands in the warehouse or maybe they can see

the,

data integration step with an Airbyte or a Fivetran to say, okay,

I know this table in the application database feeds into this table in the warehouse, and then these DBT flows happen, but maybe it loses sight of the data dashboards that are being generated.

And so, again, this is the question of

open protocols,

interoperability.

So tools like open lineage are designed to help address that

where you fire events that will can be then constituted into a more cohesive lineage graph.

And then you also have systems such as metadata platforms

that are designed to be

more holistic views of the entire ecosystem and incorporate things like data discovery, data governance,

which gives you a single place to be able to view all that information, but then you're back to that question of integration of, okay, well, this can store and convey all of this information,

but how do I get all this information into it? So each of those tools are going to have different means of

being able to push or pull data,

but you have to make sure

as the platform

designer and operator

that those data flows are also happening. So it's an additional set of DAGs that you need to make sure are running. You need to make sure they're reliable.

So it it's useful, but it's also an additional burden.

So these are all things that I've been dealing with recently. And then

in the metadata

catalog situation,

even if you do manage to feed all of your data into that,

it is useful

as a

means of discovery or a means of being able to,

keep tabs on what's happening,

but then it also feeds back into, okay, well, if I want to use this as my single source of truth,

how then do I propagate that truth back into other systems? And that's where you start to get into questions of things like active metadata,

and then you have another set of integrations and another

another direction of integrations where if I say, okay, I have my metadata catalog,

this is my source of truth for role information and who can access what. Now I need to be able to push that back down into the storage layer and into the query engine and into the business intelligence dashboard,

and I need to make sure that all of those integrations are reliable and that there are

appropriate mappings between the different concepts throughout the different systems.

Data lakes are notoriously

complex.

For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics.

Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises.

And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hoodie, so you always maintain ownership of your data.

Want to see Starburst in action? Go to data engineering podcast.com/starburst

and get $500

in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

So this is definitely 1 of the benefits that fully vertically integrated platforms have of you don't have to fight with all of those different layers of integration.

But

the problem there is that you have to rely on the integrations that have been built and maybe you're constrained by what that vertically integrated platform offers.

You don't necessarily have the means of being able to extend it into

areas that the platform developers

haven't had the time to implement or

haven't been exposed to as a necessity.

So it's

the the continual

story of

software design, software development of how do we build these systems that are extensible and can be integrated,

but also make sure that the product that you're building

doesn't just get crushed under the weight of having to maintain all of these different

point solutions.

I think that we're definitely

iterating towards these community standards. I think that tools like Open Lineage, I think that the work that the open metadata folks are doing with their

schema first approach to metadata is interesting.

I think that the work that DBT has done to become that de facto standard for this is how your transformations are represented, so other tools can be able to build on top of that are all valuable. So it's great to see the direction that that has that the community has taken on all of these fronts, but I do think that we are definitely still

not to the point where

we have a lot of the answers

fully baked. I think that

I think that everybody who is investing in this ecosystem, everybody who is building these tools and using these tools and giving feedback to the tool vendors

is helping to bring us to a better place.

But,

as somebody who is trying to

integrate so many different pieces and try to figure out what does that holistic platform look like, how do I build it in a way that I can maintain it with a small team but also be able to have the flexibility

required to address

a wide

set of audiences.

It's definitely still a challenge, but 1 that I've been enjoying having the opportunity

to explore and invest

in. So now as I have been iterating on these challenges and thinking through how best to build that

holistic platform and something that is going to be

enjoyable and usable by a number of different people, the next

main architectural component that I've been

starting to work towards is that metadata platform so that I can have that more cross cutting view, so that I can improve the data discovery story for people who aren't part of the engineering team, who just wanna see what data do you have, where is it, how do I access it. So that's where I'm going to be doing some of my next work of picking the metadata platform, getting it integrated, getting all of the data flows propagated to that tool so that we can see

how everything is flowing,

and then being able to

start

integrating single sign on as a means of identity management across the different layers and then being able to say, okay.

You came through the metadata platform to do data discovery. Now you say, okay. Here's the dataset that I want to,

explore, or here's the chart that I want to view, then being able to have

a simple means of clicking a button and jumping them into the experience that they're requesting, for instance, or being able to say, okay. I need to query this database,

and then giving them the prefilled set of credentials or the prefilled client connection to be able to run

those queries with their tool of choice, whether that be something like a tableau or

a SQL command line, etcetera.

And as I have been doing this work, the most interesting or innovative or unexpected

aspects

have definitely been these

integration boundaries of saying, okay. This is the tool that I have. This is the other tool that I have. I would like to be able use them together in this manner. Now, how do I go about and do that? So

I am definitely happy that the,

Python language has become 1 of the most widely

adopted

ecosystems

for data engineering because it does simplify some of that work where you can pretty easily assume that there's going to be a Python library that does at least 80% of what you're trying to do. So in, for instance, the open metadata platform, they have a Python library for

doing that metadata ingestion. So even if they don't have an out of the box solution for ingesting metadata from the tool that you're using,

they do have a Python client that you can provide metadata to them

in order to be able

to propagate information from a system that they haven't already addressed. So I do think that that is

a another good trend in the ecosystem of

providing

a good set of software clients to be able to integrate with

their tool

even if it's from a system that they themselves didn't already plan to integrate with. So

while it does add a bit of extra burden to the people who are trying to use those systems, there is at least a

path to success for it. I'd also say that that's probably the most challenging lesson that I've learned as well of figuring out

what are the points of integration that are worth investing in and what are the ones that

I don't have to invest in right now, and maybe I can wait to see if the ecosystem around that tool grows up. So SQL mesh is a tool that I've been keeping an eye on.

When I first came across it, it didn't have support for Trino, so that was an obvious no for being able to use it. Now they have it, but they don't yet have an integration with DAXTER.

And then also, as I was pointing to earlier, the existing weight of integration with DBT gives me a bit of pause of okay. Even if I do get it working with Trino and Dagster,

what about all the

other prebuilt integrations

that I am

not going to be able to take advantage of because I have decided to use a newer tool that hasn't been as widely adopted and integrated. So

definitely a challenging aspect of that as well.

If you have found my musings useful or informative or if they have inspired something that you would like to discuss further, I'm definitely always happy

to take suggestions

or bring people on the show or,

take some feedback. So for anybody who wants

to get in touch with me, I'll add my contact information to the show notes, but the best way is

on data engineering podcast.com

that has links to how you can find me.

And

for the final question of what I see as being the biggest gap in the tooling or technology for data management today, I definitely think it's that single source of truth for

identity and access across the data stack of being able to figure out how do I manage

permissions and roles across a variety of tools without having to

build

different integrations

every single time.

So definitely look forward to seeing more investment in that. Maybe even using something like open policy agent from the cloud native ecosystem.

So

definitely,

definitely happy to continue investigating that as well. So thank you for taking the time to listen to me. I hope it has been, if not informative, at least entertaining. I'm very thankful

for being able to run this show

and bring all of these ideas to everybody who listens. So

in honor of this being the Thanksgiving week, just wanted to share that gratitude

to everybody who takes the time out of their life to pay attention to myself and the people I bring on the show. So thank you again, and I hope you all have a good rest of your day.

Thank you for listening. Don't forget to check out our other shows, podcast dot in it, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning podcast,

which helps you go from idea to production with machine learning.

Visit the site at dataengineeringpodcast.com

Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it.

Email hosts at dataengineeringpodcast.com

with your

story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links