The View From The Lakehouse Of Architectural Patterns For Your Data Platform

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Atlan is the metadata hub for your data ecosystem.

Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities.

Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey. And today, I'm interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures, from the modern data stack to data mesh and beyond. So, Colleen, can you start by introducing yourself?

Absolutely. Thanks for having me. I'm Colleen Tartow from Starburst. We are based on the open source platform, Trino, and our software allows access to data in pretty much any platform or source. So you can do your analytics on data directly at the source, either with SQL or through an integration with your favorite analytics tool. I personally run the enterprise engineering organization, and I've been interested in enterprise scale data management and architecture for quite a long time because it's a really interesting space with some fun problems to solve and cool ideas like data mesh to think about. And I'm really pleased to be here today and looking forward to a good conversation.

And can you share how you first got involved in the area of data management?

Yeah. I kind of fell into it. I really love hard problems, and I love data and math and numbers. And data management, which in my mind is how to organize both the data and the people around it, is a really interesting, complex problem. And so after I left graduate school and got involved in data software, it was the early 2010s, when big data was the hot topic. And I found it really interesting that the focus was always on moving data around. And then, eventually, the people in the data world moved on and started to get into data science and analytics, which, you know, that's actually how you get the value out of the data, and it's really incredibly important to the business.

And I was working at an enterprise ETL software company at the time, and we kept seeing these huge challenges around organizing, accessing, securing, and then analyzing massive datasets. And then I moved on, and I was working in analytics and data science because that's also really fascinating. But I kept coming back to the data engineering and the data management side of things. Because without that, you can't do the analytics, and you can't do data science and get the value out of the data. And you need that solid strategy. So I always find that really interesting.

When I was preparing for this interview, I noticed that you also have a background in astrophysics, which is kind of hilarious given that you now work for a company called Starburst. You recently released a product called Starburst Galaxy, which is the actual astrophysical phenomenon that generates stars. And so I'm wondering if you can, in your expert opinion, explain how the metaphor of a starburst in a starburst galaxy maps to the technical platform that you're managing and building, and some of the interesting metaphors and parallels that go between them.

No. And it's funny because when I was initially speaking to Starburst, I kinda laughed and said to my husband, it would be really funny if I worked there because, you know, that's what my PhD is in. We just kinda giggled about it. But now here I am two years later working here, and it's great.

So starburst galaxies are galaxies that are undergoing a period of really intense star formation, and it's usually caused by, like, a gravitational encounter with another galaxy. And then those stars age several billion years, and, eventually, they all explode. And it creates this coordinated burst of luminosity from the stars, and all this energy comes away from the stars. And what's interesting about these is that they're exceptionally bright, but you can also predict how bright they're going to be. So if you know how bright they intrinsically are and you know how bright you measure them to be, you can say a lot about the universe that exists between you and the starburst.

And so you can study them from really far away too, because they're so bright. And so when you study things that are far away in space, you're actually studying them back in time because light has a finite speed. And so, you know, for an object that's one light year away, its light takes a year to get to you. So when you see it, you're seeing it how it was a year ago. So the farther away you look in space, the farther back in time you're looking, which is really cool.

And so all that said, starburst galaxies are these beacons of light in the universe. And they're interesting both internally and in the context of providing information about the universe around them, because we can study that light that they're emitting and absorbing and bending with their gravity.

So to me, there's an obvious parallel with Starburst, the data company. Our technology allows users to use whatever format of analysis works for them, whether it's SQL or a BI tool or Python or whatnot. And with that capability, we can shed light on new insights faster and easier than they'd be able to if they had to follow the legacy paradigms of moving data around and architecting it specifically for a single use case, which presupposes that you know exactly what you're looking for.

So the cool thing to me about both starburst galaxies and Starburst data technology is that they're interesting on their own, but when you apply them to the universe around them, they provide these additional insights and, like, a faster, more convenient way to get information about their surroundings. So maybe I'm pushing that metaphor a little far, but I do think it's there.

Yeah. It's definitely great.

And so in terms of the overall trends of data architecture and some of the ways that things like Starburst and Trino are able to influence and, in some ways, circumvent architectural constructs and constraints, what do you see as the dominant factors that typically influence a team's approach to data architecture and design given the current ecosystem of technologies and systems and objectives that are in play?

It really depends on the starting point in a lot of ways. Right? So it depends where you are and how, what I call, data mature you are. So an older legacy system that's undergoing, like, a multiyear strategic digital transformation is gonna look a heck of a lot different than a cloud native startup that's just getting off the ground.

And like I said, there's this concept of data maturity that I like to think about, and it's a measure of how much the people, processes, and technologies are data forward at an organization. And so what's interesting is that it's not directly correlated to the size or the age of a company, or the technology platforms that they're using, or their budgets, or the number of people, or the overall strategy of the company, but rather it's a combination of all of these things. And it really says, how much is this organization enabled to use its data to drive forward its vision and make strategic decisions based on data?

So when an organization is making a decision on how to approach or define a strategy, there are all these factors that end up going into it. But what I think is really important is that there's a focus on the business value that will be attained with the data. So you start with the question of how are we gonna make this business more successful through data, and then you get into the details and track that back to the actual data needs, making sure that you're leaving room for innovation around that as well. And then from there, you can start to evaluate where you are and design an architecture and choose technologies and a strategy that helps you evolve your data environment into a more mature stance.

As far as the current patterns of the, quote, unquote, modern data stack, which is still somewhat of a nebulous term, pun unintended, but very apt, and also the advent of data mesh, and, to some extent, the collision of those two principles, what do you see as some of the points of confusion or opportunity that exist for people who are either evolving their existing data stack or coming into a greenfield to adopt those patterns and paradigms, and maybe some of the potential pitfalls that they need to be aware of as they go down that journey?

It's a really good point that the modern data stack is incredibly nebulous. You know? I mean, there are websites dedicated to figuring out what it means, and every vendor has their own definition of it. And it's all over the data zeitgeist these days. Right? And so is data mesh for that matter. And they kind of, in some ways, in my mind, come in at opposite angles. Right? So let's start with the modern data stack. Right? And I will talk for hours about this, so definitely cut me off if you need to.

So with the modern data stack, it's the idea of taking data from a source, copying it to some sort of centralized, data focused storage, and then using analytics tools to gain insight from that data. And when you describe it like that, it sounds beautifully simple. But like most things in life, it's not that simple. And so in reality, you really have to curate that data. And that's the challenge: where do you transform it, and how do you end up handling issues around things like latency, complexity of pipelines, and the vendor lock in that you end up getting? Because, you know, as you're trying to evolve your stack, you're working with partners, but you end up getting locked into certain technologies that can get very expensive as you get more mature and more advanced.

And so a key point that I think about a lot is that the modern data stack is actually not that modern. Right? 40 years ago, folks were doing the exact same thing. They were taking data from their mainframes and putting it into Teradata and then using Cognos on it. Right? So it's the same idea about curation, and there's nothing super new to that. Right? And so the main improvements that have happened are the separation of storage and compute and cloud architecture, and you've got these managed SaaS platforms and tools, which is great. But, honestly, it's largely the same paradigm we've always had, with the centralized target data store that's driving that architecture and that organizational structure around it. And, you know, I think the key is also understanding that there's a people story there too.

And so all of that said, you know, I think the modern data stack is really wonderful when you're starting up a data story and getting started off on your journey of data maturity. Right? There are really good benefits in that. It can be quick to spin up. I mean, you can go from zero to having that modern data stack in a day. And if you think outside the box and you focus on the ultimate goal of getting business value and data insights from your data, there are some really modern improvements that you can make on that that will accelerate the speed to value for that data.

So for example, you know, you could be cloud native these days, and it's really quick to spin up that stack where you could use, you know, S3 for storage and Starburst Galaxy for curation, and then use your favorite BI tool layered on top for analytics. And in a short amount of time, you're up and running and producing value, and that is really cool. And so that's sort of where I think the modern data stack is going, and it's allowing organizations to scale and mature and grow as much as they need to without locking them into specific architectures.

Then for data mesh, on the other end of the spectrum, right, data mesh really doesn't make sense at the scale of a single modern data stack. Right? It's more for large and complex enterprise companies, like a telecom or a financial services company. Something where you've got regulatory complexity on top of your physical complexity, your organizational complexity, your data complexity. And that's obviously not going to be an environment where you can spin up a quick modern data stack and call it a day. So there are typically completely independent business units with independent data strategies and architectures, structures, data.

And in these cases, I think the idea of bringing your data together into a centralized store, the way you do with that modern data stack, just isn't gonna fly. You know, organizations have tried this for years to have that single source of truth for data, and it was just never working. It's never truly a success. And that's what leads to the idea of the data mesh, that centralized data architectures and organizations lead to the same challenges over and over and over again. And this is why digital transformations fail, and that's why we need to do better so that we can get business value out of the data at scale. So that's where data mesh comes in, focusing on both the organizational and architectural sides of a decentralized data strategy.

In terms of the technical underpinnings of these different platforms, a lot of the conversation around the modern data stack has been centering around the different data warehouse vendors. Snowflake has definitely been getting a disproportionate amount of that focus. And as somebody who is working on and building a system that can act as one of those central clearing houses of information, in the form of Trino, and building out the modern lakehouse paradigm where you're getting the benefits of the warehouse with the scale of the lake, I'm wondering how you think about the staying power of the current formulation of the modern data stack versus the overall principles that it's encompassing, and some of the ways that things like data virtualization, as offered by Trino, etcetera, are able to facilitate some of the concepts that are core and central to the data mesh paradigm, and some of the ways that those two can be in synergy with each other versus at odds with each other.

Yeah. And I think given that they're sort of coming in from different angles, where the modern data stack is more nimble and a faster solution, and data mesh on the other hand is a journey and an evolution of a strategy, you know, I think what's really interesting to me is where these two things intersect. Like, how can we be more intentional about transitioning from the modern data stack driven world, where we're focusing on speed to value, and then maturing thoughtfully into that decentralized and data product driven world. Right?

And so I think the lakehouse can be a key part of that story too, because the real benefit of the lakehouse is that it's providing the functionality of that warehouse, including that user experience and visibility of the data to the end users, with the scale and the low and nearly linear cost per gig of a data lake. Right? So it's really about being intentional about where you're applying the business logic to the data. Right? Is it coming in initially on writing, or is it coming in on reading?

And so there's sort of this idea of the modern data stack where, if you develop a modern data stack and you end up getting a larger and larger modern data stack, you sort of end up with a data lake anyway, because you typically do have a staging area before data gets loaded into Snowflake or whatever cloud data warehouse you have. So you end up with the staging area that ends up being much like a data lake, and then you've got the cloud data warehouse sitting on top of it. And so in my mind, you know, you've got this transition from ETL to ELT that focuses on bringing data together into that centralized storage layer, which is the staging area. And then you've got that warehouse part of the lakehouse that's given in the modern data stack. So in some ways, you're kind of building it without even intentionally building it.

And the challenge is to make that a more thoughtful architecture for a lot of modern data stack users. And then as you build out to scale, how do you really articulate that architecture and organizational structure for a larger scale as you acquire other companies, as you break out different business units, things like that. And that's where the mesh idea comes in, which is really decentralized architectures based on each domain doing what's best for themselves, but producing data products and really treating data as a first class product.
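The question raised above, whether business logic is applied on writing or on reading, can be illustrated with a small sketch. This is only a toy under stated assumptions: the record fields (`customer_id`, `amount`, `region`) and the helper names are hypothetical and do not come from the conversation or from any particular product.

```python
import json

# Hypothetical raw events as they might land in a staging area or lake.
RAW_EVENTS = [
    '{"customer_id": "42", "amount": "19.99", "region": "us-east"}',
    '{"customer_id": "7", "amount": "5.00"}',
]


def normalize(record):
    """The 'business logic': cast types and fill in a default region."""
    return {
        "customer_id": int(record["customer_id"]),
        "amount": float(record["amount"]),
        "region": record.get("region", "unknown"),
    }


def load_on_write(raw_lines):
    """Logic on write (warehouse style): curate once, store the curated rows."""
    return [normalize(json.loads(line)) for line in raw_lines]


def query_on_read(raw_lines):
    """Logic on read (lake style): keep raw lines, curate lazily per query."""
    for line in raw_lines:
        yield normalize(json.loads(line))


warehouse_table = load_on_write(RAW_EVENTS)   # curated copy is stored
lake_view = list(query_on_read(RAW_EVENTS))   # raw data stays as-is
```

Both paths produce the same rows here; the difference is when the cost of curation is paid and whether a second, curated copy of the data has to be stored and kept in sync.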

In terms of the adoption of the lakehouse, it's definitely been picking up speed in the past even year because of the growing maturity of the underlying technologies and capabilities that are offered, particularly with things like the Iceberg and Hudi and Delta formats to provide a more well engineered table structure on top of the underlying storage, where there isn't actually any real table to be spoken of. So getting some of that power of fully integrated database engines with things like time travel and, you know, MVCC and the ability to evolve schema in a more natural format.

And I'm wondering what you see as the pieces that have been missing to date that have made things like Snowflake and BigQuery and Redshift the default, de facto core elements of the modern data stack, and how the evolution of these lakehouse technologies is starting to maybe level the playing field and make that a more viable option as the core, central, default technology that an organization might orient around to be able to get the kind of combined benefits of cheap storage at scale and performant queries on this semi structured and structured data.

Yeah. I mean, I think you hit upon a lot of the recent developments that have really accelerated the growth of the lakehouse as a viable option. I think with the cloud data warehouse, I mean, in five minutes, you can have a query running. Right? And I think that user experience of going from zero to queries is really attractive, whereas it's taken a bit longer for data lakes to get up to that user experience. Right? Like, we now have Lake Formation. We have all these other great technologies that are allowing people to spin up lakes, but it's still not as clean and easy for just any data engineer to do this. Right? Like, I think, you know, I don't wanna say my mom could do it, but she probably could. Right? She could probably open a Snowflake account and get up and running. But I also think that there are a lot of technologies out there that are allowing you to get to the point where you can kind of set it and forget it with the infrastructure side of things, which is what really is the power of that cloud data warehouse.

And the challenge is that, unlike a lake, the cloud data warehouse gets very expensive very quickly. Right? Like, you know, my mom's spinning up a data warehouse. My poor mother, I'm picking on her. But if my mom were to spin up a cloud data warehouse, then she forgets about it, and then she gets a crazy bill because she forgot about it. Right? Whereas with the data lake, it's just storage. Right? It's really your storage and your compute as opposed to someone else's storage and their compute. Right? And so that startup cost for lake analytics, you know, has been higher than spinning up a cloud data warehouse just by virtue of the technology. But I would argue that by thinking outside the current strengths of the modern data stack, there's this whole new class of tools emerging that provide the speed to value and the user interface on top of the lake or the lakehouse. And so that would be things like Starburst Galaxy where, you know, you just point us to your storage, and then we handle the compute.

Going back to the higher level architectural principles of the data mesh and the modern data stack, picking a bit more on the modern data stack right now. As you mentioned, a lot of the core ideas that are being executed on it aren't really new at all, but they're being repackaged as this bright, shiny approach to how to build things because of the fact that we have these cloud technologies, and so that's what makes it modern. I'm wondering what are the core elements of it that will continue to have staying power, and what are the pieces of how people think about the, quote, unquote, modern data stack that are liable to shift in the next year, two years, five years' time because it's just the latest trend, and some of the areas of convergence that you see coming down the road for some of the currently disaggregated technologies, where there's opportunity for simplifying the experience, simplifying the technologies, and starting to combine them into not necessarily as fully integrated of a stack as we had with things like Informatica, but, you know, not as disaggregated as we have right now, where you have to have five accounts across five different vendors to be able to get what most people view as the modern data stack.

There's a lot to unpack there. So I'll start by talking about the modern data stack. And, you know, I do think that it really is legacy paradigms built with modern tools, which, I mean, there's nothing wrong with that. Right? But like you said, you end up with five different tools, which are built around this paradigm of moving your data away from the source. So the closer you can get to analyzing the data at the source, the better off you'll be.

And with modern cloud technologies, that's a reasonable expectation, right, because you do have scaling. Right? You have auto scaling. You have horizontal and vertical scaling. There's no reason you can't now query your data at the source, or as close to the source as you can get. And so I think the idea of building all these pipelines to centralize your data, hopefully, is a thing of the past, because I do think that is a legacy technology idea.

And so, really, the idea of query engines and data virtualization and data federation and query federation, I think that all comes into play here, because you do have this idea of, your data's already stored. Don't store it somewhere else, but instead, use a query engine to query it directly at the source. And then you can hook in whatever analytics tool you want downstream with all the, you know, modern networking and things like that. And so you really don't need to move data around as much as you used to. And so if you still do want that data lake capability, you can have that, but, again, still use that query engine rather than having to actually move all the data into a centralized storage platform. Right? And you can retain control over your data in a way that you couldn't in the past, which is great.

And then you had asked, you know, what other forces I think will have influence over the trajectory of these architectures?
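As a rough illustration of the "query it where it lives" idea described above, here is a minimal sketch that joins two separately stored databases through one query engine without copying either into a central store. It uses Python's built-in `sqlite3` with `ATTACH DATABASE` purely as a stand-in for a federated engine; it is not Trino or Starburst, and the database names (`crm`, `billing`) and tables are hypothetical.

```python
import sqlite3

# One connection plays the role of the query engine.
engine = sqlite3.connect(":memory:")

# Two independent databases stand in for two separate source systems.
engine.execute("ATTACH DATABASE ':memory:' AS crm")
engine.execute("ATTACH DATABASE ':memory:' AS billing")

# Each source owns its own table and data.
engine.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
engine.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
engine.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
engine.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A single SQL statement spans both sources; nothing is copied
# into a third, centralized table first.
totals = engine.execute(
    """
    SELECT c.name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

In a real federated setup the sources would be different systems (an object store, an operational database, and so on) reached through connectors, but the shape of the query, one statement addressing several catalogs, is the part this sketch is meant to show.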

Looking into my crystal ball, I do think that, you know, a key deciding factor is this idea of data as a product. And this is what's at the heart of the data mesh, but it's really something that, you know, there are so many blogs and podcasts and everything out there about data as a product now, because it's long been the case that data has been a side product or a byproduct of business, and we've been trying to, after the fact, treat it like a product. But we're getting closer to the source again. And so it's something that, you know, it's no longer a pet project where you hire a couple of data engineers and trust them to just handle it all. It's now a main product that you're creating.

And so in focusing your strategy around data, you need to be certain that your data is high fidelity, it's reliable, and it's produced with the consumers and the consumption in mind. So the treatment of data as a first class product in the business is really essential now, and that needs to happen at all levels. And it also needs to be part of the culture of your company if you wanna truly be data driven.

And so I think with the exponential growth of data, because the volume alone has, you know, been just absolutely bigger than anyone could have imagined, I think, you know, interesting features like separation of storage and compute, you know, are now essential. They're not optional. And so, you know, if you wanna be data driven, that doesn't come for free. And so folks have really embraced it at a high level, but when it comes to actually executing on making data a key part of culture and strategy, it takes more than good intentions. It takes training and tooling and cultural alignment. So I think that will inform the technologies going downstream over the next two to five years.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Monitoring data quality, tracing incidents and testing changes can be daunting and often takes hours to days or even weeks.

By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests.

DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production.

No more shipping and praying. You can now know exactly what will change in your database.

DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.

Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold.

What are the pieces of technology or the operating paradigms or, quote, unquote, best practices of today that will go the way of MapReduce five years ago as we continue to iterate into the future?

I mean, I'm not gonna say that cloud data warehouses are on their way out. Right? Because I don't think we're close enough to that. But I do think that people are starting to understand that the modern data stack is, like, a key set of five technologies that you put together. You know, it's this magical data stack that you can get value from immediately. You know, I think people are starting to question whether, you know, it's as simple as it can be, because there are all these different plug and play pieces that you need to put in to really make it viable at scale. And so I think folks are starting to think more about what are we gonna do in three years. Right? And what are we gonna do when we have so much data that the modern data stack with its cloud data warehouse is cost prohibitive. Right? And so I think people are starting to think about data as a product and how that works in that world, and how does the modern data stack either serve or not serve that use case.

In that sense of people starting to, you know, dig themselves deeper as they say, okay, well, I'll just use these de facto tools that everybody else is using because they say it's easy to get up and running. What are those missing pieces that they start running up against as they go further along in their journey and they start to say, okay, well, in my business unit, these five tools are great, but now I actually need to start expanding out to the entire organization or the entire enterprise, and, oh, shoot, this doesn't work anymore. Or, oh, shoot, I just spent $1,000,000,000 on my data warehouse.

Yeah. Absolutely. I mean, I think governance is a huge thing. Right? Like, governance, understanding lineage of data, understanding the quality of data. You know, there are all these data observability tools now, and I think that's a really fascinating field because the analytics are only as good as the quality of the data you're putting in. And that's especially true the more advanced you're getting in your analytics. I mean, data science, if you've ever done data science, you need just, like, absolutely massive quantities of incredibly reliable data. Right? And so the modern data stack isn't really intended for that use case in my mind so much as a data lake would be. Right? And so, again, it gets back to leaving the data closer to the source, so you're doing less to it. So it is more reliable.

And so I do think governance is a huge piece of that. I also think performance is a huge piece of that. Right? Like, your cloud data warehouses get really expensive if only because of storage, but the compute to get the performance that people require is just absolutely incredible. And so you end up really paying for that performance.

On the data lake and data lakehouse aspect of performance, I know that that's an area that has seen a lot of investment in tools such as Trino and in the work that you're doing at Starburst. I'm wondering what are some of those performance bottlenecks or the constraints or some of the ways that data teams need to think about the way that they're laying things out on disk, partitioning, which table format to use? Do I need to use Iceberg? Do I need to use Hudi? Do I need to use Delta? You know, maybe I need Hudi because I need streaming inserts, and so I wanna make sure that I have performant queries on newly written data. Like, what are some of those challenges that people are facing as they try to adopt some of the lakehouse technologies so that they can scale, but they also are looking for high performance? Or maybe it's just that, okay, I need to actually use my lakehouse for massive scale analytics, but for anything, you know, performant with low latency, I actually need to stick it into ClickHouse or Druid or whichever OLAP store, you know, is the flavor of the month.
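To make the point about on-disk layout and partitioning a little more concrete, here is a minimal sketch of partition pruning over Hive-style `key=value` directory names. The file paths and the `prune` helper are hypothetical, and this is only the path-based convention; table formats such as Iceberg track partition information in their own metadata rather than relying on directory names.

```python
# Hypothetical object-store listing using Hive-style key=value partition folders.
FILES = [
    "events/dt=2022-06-01/region=us/part-0.parquet",
    "events/dt=2022-06-01/region=eu/part-0.parquet",
    "events/dt=2022-06-02/region=us/part-0.parquet",
    "events/dt=2022-06-02/region=eu/part-0.parquet",
]


def parse_partitions(path):
    """Extract {'dt': ..., 'region': ...} from the key=value path segments."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(files, **wanted):
    """Keep only the files whose partition values match every requested filter."""
    return [
        f for f in files
        if all(parse_partitions(f).get(key) == value for key, value in wanted.items())
    ]


# A query filtered on dt only has to open half of the files.
one_day = prune(FILES, dt="2022-06-02")

# Filtering on both partition columns narrows it to a single file.
one_day_eu = prune(FILES, dt="2022-06-02", region="eu")
```

The performance question raised above largely comes down to this: a layout that matches the common query filters lets the engine skip most files before reading any data.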

Yeah.

Flavor of the month is a good way of putting it. Yeah. I mean, I think that there are still some

edges that need to be addressed in that lakehouse world. Right? It doesn't have a seamless experience yet, like I've mentioned. You know, users of cloud data warehouses,

they're used to that experience. And so this

is something where we wanna deliver the experience

within the lakehouse

ecosystem as well that you get at that cloud data warehouse. And so there are a lot of tools like DBT and Good Expectations that are delivering

cloud data warehouse like value in the lakehouse context as well, which helps it to get more mature

and helps it really to allow users

to think more about the business value and think less about the particular formats. Right? And I would also add that, arguably, the lakehouse is suffering from all of the associated issues with both the data lake and the data warehouse. Right? You get the best of both worlds, but you also can sometimes get the worst of both worlds. And so in the data lake, you have data that's difficult to access and understand, and you need context added to it. Whereas in the data warehouse,

you have issues around agility and enabling different data contexts. So,

you know, the way a marketing organization would consider a customer is gonna be very different than the way a risk team considers a customer. And so, you know, this is one reason why things like data mesh become so interesting because you start thinking of data as a product, and it's aligned cross functionally regardless of whether it comes from that warehouse, that lake, or a lakehouse.

And so as data teams are faced with these decisions of, okay, do I

build around the modern data stack, or do I just evolve my current system to add whatever missing capability there is, or, you know, do I need data mesh, or am I not at the scale where that makes sense? What are some of the ways that you have seen teams start to approach those decisions and questions and some of the discovery efforts that are needed to be able to make informed choices in this

constantly evolving and very confusing world of data that we're living in?

It's funny because it can be confusing, but it's also

a buffet of choice in a lot of ways. And so

as with any product development, you know, I, again, solidly think that we need to treat data as a product. And so I think domains or the people creating data products

can benefit from things like product planning and agile methodologies and creating MVPs just like any other product development organization would. And so finding the right technology partners

that allow folks to develop quickly and iterate and get value out of those first few data products in short order is a really great way to start proving value

in any transition. So that's the same as you would with any other

technology product. And I think there needs to be sort of a people aspect of it, which is why I think data mesh is really hitting home for people these days: its creator calls it a sociotechnical

paradigm. And, you know, I think the fact that people are included in that is really important. So a domain is, you know, a group of data creators with a clearly defined business purpose like finance or sales or manufacturing. And so

both within and across domains, you want to make sure that data product developers are using the same definition of what a data product is.

And then on the consumption side, you want a feedback mechanism

for data product consumers and developers to work together

on how to get data downstream to users. And so

while the modern data stack can be helpful in these cases for, like, spinning up

data environments for individual domains, you also need to think cross functionally about how that

becomes

a layer in which data products are produced and consumed.

Right? And so I think there's this interesting world where the modern data stack intersects with

data mesh. Right? And I think there's an interesting story that's coming out there that I do think is still being formed by the community.

You know, I have thoughts about it, obviously, with Starburst at the center. But, you know, I do think that there's interesting ways of thinking about all of these different paradigms at different scales and where they intersect.

And so I think you need to define your strategy and focus on the business value. Right? Like and sort of iterate on all of these things at once. So, you know, whether it's a lake house or a warehouse or a data lake or, you know, the modern data stack and how that intersects with all of this. You know, I think you need to focus on

shortening and streamlining the path from the data to the value that it creates.

One of the complexities that comes up as you are starting to go down this path of data as a product is the need to bring

application teams along for the journey.

And a lot of times,

their

incentives aren't necessarily aligned with that of the data team in terms of being able to actually make this a reality because their goal is I need to ship features for the widget because people are asking for the widget to be blue instead of green.

I don't have time to figure out all of the things that go along with making sure that the data that I'm generating in that process actually maps to the concepts and the business objects that need to be exposed in this, you know, data interface for other people to be able to consume because that's not something that I'm ever going to be using, and it's not something that this person who,

you know, goes to the website to interact with the blue widget is ever going to care about. So how do we think about

updating the incentive structures to make sure that application development teams and data teams and business teams are all aligned

in that process of both producing these applications that are actually driving the business,

but also generating the data products that are necessary

internally

and for feeding back into those applications

to kind of keep everybody

moving along and

aligned in terms of how to think about things,

the sort of data modeling principles that are necessary to make these performant and

understandable, and just all of the education that goes along with making this a reality.

Yeah. I mean, I think you absolutely hit the nail on the head with two points there. One is incentivization, and one is education. Right? I don't think there's a recipe you can follow necessarily, but I do think it depends on your overall data strategy. But if you're thinking of data as a product, then the product development teams need to understand that that is one of their deliverables.

Right? And it may slow down widget production for a bit, right, while they get their feet under them and learn how to do

some basic data engineering. And the other option is to take data engineers and put them on the product development team so that they're working alongside the widget developers.

But I do think that it needs to be part of the overall

corporate strategy as opposed to being a separate data strategy that is completely independent

from a product development strategy.

And so I think that's what we mean when we say that data is a product. Right?

Like, data is something that that team is now responsible for that they weren't before, and

you have to frame it well. Right? There's change management that comes in here, and you have to frame it as, hey. We're gonna teach you new skills, and you're gonna learn new things, and this is something you put on your resume. And, you know, I think that there needs to be product ownership and product management around data the same way there would be around anything else. And so, you know, and there's a product life cycle too. So data can expire

or it can be phased out or it can be versioned and all of these good things. But, you know, I think we need to take it that step further and really say it's not just that the developers are now responsible for feeding a pipeline. It's more that the developers are the ones who understand the data. Right? Like, they are the subject matter experts here, and so they should be responsible for it as opposed to

throwing it over a fence to another team, that centralized

mythical data team that's an expert in all data across the enterprise because,

I mean, that never works. Right?

So, I mean, I've worked at companies that I will not name, but I've owned a central data function. And, you know, I've had engineers be like, oh, we deleted all our data because we figured we were putting it in the warehouse. Why would we back it up? And I'm like, that's not how this works. Right?

Like, you have to care about your data. And they're like, but we don't. And, you know, it's your problem. So, you know, there has to be, like, that cultural shift where, you know, the engineers

understand that they're producing this data, and it is one of the products that they're responsible for.

Yeah. And

the metaphor of throwing it over the fence, in some cases, isn't even actually apt because a lot of the times, there isn't even really any

intentionality

in the application team of handing off the data. It's just it's in the database. Good luck.

Or, like, I dumped it in an S3 bucket. Oh, you need to know what it looks like? That's a you problem. Right?

Exactly.

I can tell you and I have both been in that situation. But, yeah, I think it's something that, you know, needs to be driven from the top down and the bottom up in different ways. But I think it's education and intentionality, and not,

you know, a top level corporate strategy of we're data driven and then nothing under it. Right?

Absolutely. And from the developer tooling perspective, this is a subject that has been coming up a lot:

if you're building

a regular web application, whether it's using something like Spring in the Java world or Django in Python or Rails with Ruby,

a lot of the way that you interact with the data is through this abstraction of the ORM where the database is there, but you don't think about it as the database. You just think about it as this is where the objects go until I need them again.

And so your primary interface is through the code, and so a lot of times that means that if you're just looking at the database tables and the structures there, it can seem very disjointed and chaotic as to why are all these tables named this way, or why do I have five tables for this one concept?

Because a lot of times, application teams aren't thinking about the data modeling from

a database engineering perspective. They're just thinking about it from an object interaction

perspective. And I think that that's also

where a lot of this confusion comes in for data teams who are trying to then reverse engineer meaning out of these database tables that they're replicating into the warehouse or into the lake. And I'm wondering what you see

as some of the opportunity

for

either injecting

a new abstraction layer

alongside the ORM or in tandem with the ORM or inside the ORM to be able to

build up these

domain objects and these business objects that are actually semantically meaningful

for building these data products so that you don't have to do as much reverse engineering from the database layer or so that you can

provide a more natural API from the application for doing some of this data extraction

in a semantic way instead of just in a very mechanical way that then requires a bunch of extra processing steps down the partnership. Right? Like, this is a product.

It is a downstream thing that

the

data creators are responsible for. So they have to start thinking about modeling in that way. And so I do think that there's product management that needs to come in. Right? And whether it's an actual product manager or if it's,

you know, some other person who's involved like a data product owner, you know, that's something I see being bandied about a lot these days, that role. But it's the idea that when products are being designed,

design the data as well. Right? Like, get out in front of it rather than being more reactive.

But, also, when you're designing it and you're thinking about downstream users for the product, also think about the downstream users for the data.

And so,

again,

it's absolutely

additional scope. Don't get me wrong.

And it will slow down that product design phase. But that said,

you're saving money on the back end because you're no longer

having to retrofit things. Right? Like, you're no longer having to say, oh, this thing is this hideous JSON. Let's figure out how to get it into a table that can be used by Tableau or something. Right? So, like, you're actually giving,

you know, more thought upfront to save yourself time on the back end.

And I do think that

that is a different muscle to flex for engineers. Right? Like, that is not something that they're used to. They're used to sort of saying, oh, well,

you know, my deliverable is the widget. Right? That is what I care about, and that's what we've designed. And, you know, later on, we can futz with it to make sure that the data is a little better. But,

you know, instead of doing that, why not be intentional about both things from the get go?

Tired of deploying bad data?

Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes.

Build powerful workflows that connect your entire data stack end to end with a mix of your code and their open source low code templates.

Once launched, Shipyard makes data observability

easy with logging, alerting, and retries that will catch errors before your business team does.

So whether you're ingesting data from an API,

transforming it with dbt,

updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles

shipyard today to get started automating with their free developer plan.

Putting you on the spot a little bit in terms of some of the kind of architecture design aspects

when you are starting to

move towards data mesh where you say, okay, I want to have this domain data product.

What are some of the guiding principles that you have seen useful in understanding what level of granularity is applicable where, you know, is the data product just all of the data that pertains to this application where, you know, the data product is the data that lives in the database for this 1 web app? Is the data product this aggregation of web apps that all pertain to a given business unit within the organization? Like, what are some of the guiding principles that you have seen be helpful in figuring out

what is that appropriate

domain boundary for me to then build this sort of mesh interface on top of?

Yeah. Absolutely. And I mean, I think at Starburst, we've seen people come at it from different angles, and I do think that as long as you're consistent, you can kind of do it however works for your organization. Because a lot of this is sort of the idea of having a contract

with the consumers

and the developers who are creating the data products to say,

this is what we're building. And a lot of it is just metadata. Right? Saying, like, here's the context for this thing

and allowing it to be more self serve on the consumption side. Right? But in reality, you know, there's no such thing as self serve.

And so I do think that, you know, we have a data products interface where we allow people to just use SQL to create data products so they don't have to learn new technology, and they present it. And it's either a view or a materialized view that they can, therefore, consume downstream.

And we're building that view, so you're, again, getting closer to the source of the data. So I do think that providing something that has less lift for the engineers creating the data products is really important.
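
[Editor's sketch] The idea described above, a data product defined in SQL and exposed as a view, can be illustrated roughly as follows. This is not Starburst's data products interface: Python's built-in sqlite3 module stands in for the query engine so the example runs anywhere, and every table, column, and view name is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the query engine (an assumption made so the
# sketch is runnable); all table, column, and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);

    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 75.0), (3, 11, 40.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');

    -- The "data product": a named view defined over the source tables.
    CREATE VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS total_revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region;
""")

# Consumers query the view with plain SQL and never touch the source tables.
rows = conn.execute(
    "SELECT region, total_revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)
```

The producing team only writes the SELECT statement, and the consuming team only needs to know the view's name and columns, which is the "less lift" on both sides that the answer describes.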

On the downstream side, making it so that the consumers are using something that's familiar to them like SQL, is really important as well. But I do think that when you're talking about,

you know, the scope,

like I said, the consistency is the key. And then on top of that, you want to make sure that

you've provided all the information that a downstream user needs to be independent. You want to make sure that these

are, you know, describable and accessible and well governed and, you know, clean and reliable and all that good stuff. But if you do that,

then you will necessarily have considered the downstream use case as well.

So

I was talking to someone at a client who had 38 domains within 1 BU. Right? And that made sense for them. Right? And then I've talked to customers who are smaller companies, and they have 4 domains.

Right? And so I think it just depends on, like, how your business organization is structured as well, how you think about breaking down your business into individual units.

As the

data lake gains steam and gains polish, what are some of the elements that you think are necessary

that are still being developed to

bring it to the kind of level of accessibility

that

the data warehouse vendors currently offer? What are some of the areas of investment that are happening that people should be keeping an eye on as they are starting to make these decisions of, you know, which technology stack to use, which vendors to go with, which architectural paradigms are going to make sense for them. Do I want a data lake, a lake house, a warehouse?

Do I just want everything to be in my Kafka topics?

All of the above. Yeah. And I think people do get paralyzed by choice these days in some ways just because there is a ton of choice and, you know, everybody says, oh, you've got the modern data stack. You've got data lake. You've got lake houses. You've got data mesh. And it's like, you know, you can become an expert on

precisely 0 of these things just because they're such broad topics. Right? And so I think

focusing on the business use cases is,

you know, job number one here, really understanding, like, is your job to get real time analytics, or is your job to provide

downstream analytics that your customers are gonna consume and that you're gonna actually sell your data in the long run. Right? So, like, you have to really understand what is your use case and then work back from there.

I think that finding flexible technologies that really focus on performance

and scalability

and

simplicity

are really important. You wanna avoid getting locked into one vendor because 5 years from now, things are gonna be completely different. Right? They were different 5 years ago. They're gonna be different 5 years from now. That's one thing I will say for sure. Right? And so I think that having that flexibility and avoiding getting locked into a specific technology is really key. And so I think the lakehouse is, you know, an effect of that. Right? In that people went all in on a lake or went all in on a data warehouse, and then the lakehouse allows you to sort of inch away from whatever you chose and sort of get the benefits of the additional architecture.

And so I think there's sort of these strategies, these huge strategic things like data mesh where it's a journey and you'll never kinda get there. And then there's things that you could spin up today like Modern Data Stack. Right? And then there's a whole world in between. So you have to sort of figure out

where do you see yourself on that maturity spectrum, and then what are your business goals? And then

sort of drill down from there into, you know, maybe that means you need an S3 data lake. Right? Maybe it means that you already have everything in Parquet and you're good to go. Or maybe it means that everything's in Excel

and you have to work back and really start from the beginning there. Not to knock on Excel, but, you know, because I feel like that's a whole episode in itself.

And

as you

look toward

the sort of continuing evolution of this landscape, what do you think are going to be the

major shaping forces over the next 2 to 5 years that push the architectural trends from where they are now to wherever they are going to?

I do think that data as a product is one of those trends. And

like we were saying,

I think allowing

non data engineers

to

produce data products

is going to be key. And I think there's a few different factors there. One is hiring. It is really hard to hire people, and so you want people and technologies that can be flexible.

And,

you know, data reliability and governance is key. I mean, governance is always a thing. Right? Everybody's been doing governance forever because it's one of the hardest problems we have. But I do think that that is going to influence our trajectory because we've got all these new and exciting

regulatory compliance

initiatives that we need to handle over the years. And so, you know, now it's GDPR. Who knows where it will be 5 years from now?

And then, you know, I think the technology and, you know, the cloud evolution has been really fascinating. Right? Like, 10 years ago, it wasn't where it is now, and so it's gonna continue evolving. And,

you know, quantum computing will be a really interesting play here too. Like, I think there's a lot to be done for performance. But, you know, I think people want answers now. They don't wanna be spending, you know, 6 months spinning up a data stack. They want their answer now,

and then they wanna know that they can rely on that answer. And then they need to figure out their longer term strategy from there. So I do think that governance and speed are really two key pieces here.

As you have been working at Starburst with your customers, what are some of the most interesting or innovative or unexpected ways that you have seen this sort of data virtualization,

data lakehouse

technology, however you wanna phrase it, being used

particularly in these contexts of the modern data stack and data mesh?

Yeah. Absolutely. I mean, on the modern data stack side of things, obviously, we have Starburst Galaxy now, which is our completely managed and hosted Trino as a service sort of

platform. And so you really can spin up a Starburst environment incredibly quickly. Right? You just need to create a quick account, point us to your data, and you're good to go.

I do love seeing how our customers are spinning up really

exciting analytics very quickly with Starburst Galaxy, which is really fun. I love talking to some of our enterprise customers too because they've just done some really interesting things. We have one digital customer, Comcast,

that built out a lakehouse that handles all their streaming and traditional

structured data. And they built it using traditional

data modeling. And they provide this self-service data repository

for all their different departments, and each department spins up their own cluster with their own technology,

and then they can query the data however they want. So, you know, it's interesting seeing how this works at that kind of scale.

And we also see a lot of organizations that are doing both analytics and machine learning.

And in that case, the lakehouse really fits well because

the warehouse side of things

is serving the analytics use cases, the BI tools and reporting and things like that, whereas the lake is serving the ML case.

Right? And I think that's probably,

you know, one of those best of both worlds situations. And then you've also got things like time travel. Right?

Time travel is amazing, and you've got that capability to see how your data was on some arbitrary date in the past, which is useful for debugging and compliance and BCDR and all that good stuff. Yeah. People are doing some cool stuff.
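
[Editor's sketch] As a rough illustration of the time travel idea mentioned above, the toy class below keeps a timestamped snapshot for every commit and answers "what did the table look like as of this date?" It is a simplified model built on stated assumptions, not how any particular table format implements the feature: real lakehouse table formats track metadata about which data files belong to each snapshot rather than copying rows, and the `SnapshotTable` class and its sample rows are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotTable:
    """Toy append-only table: every commit stores a full snapshot of the rows."""

    def __init__(self):
        self._times = []      # commit timestamps, kept in ascending order
        self._snapshots = []  # table contents as of each commit

    def commit(self, when, rows):
        """Record the table contents as of `when`."""
        if self._times and when <= self._times[-1]:
            raise ValueError("commits must be in increasing time order")
        self._times.append(when)
        self._snapshots.append(list(rows))

    def as_of(self, when):
        """Return the rows from the latest commit at or before `when`."""
        i = bisect_right(self._times, when)
        if i == 0:
            raise LookupError("no snapshot exists at or before that time")
        return list(self._snapshots[i - 1])


table = SnapshotTable()
table.commit(datetime(2022, 1, 1), [("widget", "green")])
table.commit(datetime(2022, 3, 1), [("widget", "blue")])

# Query the table as it was on an arbitrary past date, and as it is now.
past = table.as_of(datetime(2022, 2, 15))
current = table.as_of(datetime(2022, 6, 1))
print(past, current)
```

Because old snapshots are never overwritten, a debugging or compliance question about a past date becomes a lookup rather than a restore from backup.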

In your experience of working in this ecosystem

and

exploring and

understanding and helping your customers come to terms with these architectural patterns, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

It's really fun being on the cutting edge of data. Right? Like, it's also incredibly

challenging.

And I think, for me, the key is just realizing that change is hard for people. Right? And whether it's because we're in a remote world and there's a pandemic and all of our data

is no longer applicable or it's changed the way we interpret it, or just trying to change the way people think about data. You know? People are comfortable with what they know, and they don't like the unknown. I mean, it's human nature. I have a toddler. I get it. But I do think we're coming up with some really new and innovative ways to think about data management

and data access and data processing and then how all of those intersect.

Right?

That's the complex problem that I think we're all really trying to answer.

You know, convincing folks to think beyond these old paradigms like the modern data stack. Right? It's a fun challenge that,

you know, I get to think about it a lot at Starburst. So it's kind of fun seeing how people are innovating to answer all of these questions.

And for people who are starting to explore

the modern data stack and data mesh, what are the cases where the lakehouse paradigm is the wrong choice and they are better suited going in one direction or the other of the lake or the warehouse?

Yeah. I mean, I think it gets back to what are the business questions you're trying to answer. Right? If you're just trying to do, like, straight reporting and BI tooling, like,

you know, maybe a warehouse is right for you. If you don't have huge data volumes, a warehouse can work really well. Or if you don't care about performance as much. Right? Or if you don't have a huge budget, maybe on the other side, like, a data lake is better. So, you know, if you've got all of your data in the lake, consider why you need that lake house. What's the business purpose? And

I would argue that with more modern technology, the storage layer is actually the key. And the business logic layer on top of that that you'd apply either within the CDW

or within some sort of query engine, that's where the business logic gets applied. Right? And so

don't lock yourself into an expensive vendor.

I mean, all of these vendors, what they're really doing is it's like, you know, it's object storage plus the SQL access layer. So you kind of have the object storage already in the lake. So I would recommend focusing on the business driver and streamlining that technology stack. So using something like a query engine on top of a lake, that's really what the cloud data warehouses are doing anyway, and it's becoming more common to handle this in house because you do have these cool tools like Starburst that can help you do that.

As you continue to iterate on the Starburst technology platform and Starburst Galaxy in particular, what are the things you have planned for the near to medium term or any of the pieces of integration

that are

missing or necessary or upcoming that will help to make it a more

equal citizen in the modern data stack with some of these data warehouse platforms?

First and foremost, obviously, Starburst Galaxy is really taking off. It allows us to bring that power of Starburst to

users and really get them up and running with the power of Trino in virtually no time at all for setup, and there's no infrastructure to worry about. It's already fully managed and hosted. So I think Starburst Galaxy really puts us firmly in that modern data stack

category.

Also, within both Galaxy and Enterprise, which is what I focus on, we have this built in access control system coupled with, you know, data products functionality, and it makes us a really excellent partner for enterprises on their data mesh journey, which is where I think a lot of people will end up. So if you can, like, get started on that journey sooner, I think it's really important.

And then the big news from last week is that Starburst just acquired Varada, which is a performance accelerator for the Starburst ecosystem, and it's been great because we have a new office in Israel. We have fantastic new engineers who have just joined us who really understand the power of Trino and what we're doing in the marketplace.

And beyond that, you know, the acceleration of our

already best in class query speed that we see in Starburst is just, I think it's gonna blow people's minds. I'm really excited about that.

Alright. Well, for anybody who wants to follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I do think that there is a gap

in that inflection point I discussed of what happens when

you go from small to enterprise.

Right? You know, you can have a modern data stack

if you're small, and that works really well. But then once you've got multiple business units, you've got multiple modern data stacks,

how do you get from there

to something really mature like a data mesh? Right? It's that sort of adolescent phase or the teenage phase of the data management story. And I'm really

fascinated by that, and I think that

it hasn't been addressed yet, as far as I know. And I'm really curious

to see

there's a lot of start ups out there, and there's a lot of enterprises out there, and I think that sort of inflection point is a really interesting area to study. So, you know, I like sticky problems, and that's a really cool sticky problem I hope to tackle soon.

Thank you very much for taking the time today to join me and share your thoughts on the current state of data architectures and the technology forces that are helping to shape them. It's definitely a very

interesting, constantly evolving, and hard to keep track of area. So I appreciate all of your time and energy in helping us explore some of those patterns and paradigms and how to think about them. Appreciate that, and I hope you enjoy the rest of your day.

Thank you so much for having me. It's been really fun.

Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com

with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
